Re: spark streaming actor receiver doesn't play well with kryoserializer

2014-07-31 Thread Prashant Sharma
This looks like a bug to me. This happens because we serialize the code that starts the receiver and send it across. And since we have not registered the classes of the akka library, it does not work. I have not tried it myself, but maybe including something like chill-akka (
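A minimal sketch of the kind of workaround being suggested, assuming a hypothetical registrator class (com.example.AkkaRegistrator) that would register the Akka classes with Kryo, e.g. by delegating to chill-akka:

    import org.apache.spark.SparkConf

    // Sketch only: AkkaRegistrator is a hypothetical KryoRegistrator that
    // would register Akka's classes (for instance via chill-akka).
    val conf = new SparkConf()
      .setAppName("ActorReceiverApp")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.AkkaRegistrator")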

java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]

2014-07-31 Thread Bin
Hi All, The data size of my task is about 30 MB. It runs smoothly in local mode. However, when I submit it to the cluster, it throws the titled error (please see below for the complete output). Actually, my output is almost the same with

Re: Ports required for running spark

2014-07-31 Thread Konstantin Kudryavtsev
Hi Haiyang, you are right, YARN takes over the resource management, but I constantly get a ConnectionRefused exception on the mentioned port. So I suppose some Spark-internal communication is done via this port... but I don't know what exactly, or how I can change it... Thank you, Konstantin

configuration needed to run twitter(25GB) dataset

2014-07-31 Thread Jiaxin Shi
We have a 6-node cluster; each node has 64GB of memory. Here is the command: ./bin/spark-submit --class org.apache.spark.examples.graphx.LiveJournalPageRank examples/target/scala-2.10/spark-examples-1.0.1-hadoop1.0.4.jar hdfs://dataset/twitter --tol=0.01 --numEPart=144 --numIter=10 But it ran out

Re: Ports required for running spark

2014-07-31 Thread Haiyang Fu
Hi Konstantin, Could you please post your first container's stderr log here, which is always the AM log? As far as I know, ports except 8020, 8080, 8081, 50070, 50071 are all random socket ports determined by each job, so 33007 may just be a temporary port for data transfer. The deeper reason for

How to share a NonSerializable variable among tasks in the same worker node?

2014-07-31 Thread Fengyun RAO
As shown here: 2 - Why Is My Spark Job so Slow and Only Using a Single Thread? http://engineering.sharethrough.com/blog/2013/09/13/top-3-troubleshooting-tips-to-keep-you-sparking/ object JSONParser { def parse(raw: String): String = ... } object MyFirstSparkJob { def
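The pattern that post describes, reconstructed as a runnable sketch (SomeNonSerializableParser stands in for whatever class cannot be serialized): a Scala object is initialized lazily, once per JVM, so every task on a worker shares one parser instance and nothing non-serializable is shipped from the driver.

    import org.apache.spark.{SparkConf, SparkContext}

    object JSONParser {
      // Initialized lazily, once per worker JVM; never serialized or shipped.
      private lazy val parser = new SomeNonSerializableParser()
      def parse(raw: String): String = parser.parse(raw)
    }

    object MyFirstSparkJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("MyFirstSparkJob"))
        sc.textFile("hdfs:///input/events")
          .map(JSONParser.parse)   // each task uses the worker-local singleton
          .saveAsTextFile("hdfs:///output/parsed")
        sc.stop()
      }
    }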

java.lang.OutOfMemoryError: Java heap space

2014-07-31 Thread Sameer Tilak
Hi everyone, I have the following configuration. I am currently running my app in local mode. val conf = new SparkConf().setMaster("local[2]").setAppName("ApproxStrMatch").set("spark.executor.memory", "3g").set("spark.storage.memoryFraction", "0.1") I am getting the following error. I tried setting up

Re: Shark/Spark running on EC2 can read from S3 bucket but cannot write to it - Wrong FS

2014-07-31 Thread William Cox
I am running Spark 0.9.1 and Shark 0.9.1. Sorry I didn't include that. On Thu, Jul 31, 2014 at 9:50 AM, William Cox william@distilnetworks.com wrote: *The Shark-specific group appears to be in moderation pause, so I'm asking here.* I'm running Shark/Spark on EC2. I am using Shark to

Re: the EC2 setup script often will not allow me to SSH into my machines. Ideas?

2014-07-31 Thread William Cox
Ah, thanks for the help! That worked great. On Wed, Jul 30, 2014 at 10:31 AM, Zongheng Yang zonghen...@gmail.com wrote: To add to this: for this many (>= 20) machines I usually use at least --wait 600. On Wed, Jul 30, 2014 at 9:10 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
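For reference, a sketch of where that flag sits in a spark-ec2 launch command (key pair, identity file, slave count, and cluster name are placeholders):

    ./spark-ec2 -k my-keypair -i my-key.pem -s 20 --wait 600 launch my-cluster

The --wait value is how many seconds the script waits for instances to come up before trying to SSH in, so larger clusters generally need larger values.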

Inconsistent Spark SQL behavior when column names contain dots

2014-07-31 Thread Budde, Adam
I'm working with a dataset where each row is stored as a single-line flat JSON object. I want to leverage Spark SQL to run relational queries on this data. Many of the object keys in this dataset have dots in them, e.g.: { "key.number1": "value1", "key.number2": "value2" … } I can successfully

store spark streaming dstream in hdfs or cassandra

2014-07-31 Thread salemi
Hi, I was wondering what the best way is to store off DStreams in HDFS or Cassandra. Could somebody provide an example? Thanks, Ali -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/store-spark-streaming-dstream-in-hdfs-or-cassandra-tp11064.html Sent from

Re: Inconsistent Spark SQL behavior when column names contain dots

2014-07-31 Thread Budde, Adam
I still see the same "Unresolved attributes" error when using hql + backticks. Here's a code snippet that replicates this behavior: val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) val sampleRDD = sc.parallelize(Array("""{"key.one": "value1", "key.two": "value2"}""")) val sampleTable =

Re: Inconsistent Spark SQL behavior when column names contain dots

2014-07-31 Thread Yin Huai
I have created https://issues.apache.org/jira/browse/SPARK-2775 to track it. On Thu, Jul 31, 2014 at 11:47 AM, Budde, Adam bu...@amazon.com wrote: I still see the same “Unresolved attributes” error when using hql + backticks. Here’s a code snippet that replicates this behavior: val

Re: Number of partitions and Number of concurrent tasks

2014-07-31 Thread Darin McBeath
Ok, I set the number of Spark worker instances to 2 (below is my startup command). But this essentially had the effect of increasing my number of workers from 3 to 6 (which was good), while it also reduced my number of cores per worker from 8 to 4 (which was not so good). In the end, I would
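For comparison, a sketch of the standalone-mode settings that interact here, in conf/spark-env.sh (values are illustrative):

    # conf/spark-env.sh on each node
    SPARK_WORKER_INSTANCES=2   # worker JVMs per node
    SPARK_WORKER_CORES=8       # cores each worker JVM offers
    SPARK_WORKER_MEMORY=8g     # memory each worker JVM offers

Note that the total cores a node advertises is SPARK_WORKER_INSTANCES * SPARK_WORKER_CORES, so the two usually need adjusting together.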

Re: store spark streaming dstream in hdfs or cassandra

2014-07-31 Thread Hari Shreedharan
Off the top of my head, you can use the ForEachDStream to which you pass in the code that writes to Hadoop, and then register that as an output stream, so the function you pass in is periodically executed and causes the data to be written to HDFS. If you are ok with the data being in text
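A minimal sketch of both routes (the dstream name and paths are placeholders): saveAsTextFiles is the built-in text-output operation, while foreachRDD registers a custom output operation that runs on every batch:

    // Built-in: one output directory per batch, named prefix-<batch time>.
    dstream.saveAsTextFiles("hdfs:///user/ali/stream-out")

    // Custom: full control over how each batch RDD is written.
    dstream.foreachRDD { (rdd, time) =>
      rdd.saveAsTextFile("hdfs:///user/ali/stream-out-" + time.milliseconds)
    }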

SchemaRDD select expression

2014-07-31 Thread buntu
I'm looking to write a select statement to get a distinct count of userId grouped by the keyword column on a Parquet-file SchemaRDD; the SQL equivalent of: SELECT keyword, count(distinct(userId)) FROM table GROUP BY keyword. How do I write it using the chained select().groupBy() operations? Thanks! --

Re: spark.shuffle.consolidateFiles seems not working

2014-07-31 Thread Jianshi Huang
I got the number from the Hadoop admin. It's 1M actually. I suspect the consolidation didn't work as expected? Any other reason? On Thu, Jul 31, 2014 at 11:01 AM, Shao, Saisai saisai.s...@intel.com wrote: I don’t think it’s a bug of consolidated shuffle, it’s a Linux configuration problem.
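For reference, the setting under discussion; when enabled it merges per-map-task shuffle files into per-core file groups, which is what should keep the open-file count below the ulimit:

    // a sketch; property name as in Spark of this era
    val conf = new SparkConf().set("spark.shuffle.consolidateFiles", "true")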

RDD operation examples with data?

2014-07-31 Thread Chris Curtin
Hi, I'm learning Spark and I am confused about when to use the many different operations on RDDs. Does anyone have any examples which show sample inputs and resulting outputs for the various RDD operations, and, if the operation takes a Function, a simple example of the code? For example,

Re: How do you debug a PythonException?

2014-07-31 Thread Nicholas Chammas
So if I try this again but in the Scala shell (as opposed to the Python one), this is what I get: scala> val a = sc.textFile("s3n://some-path/*.json", minPartitions=sc.defaultParallelism * 3).cache() a: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at console:12 scala

Re: RDD operation examples with data?

2014-07-31 Thread Jacob Eisinger
I would check out the source examples on Spark's Github: https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples And, Zhen He put together a great web page with summaries and examples of each function:

Re: SchemaRDD select expression

2014-07-31 Thread Zongheng Yang
countDistinct is recently added and is in 1.0.2. If you are using that or the master branch, you could try something like: r.select('keyword, countDistinct('userId)).groupBy('keyword) On Thu, Jul 31, 2014 at 12:27 PM, buntu buntu...@gmail.com wrote: I'm looking to write a select statement

Re: store spark streaming dstream in hdfs or cassandra

2014-07-31 Thread Gerard Maas
To read from and write to Cassandra, I recommend using the Spark-Cassandra connector at [1]. With it, saving a Spark Streaming RDD to Cassandra is fairly easy: sparkConfig.set(CassandraConnectionHost, cassandraHost) val sc = new SparkContext(sparkConfig) val ssc = new StreamingContext(sc,
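A minimal sketch of the save path once the contexts are set up (the keyspace, table, and wordCounts DStream are placeholders):

    import com.datastax.spark.connector._

    // Write each micro-batch of (word, count) pairs to a Cassandra table.
    wordCounts.foreachRDD { rdd =>
      rdd.saveToCassandra("my_keyspace", "word_counts",
        SomeColumns("word", "count"))
    }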

Re: How do you debug a PythonException?

2014-07-31 Thread Nicholas Chammas
Davies, That was it. Removing the call to cache() let the job run successfully, but this challenges my understanding of how Spark handles caching data. I thought it was safe to cache data sets larger than the cluster could hold in memory. What Spark would do is cache as much as it could and
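For context, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), which drops partitions that don't fit and recomputes them later rather than spilling them. A sketch of the spill-to-disk alternative:

    import org.apache.spark.storage.StorageLevel

    // MEMORY_AND_DISK spills partitions that don't fit in memory to local
    // disk instead of dropping them and recomputing from the source.
    val a = sc.textFile("s3n://some-path/*.json")
      .persist(StorageLevel.MEMORY_AND_DISK)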

Re: SchemaRDD select expression

2014-07-31 Thread Zongheng Yang
Looking at what this patch [1] has to do to achieve it, I am not sure if you can do the same thing in 1.0.0 using DSL only. Just curious, why don't you use the hql() / sql() methods and pass a query string in? [1] https://github.com/apache/spark/pull/1211/files On Thu, Jul 31, 2014 at 2:20 PM,
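Following that suggestion, a minimal sketch of the query-string route, assuming the Parquet-backed SchemaRDD has been registered under a hypothetical table name, events:

    import sqlContext._   // brings sql() and the implicits into scope

    parquetRDD.registerAsTable("events")
    val counts = sql(
      "SELECT keyword, COUNT(DISTINCT userId) FROM events GROUP BY keyword")
    counts.collect().foreach(println)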

Re: SchemaRDD select expression

2014-07-31 Thread Buntu Dev
I was not sure whether calling registerAsTable() and then querying against that table has an additional performance impact, and whether the DSL eliminates it. On Thu, Jul 31, 2014 at 2:33 PM, Zongheng Yang zonghen...@gmail.com wrote: Looking at what this patch [1] has to do to achieve it, I am not sure if you can do

Re: SchemaRDD select expression

2014-07-31 Thread Michael Armbrust
The performance should be the same using the DSL or SQL strings. On Thu, Jul 31, 2014 at 2:36 PM, Buntu Dev buntu...@gmail.com wrote: I was not sure if registerAsTable() and then query against that table have additional performance impact and if DSL eliminates that. On Thu, Jul 31, 2014 at

Re: SchemaRDD select expression

2014-07-31 Thread Buntu Dev
Thanks Michael for confirming! On Thu, Jul 31, 2014 at 2:43 PM, Michael Armbrust mich...@databricks.com wrote: The performance should be the same using the DSL or SQL strings. On Thu, Jul 31, 2014 at 2:36 PM, Buntu Dev buntu...@gmail.com wrote: I was not sure if registerAsTable() and then

Re: Installing Spark 0.9.1 on EMR Cluster

2014-07-31 Thread chaitu reddy
Hi Rahul, I am not sure about bootstrapping while creating, but we downloaded the tarball, extracted and configured it accordingly, and it worked fine. I believe you would want to write a custom script which does all these things and add it as a bootstrap action. Thanks, Sai On Jul 31, 2014

Issue with Spark on EC2 using spark-ec2 script

2014-07-31 Thread Ryan Tabora
Hey all, I was able to spin up a cluster, but when I try to submit a simple jar via spark-submit it fails to run. I am trying to run the simple Standalone Application from the quickstart. Oddly enough, I could get another application running through the spark-shell. What am I doing wrong

makeLinkRDDs in MLlib ALS

2014-07-31 Thread alwayforver
It seems to me that the way makeLinkRDDs works is by taking advantage of the fact that partition IDs happen to coincide with what we get from userPartitioner, since the HashPartitioner in *val grouped = ratingsByUserBlock.partitionBy(new HashPartitioner(numUserBlocks))* is actually preserving

Spark job finishes then command shell is blocked/hangs?

2014-07-31 Thread bumble123
Hi, My spark job finishes with this output: 14/07/31 16:33:25 INFO SparkContext: Job finished: count at RetrieveData.scala:18, took 0.013189 s However, the command line doesn't go back to normal and instead just hangs. This is my first time running a spark job - is this normal? If not, how do I

Re: Installing Spark 0.9.1 on EMR Cluster

2014-07-31 Thread nit
Have you tried flag --spark-version of spark-ec2 ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Installing-Spark-0-9-1-on-EMR-Cluster-tp11084p11096.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Example standalone app error!

2014-07-31 Thread Tathagata Das
When are you getting the error? When the SparkContext is created, or when it is being shut down? If this error is being thrown when the SparkContext is created, then one possible reason may be conflicting versions of Akka. Spark depends on a version of Akka which is different from that of Scala,

Re: spark.scheduler.pool seems not working in spark streaming

2014-07-31 Thread Tathagata Das
Whoa! That worked! I was half afraid it wouldn't, since I hadn't tried it myself. TD On Wed, Jul 30, 2014 at 8:32 PM, liuwei stupi...@126.com wrote: Hi, Tathagata Das: I followed your advice and solved this problem, thank you very much! On Jul 31, 2014, at 3:07 AM, Tathagata Das

Re: spark.scheduler.pool seems not working in spark streaming

2014-07-31 Thread Tathagata Das
I filed a JIRA for this task for future reference. https://issues.apache.org/jira/browse/SPARK-2780 On Thu, Jul 31, 2014 at 5:37 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Whoa! That worked! I was half afraid it wouldn't, since I hadn't tried it myself. TD On Wed, Jul 30, 2014 at 8:32

Re: sbt package failed: wrong libraryDependencies for spark-streaming?

2014-07-31 Thread durin
Hi Tathagata, I didn't mean to say this was an error. According to the other thread I linked, right now there shouldn't be any conflicts, so I wanted to use streaming in the shell for easy testing. I thought I had to create my own project in which I'd add streaming as a dependency, but if I can

spark-submit registers the driver twice

2014-07-31 Thread salemi
Hi All, I am using the spark-submit command to submit my jar to a standalone cluster with two executors. When I use spark-submit, it deploys the application twice, and I see two application entries in the master UI. The master logs shown below also indicate that submit tries to deploy the

Re: sbt package failed: wrong libraryDependencies for spark-streaming?

2014-07-31 Thread Tathagata Das
Hey Simon, The stuff you are trying to show - logs, contents of spark-env.sh, etc. are missing from the email. At least I am not able to see it (viewing through gmail). Are you pasting screenshots? Those might get blocked out somehow! TD On Thu, Jul 31, 2014 at 6:55 PM, durin

Re: Issue with Spark on EC2 using spark-ec2 script

2014-07-31 Thread ratabora
Hey Dean! Thanks! Did you try running this on a local environment or one generated by the spark-ec2 script? The environment I am running on is a 4 data node 1 master spark cluster generated by the spark-ec2 script. I haven't modified anything in the environment except for adding data to the

Re: SQLCtx cacheTable

2014-07-31 Thread Michael Armbrust
cacheTable uses a special columnar caching technique that is optimized for SchemaRDDs. It is something similar to MEMORY_ONLY_SER, but not quite. You can specify the persistence level on the SchemaRDD itself and register that as a temporary table; however, it is likely you will not get as good
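A sketch contrasting the two approaches (table and file names are placeholders):

    import org.apache.spark.storage.StorageLevel

    // Columnar, SchemaRDD-aware caching:
    sqlContext.cacheTable("people")

    // Plain RDD persistence with an explicit level, registered as a table;
    // per the above, likely less compact than the columnar cache.
    val people = sqlContext.parquetFile("hdfs:///data/people.parquet")
    people.persist(StorageLevel.MEMORY_ONLY_SER)
    people.registerAsTable("people_ser")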

Re: HiveContext is creating metastore warehouse locally instead of in hdfs

2014-07-31 Thread Andrew Lee
Could you enable the HistoryServer and provide the properties and CLASSPATH for the spark-shell? And run the 'env' command to list your environment variables? By the way, what do the Spark logs say? Enable debug mode to see what's going on in spark-shell when it tries to interact and init HiveContext.

Re: Re: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]

2014-07-31 Thread Haiyang Fu
Glad to help you. On Fri, Aug 1, 2014 at 11:28 AM, Bin wubin_phi...@126.com wrote: Hi Haiyang, Thanks, it really is the reason. Best, Bin On 2014-07-31 08:05:34, Haiyang Fu haiyangfu...@gmail.com wrote: Have you tried to increase the driver memory? On Thu, Jul 31, 2014 at 3:54 PM, Bin

Re: java.lang.OutOfMemoryError: Java heap space

2014-07-31 Thread Haiyang Fu
Hi, here are two tips for you: 1. increase the parallelism level; 2. increase the driver memory. On Fri, Aug 1, 2014 at 12:58 AM, Sameer Tilak ssti...@live.com wrote: Hi everyone, I have the following configuration. I am currently running my app in local mode. val conf = new
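A sketch of both tips (values are illustrative; in local mode the driver JVM does the work, so its heap is the one that matters):

    // Tip 1: raise the parallelism level in the application's SparkConf.
    val conf = new SparkConf()
      .setAppName("ApproxStrMatch")
      .set("spark.default.parallelism", "16")

    // Tip 2: the driver heap is fixed at launch, e.g.:
    //   spark-submit --driver-memory 4g --class MyApp my-app.jar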