java.lang.NoClassDefFoundError: org/apache/spark/util/Vector

2014-03-27 Thread Kal El
I am getting this error when I try to run K-Means in spark-0.9.0:
    java.lang.NoClassDefFoundError: org/apache/spark/util/Vector
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
        at
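A minimal sketch of the usual first check, assuming an sbt build: a NoClassDefFoundError for a core Spark class at launch typically means the application was compiled against a different Spark version than the cluster runs, or that the Spark jars are missing from the runtime classpath.

    // build.sbt (sketch): keep the compile-time Spark dependency aligned with the cluster's build.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating"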

Re: Announcing Spark SQL

2014-03-27 Thread Pascal Voitot Dev
On 27 March 2014 09:47, andy petrella andy.petre...@gmail.com wrote: I'm hijacking the thread, but my 2c is that this feature is also important to enable ad-hoc queries, which are done at runtime. It doesn't remove interest in such macros for precompiled jobs of course, but it may not be the first

Re: Change print() in JavaNetworkWordCount

2014-03-27 Thread Eduardo Costa Alfaia
Thank you very much Sourav. BR. On 3/26/14, 17:29, Sourav Chandra wrote:
    def print() {
      def foreachFunc = (rdd: RDD[T], time: Time) => {
        val total = rdd.collect().toList
        println("---")
        println("Time: " + time)
        println
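A minimal sketch of the same idea without modifying print() itself, assuming a DStream named wordCounts already exists; collecting each RDD to the driver is only reasonable for small batches:

    // Print every element of each batch instead of the first few that print() shows.
    wordCounts.foreachRDD { (rdd, time) =>
      val total = rdd.collect().toList
      println("-------------------------------------------")
      println("Time: " + time)
      total.foreach(println)
    }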

Re: Announcing Spark SQL

2014-03-27 Thread Pascal Voitot Dev
On Thu, Mar 27, 2014 at 10:22 AM, andy petrella andy.petre...@gmail.com wrote: I just mean queries sent at runtime ^^, like for any RDBMS. In our project we have such a requirement to have a layer to play with the data (custom and low level service layer of a lambda arch), and something like

Re: Announcing Spark SQL

2014-03-27 Thread andy petrella
nope (what I said :-P) On Thu, Mar 27, 2014 at 11:05 AM, Pascal Voitot Dev pascal.voitot@gmail.com wrote: On Thu, Mar 27, 2014 at 10:22 AM, andy petrella andy.petre...@gmail.com wrote: I just mean queries sent at runtime ^^, like for any RDBMS. In our project we have such

Re: Announcing Spark SQL

2014-03-27 Thread Pascal Voitot Dev
On Thu, Mar 27, 2014 at 11:08 AM, andy petrella andy.petre...@gmail.com wrote: nope (what I said :-P) That's also my answer to my own question :D but I didn't understand that in your sentence: my 2c is that this feature is also important to enable ad-hoc queries, which are done at runtime.

Re: Announcing Spark SQL

2014-03-27 Thread yana
Does Shark not suit your needs? That's what we use at the moment and it's been good. Sent from my Samsung Galaxy S®4 Original message From: andy petrella andy.petre...@gmail.com Date: 03/27/2014 6:08 AM (GMT-05:00) To: user@spark.apache.org Subject: Re: Announcing Spark

Re: Announcing Spark SQL

2014-03-27 Thread andy petrella
Yes it could, of course. I didn't say that there is no tool to do it, though ;-). Andy On Thu, Mar 27, 2014 at 12:49 PM, yana yana.kadiy...@gmail.com wrote: Does Shark not suit your needs? That's what we use at the moment and it's been good Sent from my Samsung Galaxy S®4

Re: Announcing Spark SQL

2014-03-27 Thread Pascal Voitot Dev
when there is something new, it's also cool to let imagination fly far away ;) On Thu, Mar 27, 2014 at 2:20 PM, andy petrella andy.petre...@gmail.com wrote: Yes it could, of course. I didn't say that there is no tool to do it, though ;-). Andy On Thu, Mar 27, 2014 at 12:49 PM, yana

WikipediaPageRank Data Set

2014-03-27 Thread Niko Stahl
Hello, I would like to run the WikipediaPageRank example (https://github.com/amplab/graphx/blob/f8544981a6d05687fa950639cb1eb3c31e9b6bf5/examples/src/main/scala/org/apache/spark/examples/bagel/WikipediaPageRank.scala), but the Wikipedia dump XML files are no longer available on Freebase. Does anyone

spark streaming: what is awaitTermination()?

2014-03-27 Thread Diana Carroll
The API docs for ssc.awaitTermination say simply "Wait for the execution to stop. Any exceptions that occurs during the execution will be thrown in this thread." Can someone help me understand what this means? What causes execution to stop? Why do we need to wait for that to happen? I tried
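A minimal sketch of the usual driver pattern, assuming ssc is an already-configured StreamingContext: start() launches the receivers and the job scheduler on background threads, while awaitTermination() just blocks the calling thread so the driver does not exit, re-throwing any exception raised during processing. Execution stops when ssc.stop() is called or a failure occurs.

    ssc.start()            // begin receiving and processing batches in the background
    ssc.awaitTermination() // block here until ssc.stop() is called or an error is thrown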

spark streaming and the spark shell

2014-03-27 Thread Diana Carroll
I'm working with spark streaming using spark-shell, and hoping folks could answer a few questions I have. I'm doing WordCount on a socket stream:
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.Seconds
    var
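A minimal sketch of the standard socket WordCount the post refers to, assuming a spark-shell session (so sc already exists) and nc -lk 9999 running on localhost; the batch interval is illustrative:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val ssc = new StreamingContext(sc, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()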

StreamingContext.transform on a DStream

2014-03-27 Thread Adrian Mocanu
Found this transform fn in StreamingContext which takes in a DStream[_] and a function which acts on each of its RDDs. Unfortunately I can't figure out how to transform my DStream[(String,Int)] into DStream[_]. /** * Create a new DStream in which each RDD is generated by applying a function on

RE: StreamingContext.transform on a DStream

2014-03-27 Thread Adrian Mocanu
Please disregard, I didn't see the Seq wrapper. From: Adrian Mocanu [mailto:amoc...@verticalscope.com] Sent: March-27-14 11:57 AM To: u...@spark.incubator.apache.org Subject: StreamingContext.transform on a DStream Found this transform fn in StreamingContext which takes in a DStream[_] and a
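A minimal sketch of the Seq wrapper being referred to, assuming ssc and a pairs: DStream[(String, Int)] already exist; the filter is only a placeholder transformation:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.Time

    // StreamingContext.transform takes Seq[DStream[_]], so wrap the single typed stream
    // in a Seq and cast the untyped RDD back to its element type inside the function.
    val out = ssc.transform(Seq(pairs), (rdds: Seq[RDD[_]], t: Time) => {
      rdds.head.asInstanceOf[RDD[(String, Int)]].filter(_._2 > 0)
    })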

Re: Running a task once on each executor

2014-03-27 Thread dmpour23
How exactly can rdd.mapPartitions be executed once in each VM? I am running mapPartitions and the call function does not seem to execute the code.
    JavaPairRDD<String, String> twos = input.map(new Split()).sortByKey().partitionBy(new HashPartitioner(k));
    twos.values().saveAsTextFile(args[2]);
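A minimal Scala sketch of the relevant behaviour, with rdd and initOnce() standing in for the poster's data and setup code: mapPartitions runs once per partition (not once per JVM) and is lazy, so its body only executes when an action consumes the resulting RDD.

    val prepared = rdd.mapPartitions { iter =>
      initOnce()   // hypothetical per-partition setup
      iter         // pass the records through unchanged
    }
    prepared.count() // an action is required, otherwise the function above never runs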

Re: GC overhead limit exceeded

2014-03-27 Thread Ognen Duzlevski
Look at the tuning guide on Spark's webpage for strategies to cope with this. I have run into quite a few memory issues like these, some are resolved by changing the StorageLevel strategy and employing things like Kryo, some are solved by specifying the number of tasks to break down a given
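A minimal sketch of the kind of tuning-guide settings mentioned here, using the 0.9-style SparkConf API; the values and the rdd name are illustrative only:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // cheaper serialization
      .set("spark.default.parallelism", "200")                               // more, smaller tasks
    // A storage level that can spill to disk instead of keeping everything deserialized in memory.
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)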

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-27 Thread Scott Clasen
I think now that this is because spark.local.dir is defaulting to /tmp, and since the tasks are not running on the same machine, the file is not found when the second task takes over. How do you set spark.local.dir appropriately when running on mesos?
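A minimal sketch of setting the property programmatically with a 0.9-style SparkConf; the master URL and path are placeholders, and the directory must exist and be writable inside every Mesos executor container:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("mesos://zk://zk1:2181/mesos")    // placeholder master URL
      .setAppName("streaming-job")                 // placeholder app name
      .set("spark.local.dir", "/mnt/spark-local")  // scratch space instead of /tmp
    val sc = new SparkContext(conf)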

Re: GC overhead limit exceeded

2014-03-27 Thread Andrew Or
Are you caching a lot of RDDs? If so, maybe you should unpersist() the ones that you're not using. Also, if you're on 0.9, make sure spark.shuffle.spill is enabled (which it is by default). This allows your application to spill in-memory content to disk if necessary. How much memory are you

KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Scott Clasen
I have a simple streaming job that creates a kafka input stream on a topic with 8 partitions, and does a forEachRDD. The job and tasks are running on mesos, and there are two tasks running, but only 1 task doing anything. I also set spark.streaming.concurrentJobs=8 but still there is only 1 task
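A minimal sketch of a workaround that was commonly suggested for this situation, assuming the 0.9 KafkaUtils API and an existing StreamingContext ssc; the ZooKeeper address, group and topic names are placeholders. A single createStream gives one receiver, so creating several streams and unioning them spreads consumption across executors:

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numStreams = 8   // e.g. one per Kafka partition
    val kafkaStreams = (1 to numStreams).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("my-topic" -> 1))
    }
    val messages = ssc.union(kafkaStreams)
    messages.foreachRDD { rdd => println("batch size: " + rdd.count()) }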

Re: YARN problem using an external jar in worker nodes Inbox x

2014-03-27 Thread Sung Hwan Chung
Yea, it's in standalone mode and I did use the SparkContext.addJar method and tried setting setExecutorEnv SPARK_CLASSPATH, etc., but none of it worked. I finally made it work by modifying the ClientBase.scala code, where I set 'appMasterOnly' to false before the addJars contents were added to

Re: GC overhead limit exceeded

2014-03-27 Thread Sai Prasanna
No, I am running on 0.8.1. Yes, I am caching a lot: I am benchmarking a simple code in Spark where 512mb, 1g and 2g text files are taken, some basic intermediate operations are done, and the intermediate results which will be used in subsequent operations are cached. I thought that, we need not

Re: Announcing Spark SQL

2014-03-27 Thread Patrick Wendell
Hey Rohit, I think external tables based on Cassandra or other datastores will work out of the box if you build Catalyst with Hive support. Michael may have feelings about this, but I'd guess the longer-term design for having schema support for Cassandra/HBase etc. likely wouldn't rely on hive

Re:

2014-03-27 Thread Mayur Rustagi
You have to raise the global limit as root. Also, you have to do that on the whole cluster. Regards, Mayur. Mayur Rustagi, Ph: +1 (760) 203 3257, http://www.sigmoidanalytics.com, @mayur_rustagi (https://twitter.com/mayur_rustagi). On Thu, Mar 27, 2014 at 4:07 AM, Hahn Jiang

Re: GC overhead limit exceeded

2014-03-27 Thread Syed A. Hashmi
Which storage scheme are you using? I am guessing it is MEMORY_ONLY. For large datasets, MEMORY_AND_DISK or MEMORY_AND_DISK_SER work better. You can call unpersist on an RDD to remove it from the cache, though. On Thu, Mar 27, 2014 at 11:57 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: No, I am

how to create a DStream from bunch of RDDs

2014-03-27 Thread Adrian Mocanu
I create several RDDs by merging several consecutive RDDs from a DStream. Is there a way to add these new RDDs to a DStream? -Adrian
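A minimal sketch of one way to do this with StreamingContext.queueStream; ssc and mergedRdds (the merged RDDs described above) are assumed to exist, and the element type is only an example:

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    val queue = new mutable.Queue[RDD[(String, Int)]]()
    queue ++= mergedRdds                  // enqueue the already-built RDDs
    val dstream = ssc.queueStream(queue)  // by default one queued RDD is consumed per batch interval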

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-27 Thread Scott Clasen
Heh, sorry, that wasn't a clear question. I know 'how' to set it but don't know what value to use in a mesos cluster, since the processes are running in lxc containers and won't be sharing a filesystem (or machine, for that matter). I can't use an s3n:// url for the local dir, can I?

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Scott Clasen
Actually, looking closer, it is stranger than I thought: in the Spark UI, one executor has executed 4 tasks, and one has executed 1928. Can anyone explain the workings of a KafkaInputStream wrt kafka partitions and mapping to spark executors and tasks?

Re: spark streaming and the spark shell

2014-03-27 Thread Evgeny Shishkin
2. I notice that once I start ssc.start(), my stream starts processing and continues indefinitely... even if I close the socket on the server end (I'm using the unix command nc to mimic a server, as explained in the streaming programming guide). Can I tell my stream to detect if it's lost a

Re: YARN problem using an external jar in worker nodes Inbox x

2014-03-27 Thread Sung Hwan Chung
Well, it says that the jar was successfully added but can't reference classes from it. Does this have anything to do with this bug? http://stackoverflow.com/questions/22457645/when-to-use-spark-classpath-or-sparkcontext-addjar On Thu, Mar 27, 2014 at 2:57 PM, Sandy Ryza sandy.r...@cloudera.com

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Evgeny Shishkin
On 28 Mar 2014, at 00:34, Scott Clasen scott.cla...@gmail.com wrote: Actually looking closer it is stranger than I thought, in the spark UI, one executor has executed 4 tasks, and one has executed 1928 Can anyone explain the workings of a KafkaInputStream wrt kafka partitions and mapping to

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Scott Clasen
Evgeniy Shishkin wrote: So, at the bottom, kafka input stream just does not work. That was the conclusion I was coming to as well. Are there open tickets around fixing this up?

Re: YARN problem using an external jar in worker nodes Inbox x

2014-03-27 Thread Sandy Ryza
That bug only appears to apply to spark-shell. Do things work in yarn-client mode or on a standalone cluster? Are you passing a path with parent directories to addJar? On Thu, Mar 27, 2014 at 3:01 PM, Sung Hwan Chung coded...@cs.stanford.eduwrote: Well, it says that the jar was successfully
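A minimal sketch of the call being discussed, assuming a running SparkContext; the path is an example only and, per the question above, should be an absolute path reachable from the driver:

    sc.addJar("/opt/jobs/lib/my-dependency.jar") // ships the jar so executors can load its classes in tasks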

Re: spark streaming and the spark shell

2014-03-27 Thread Tathagata Das
Seems like the configuration of the Spark worker is not right. Either the worker has not been given enough memory or the allocation of the memory to the RDD storage needs to be fixed. If configured correctly, the Spark workers should not get OOMs. On Thu, Mar 27, 2014 at 2:52 PM, Evgeny
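A minimal sketch of the kind of configuration meant here, assuming a standalone driver program built with the 0.9 SparkConf API (in spark-shell the same properties would be passed as JVM system properties instead); the values are illustrative:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")          // heap available to each executor
      .set("spark.storage.memoryFraction", "0.5")  // share of that heap reserved for cached/received blocks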

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-27 Thread Evgeny Shishkin
On 28 Mar 2014, at 01:44, Tathagata Das tathagata.das1...@gmail.com wrote: The more I think about it the problem is not about /tmp, its more about the workers not having enough memory. Blocks of received data could be falling out of memory before it is getting processed. BTW, what is the

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Scott Clasen
Thanks everyone for the discussion. Just to note, I restarted the job yet again, and this time there are indeed tasks being executed by both worker nodes. So the behavior does seem inconsistent/broken atm. Then I added a third node to the cluster, and a third executor came up, and everything

Re: Running a task once on each executor

2014-03-27 Thread deenar.toraskar
Christopher, sorry, I might be missing the obvious, but how do I get my function called on all executors used by the app? I don't want to use RDDs unless necessary. Once I start my shell or app, how do I get TaskNonce.getSingleton().doThisOnce() executed on each executor? @dmpour
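A minimal sketch of a common workaround, assuming TaskNonce from the earlier post and a running SparkContext: there is no RDD-free hook for this, but mapping over a dummy RDD with enough partitions makes it very likely (though not strictly guaranteed) that every executor runs the singleton at least once.

    val slots = sc.defaultParallelism          // rough proxy for total executor cores
    sc.parallelize(1 to slots * 4, slots * 4)  // enough partitions to touch every executor
      .foreachPartition(_ => TaskNonce.getSingleton().doThisOnce())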

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Evgeny Shishkin
On 28 Mar 2014, at 02:10, Scott Clasen scott.cla...@gmail.com wrote: Thanks everyone for the discussion. Just to note, I restarted the job yet again, and this time there are indeed tasks being executed by both worker nodes. So the behavior does seem inconsistent/broken atm. Then I added

Re: GC overhead limit exceeded

2014-03-27 Thread Sai Prasanna
I didn't mention anything, so by default it should be MEMORY_AND_DISK, right? My doubt was: between two different experiments, do the RDDs cached in memory need to be unpersisted, or does it not matter? On Fri, Mar 28, 2014 at 1:43 AM, Syed A. Hashmi shas...@cloudera.com wrote: Which storage
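A minimal sketch tying the thread's advice together; data stands in for the benchmarked RDD. Note that plain cache() defaults to MEMORY_ONLY, not MEMORY_AND_DISK, so choosing a disk-backed level and unpersisting between experiments are both explicit steps:

    import org.apache.spark.storage.StorageLevel

    val cached = data.persist(StorageLevel.MEMORY_AND_DISK_SER) // cache() alone would mean MEMORY_ONLY
    // ... run one experiment ...
    cached.unpersist()  // drop the cached blocks before the next experiment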

Re: pySpark memory usage

2014-03-27 Thread Matei Zaharia
I see, did this also fail with previous versions of Spark (0.9 or 0.8)? We’ll try to look into these, seems like a serious error. Matei On Mar 27, 2014, at 7:27 PM, Jim Blomo jim.bl...@gmail.com wrote: Thanks, Matei. I am running Spark 1.0.0-SNAPSHOT built for Hadoop 1.0.4 from GitHub on

Re: Configuring shuffle write directory

2014-03-27 Thread Tsai Li Ming
Can anyone help? How can I configure a different spark.local.dir for each executor? On 23 Mar, 2014, at 12:11 am, Tsai Li Ming mailingl...@ltsai.com wrote: Hi, Each of my worker nodes has its own unique spark.local.dir. However, when I run spark-shell, the shuffle writes are always

Setting SPARK_MEM higher than available memory in driver

2014-03-27 Thread Tsai Li Ming
Hi, My worker nodes have more memory than the host I'm submitting my driver program from, but it seems that SPARK_MEM is also setting the Xmx of the spark shell? $ SPARK_MEM=100g MASTER=spark://XXX:7077 bin/spark-shell Java HotSpot(TM) 64-Bit Server VM warning: INFO:
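A minimal sketch of the alternative, assuming a standalone driver program: SPARK_MEM applies to the driver JVM (hence the large -Xmx on the shell host) as well as to the executors, whereas the property below affects only executor memory; the value is illustrative, and for spark-shell it can be passed as a JVM system property instead.

    import org.apache.spark.SparkConf

    val conf = new SparkConf().set("spark.executor.memory", "100g") // executor heap only; the driver's Xmx is untouched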