Re: Why does Spark require this object to be serializable?

2014-04-29 Thread Earthson
The code is here: https://github.com/Earthson/sparklda/blob/master/src/main/scala/net/earthson/nlp/lda/lda.scala I've changed it from Broadcast to Serializable, and now it works :) But there are too many RDD caches -- is that the problem?

Issue during Spark streaming with ZeroMQ source

2014-04-29 Thread Francis . Hu
Hi all, I installed spark-0.9.1 and zeromq 4.0.1, and then ran the examples below: ./bin/run-example org.apache.spark.streaming.examples.SimpleZeroMQPublisher tcp://127.0.1.1:1234 foo.bar ./bin/run-example org.apache.spark.streaming.examples.ZeroMQWordCount local[2] tcp://127.0.1.1:1234 foo

Re: Issue during Spark streaming with ZeroMQ source

2014-04-29 Thread Prashant Sharma
Unfortunately zeromq 4.0.1 is not supported. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/ZeroMQWordCount.scala#L63 notes the supported version. You will need that version of zeromq to see it work. Basically I have seen it working nicely with

Re: Re: Issue during Spark streaming with ZeroMQ source

2014-04-29 Thread Prashant Sharma
Well that is not going to be easy, simply because we depend on akka-zeromq for zeromq support. And since akka does not yet support the latest zeromq library, I doubt there is anything simple that can be done to support it. Prashant Sharma On Tue, Apr 29, 2014 at 2:44 PM, Francis.Hu

Re: launching concurrent jobs programmatically

2014-04-29 Thread ishaaq
Very interesting. One of Spark's attractive features is being able to work interactively via spark-shell. Is something like that still available via Ooyala's job server? Or do you use spark-shell independently of that? If the latter, how do you manage custom jars for spark-shell? Our

Fwd: Spark RDD cache memory usage

2014-04-29 Thread Han JU
Hi, By default a fraction of the executor memory (60%) is reserved for RDD caching, so if there's no explicit caching in the code (e.g. rdd.cache()), or if we persist an RDD with StorageLevel.DISK_ONLY, is this part of memory wasted? Does Spark allocate the RDD cache memory dynamically? Or does
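A minimal sketch (not from the thread) of how that 60% fraction is configured in this era through the spark.storage.memoryFraction setting; the application name and value below are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // If a job never caches RDDs in memory, the storage fraction can be lowered so
    // more of the executor heap is left for shuffles and user code.
    // 0.6 was the default; 0.2 here is just an example value.
    val conf = new SparkConf()
      .setAppName("no-cache-job")
      .set("spark.storage.memoryFraction", "0.2")
    val sc = new SparkContext(conf)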

Re: Joining not-pair RDDs in Spark

2014-04-29 Thread Daniel Darabos
Create a key and join on that. val callPricesByHour = callPrices.map(p => ((p.year, p.month, p.day, p.hour), p)) val callsByHour = calls.map(c => ((c.year, c.month, c.day, c.hour), c)) val bills = callPricesByHour.join(callsByHour).mapValues({ case (p, c) => BillRow(c.customer, c.hour, c.minutes *
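A self-contained sketch of the same composite-key join; the case classes and field names (pricePerMinute, minutes, and so on) are assumptions standing in for the poster's actual types:

    import org.apache.spark.rdd.RDD

    // Hypothetical types standing in for the poster's data.
    case class CallPrice(year: Int, month: Int, day: Int, hour: Int, pricePerMinute: Double)
    case class Call(customer: String, year: Int, month: Int, day: Int, hour: Int, minutes: Double)
    case class BillRow(customer: String, hour: Int, amount: Double)

    def computeBills(callPrices: RDD[CallPrice], calls: RDD[Call]): RDD[((Int, Int, Int, Int), BillRow)] = {
      // Key both RDDs by the same composite (year, month, day, hour) tuple, then join on it.
      val callPricesByHour = callPrices.map(p => ((p.year, p.month, p.day, p.hour), p))
      val callsByHour      = calls.map(c => ((c.year, c.month, c.day, c.hour), c))
      callPricesByHour.join(callsByHour).mapValues {
        case (p, c) => BillRow(c.customer, c.hour, c.minutes * p.pricePerMinute)
      }
    }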

Re: User/Product Clustering with pySpark ALS

2014-04-29 Thread Nick Pentreath
There's no easy way to do this currently. The pieces are there in the PySpark regression code, which should be adaptable, but you'd have to roll your own solution. This is something I also want, so I intend to put together a pull request for it soon. On Tue, Apr

Re: Storage information about an RDD from the API

2014-04-29 Thread Koert Kuipers
SparkContext.getRDDStorageInfo On Tue, Apr 29, 2014 at 12:34 PM, Andras Nemeth andras.nem...@lynxanalytics.com wrote: Hi, Is it possible to know from code whether an RDD is cached, and more precisely, how many of its partitions are cached in memory and how many are cached on disk? I
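A small sketch of reading that information; the RDDInfo field names (name, numCachedPartitions, numPartitions, memSize, diskSize) are assumed from the storage API of this era:

    // For each RDD the SparkContext knows about, report how many partitions are
    // cached and where the cached bytes live (sc is an existing SparkContext).
    sc.getRDDStorageInfo.foreach { info =>
      println(s"RDD ${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
        s"mem=${info.memSize} bytes, disk=${info.diskSize} bytes")
    }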

Re: How to declare Tuple return type for a function

2014-04-29 Thread Roger Hoover
The return type should be RDD[(Int, Int, Int)] because sc.textFile() returns an RDD. Try adding an import for the RDD type to get rid of the compile error. import org.apache.spark.rdd.RDD On Mon, Apr 28, 2014 at 6:22 PM, SK skrishna...@gmail.com wrote: Hi, I am a new user of Spark. I have
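A sketch of the corrected signature, reconstructed from the fragment in the original question below; the trait, file path, and parsing logic are assumptions:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    trait VectorSim {
      def input(master: String): RDD[(Int, Int, Int)]
    }

    class Sim extends VectorSim {
      // The return type is RDD[(Int, Int, Int)], not (Int, Int, Int),
      // because sc.textFile(...).map(...) produces an RDD.
      override def input(master: String): RDD[(Int, Int, Int)] = {
        val sc = new SparkContext(master, "Test")
        sc.textFile("ratings.txt").map { line =>
          val Array(a, b, c) = line.split(",").map(_.trim.toInt)
          (a, b, c)
        }
      }
    }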

Python Spark on YARN

2014-04-29 Thread Guanhua Yan
Hi all: Is it possible to develop Spark programs in Python and run them on YARN? From the Python SparkContext class, it doesn't seem to have such an option. Thank you, - Guanhua === Guanhua Yan, Ph.D. Information Sciences Group (CCS-3) Los Alamos National Laboratory

How to declare Tuple return type for a function

2014-04-29 Thread SK
Hi, I am a new user of Spark. I have a class that defines a function as follows. It returns a tuple: (Int, Int, Int). class Sim extends VectorSim { override def input(master: String): (Int, Int, Int) = { sc = new SparkContext(master, "Test") val ratings =

What is Seq[V] in updateStateByKey?

2014-04-29 Thread Adrian Mocanu
What is Seq[V] in updateStateByKey? Does this store the collected tuples of the RDD in a collection? Method signature: def updateStateByKey[S: ClassTag](updateFunc: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)] In my case I used Seq[Double], assuming a sequence of Doubles in the RDD; the

packaging time

2014-04-29 Thread SK
Each time I run sbt/sbt assembly to compile my program, the packaging takes about 370 seconds (about 6 minutes). How can I reduce this time? Thanks

Delayed Scheduling - Setting spark.locality.wait.node parameter in interactive shell

2014-04-29 Thread Sai Prasanna
Hi All, I have replication factor 3 in my HDFS. I ran my experiments with 3 datanodes. Now I just added another node with no data on it. When I ran the job again, Spark launches non-local tasks on it and the time taken is more than what it took on the 3-node cluster. I think delay scheduling fails here

Re: Spark: issues with running a sbt fat jar due to akka dependencies

2014-04-29 Thread Koert Kuipers
You need to merge the reference.conf files and then it's no longer an issue. See the build file for Spark itself: case "reference.conf" => MergeStrategy.concat On Tue, Apr 29, 2014 at 3:32 PM, Shivani Rao raoshiv...@gmail.com wrote: Hello folks, I was going to post this question to the spark user group as
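A sketch of that merge rule in an sbt-assembly build; only the reference.conf case comes from this thread, and the surrounding syntax follows the sbt-assembly documentation of that era (adjust for your plugin version):

    // build.sbt (sbt-assembly 0.x-era syntax)
    import AssemblyKeys._

    assemblySettings

    mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
      {
        case "reference.conf" => MergeStrategy.concat // concatenate akka's reference.conf files
        case x                => old(x)               // fall back to the plugin default
      }
    }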

Re: packaging time

2014-04-29 Thread Mark Hamstra
Tip: read the wiki -- https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools On Tue, Apr 29, 2014 at 12:48 PM, Daniel Darabos daniel.dara...@lynxanalytics.com wrote: Tips from my experience. Disable scaladoc: sources in doc in Compile := List() Do not package the source:
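A sketch of those two settings in build.sbt; the first line is quoted from the message, while the exact key for skipping the sources jar is an assumption since that tip was truncated:

    // Disable scaladoc generation (quoted from the message above).
    sources in doc in Compile := List()

    // Do not package the source -- assumed to mean skipping the sources artifact.
    publishArtifact in packageSrc := false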

Re: What is Seq[V] in updateStateByKey?

2014-04-29 Thread Sean Owen
The original DStream is of (K,V). This function creates a DStream of (K,S). Each time slice brings one or more new V for each K. The old state S (can be different from V!) for each K -- possibly non-existent -- is updated in some way by a bunch of new V, to produce a new state S -- which also

Re: What is Seq[V] in updateStateByKey?

2014-04-29 Thread Tathagata Das
You may have already seen it, but I will mention it anyways. This example may help. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/StatefulNetworkWordCount.scala Here the state is essentially a running count of the words seen. So the value
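For reference, a minimal sketch of the update function along the lines of that example; the DStream name and element types are illustrative:

    // Each batch, newValues holds the new per-batch counts seen for a word (the Seq[V]),
    // and runningCount holds the word's previous state (None the first time the key appears).
    val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) => {
      Some(newValues.sum + runningCount.getOrElse(0))
    }

    // wordDstream: DStream[(String, Int)] of (word, 1) pairs, assumed to exist.
    val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)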

Re: Python Spark on YARN

2014-04-29 Thread Matei Zaharia
This will be possible in 1.0 after this pull request: https://github.com/apache/spark/pull/30 Matei On Apr 29, 2014, at 9:51 AM, Guanhua Yan gh...@lanl.gov wrote: Hi all: Is it possible to develop Spark programs in Python and run them on YARN? From the Python SparkContext class, it

Spark's behavior

2014-04-29 Thread Eduardo Costa Alfaia
Hi TD, In my tests with Spark Streaming, I'm using the (modified) JavaNetworkWordCount code and a program I wrote that sends words to the Spark worker; I use TCP as the transport. I verified that after starting Spark, it connects to my source, which then starts sending, but the first word count

rdd ordering gets scrambled

2014-04-29 Thread Mohit Jaggi
Hi, I started with a text file (CSV) of data sorted by its first column and parsed it into Scala objects using a map operation in Scala. Then I used more maps to add some extra info to the data and saved it as a text file. The final text file is not sorted. What do I need to do to keep the order from the

Re: Spark's behavior

2014-04-29 Thread Eduardo Costa Alfaia
Hi TD, We are not using the streaming context with master local; we have 1 master, 8 workers, and 1 word source. The command line we are using is: bin/run-example org.apache.spark.streaming.examples.JavaNetworkWordCount spark://192.168.0.13:7077 On Apr 30, 2014, at 0:09, Tathagata Das

Re: Python Spark on YARN

2014-04-29 Thread Guanhua Yan
Thanks, Matei. Will take a look at it. Best regards, Guanhua From: Matei Zaharia matei.zaha...@gmail.com Reply-To: user@spark.apache.org Date: Tue, 29 Apr 2014 14:19:30 -0700 To: user@spark.apache.org Subject: Re: Python Spark on YARN This will be possible in 1.0 after this pull request:

Re: Spark's behavior

2014-04-29 Thread Tathagata Das
Strange! Can you just do lines.print() to print the raw data instead of doing the word count? Beyond that we can do two things. 1. Can you check the Spark stage UI to see whether there are stages running during the 30-second period you referred to? 2. If you upgrade to the Spark master branch (or Spark

RE: Shuffle Spill Issue

2014-04-29 Thread Liu, Raymond
Hi Daniel, Thanks for your reply. I think reduceByKey will also do a map-side combine, so after the combine the result is the same: for each partition, one entry per distinct word. In my case with the Java serializer, a 240MB dataset yields around 70MB of shuffle data. Only that shuffle

Re: NoSuchMethodError from Spark Java

2014-04-29 Thread wxhsdp
I met the same issue when updating to Spark 0.9.1 (svn checkout https://github.com/apache/spark/). Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.SparkContext$.jarOfClass(Ljava/lang/Class;)Lscala/collection/Seq; at

How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
Hi, I am running a WordCount program which counts words from HDFS, and I noticed that the serializer part of the code takes a lot of CPU time. On a 16-core/32-thread node, the total throughput is around 50MB/s with JavaSerializer, and if I switch to KryoSerializer, it doubles to around
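A sketch of the serializer switch being compared; the property name and Kryo class are the standard Spark settings of this era, and the rest is illustrative:

    import org.apache.spark.SparkConf

    // Switch the serializer used for shuffle data and task results to Kryo.
    val conf = new SparkConf()
      .setAppName("wordcount-shuffle-test")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")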

Re: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Patrick Wendell
Is this the serialization throughput per task or the serialization throughput for all the tasks? On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond raymond@intel.com wrote: Hi I am running a WordCount program which count words from HDFS, and I noticed that the serializer part of code

RE: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
For all the tasks, say 32 tasks in total. Best Regards, Raymond Liu -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Is this the serialization throughput per task or the serialization throughput for all the tasks? On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Mingyu Kim
Hi Patrick, I'm a little confused about your comment that RDDs are not ordered. As far as I know, RDDs keep a list of partitions that are ordered, and this is why I can call RDD.take() and get the same first k rows every time I call it, and why RDD.take() returns the same entries as RDD.map(...).take()

RE: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
By the way, to be clear, I run repartition first to make all the data go through the shuffle instead of running reduceByKey etc. directly (which would reduce the data that needs to be shuffled and serialized); thus all 50MB/s of data from HDFS goes to the serializer. (In fact, I also tried generating data in memory

Re: JavaSparkConf

2014-04-29 Thread Patrick Wendell
This class was made to be Java-friendly so that we wouldn't have to use two versions. The class itself is simple. But I agree adding Java setters would be nice. On Tue, Apr 29, 2014 at 8:32 PM, Soren Macbeth so...@yieldbot.com wrote: There is a JavaSparkContext, but no JavaSparkConf object. I

RE: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
The latter: total throughput aggregated across all cores. Best Regards, Raymond Liu -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Wednesday, April 30, 2014 1:22 PM To: user@spark.apache.org Subject: Re: How fast would you expect shuffle serialize to be? Hm -

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Patrick Wendell
You are right, once you sort() the RDD, then yes it has a well-defined ordering. But that ordering is lost as soon as you transform the RDD, including if you union it with another RDD. On Tue, Apr 29, 2014 at 10:22 PM, Mingyu Kim m...@palantir.com wrote: Hi Patrick, I'm a little confused