Re: Use Spark Streaming to update result whenever data come

2014-07-10 Thread Bill Jay
Tobias, Your help with the problems I have run into has been very valuable. Thanks a lot! Bill On Wed, Jul 9, 2014 at 6:04 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, good to know you found your bottleneck. Unfortunately, I don't know how to solve this; until now, I have used Spark only

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Bill Jay
Hi Tobias, I was using Spark 0.9 before, and the master I used was yarn-standalone. In Spark 1.0, the master will be either yarn-cluster or yarn-client. I am not sure whether that is the reason why more machines do not provide better scalability. What is the difference between these two modes in

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Tobias Pfeiffer
Bill, I haven't worked with YARN, but I would try adding a repartition() call after you receive your data from Kafka. I would be surprised if that didn't help. On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, I was using Spark 0.9 before and the master
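
A minimal Scala sketch of this suggestion; the ZooKeeper address, consumer group, topic name, and the partition count 40 are placeholders rather than values from the thread:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("KafkaRepartitionSketch")
    val ssc = new StreamingContext(conf, Minutes(1))

    // Receive from Kafka; the second element of each tuple is the message body.
    val lines = KafkaUtils
      .createStream(ssc, "zk-host:2181", "consumer-group", Map("topic" -> 1))
      .map(_._2)

    // Spread the received data across the whole cluster before any heavy work;
    // otherwise it may stay on the handful of nodes running the receivers.
    val repartitioned = lines.repartition(40)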

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Bill Jay
Hi Tobias, Now I did the repartition and ran the program again. I found a bottleneck in the whole program. In the streaming job, there is a stage marked as *combineByKey at ShuffledDStream.scala:42* in the Spark UI. This stage is executed repeatedly. However, during some batches, the number of executors
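
For reference, a combineByKey stage like this usually comes from a key-based transformation (reduceByKey, groupByKey, and similar) applied to each batch. A sketch, continuing from the Kafka example above with a hypothetical keyed stream, of passing an explicit partition count so the shuffle behind that stage keeps a fixed parallelism:

    import org.apache.spark.streaming.StreamingContext._  // implicit pair-DStream operations

    // Hypothetical keyed stream derived from the Kafka input above:
    // one (key, 1L) pair per record.
    val pairs = repartitioned.map(line => (line.split(",")(0), 1L))

    // Passing an explicit partition count pins the parallelism of the shuffle
    // behind the combineByKey stage instead of leaving it to the default.
    val counts = pairs.reduceByKey(_ + _, 400)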

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Tobias Pfeiffer
Bill, good to know you found your bottleneck. Unfortunately, I don't know how to solve this; until now, I have used Spark only with embarrassingly parallel operations such as map or filter. I hope someone else might provide more insight here. Tobias On Thu, Jul 10, 2014 at 9:57 AM, Bill Jay

Re: Use Spark Streaming to update result whenever data come

2014-07-08 Thread Bill Jay
Hi Tobias, Thanks for the suggestion. I have tried adding more nodes, from 300 to 400. It seems the running time did not improve. On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, can't you just add more nodes in order to speed up the processing? Tobias

Re: Use Spark Streaming to update result whenever data come

2014-07-08 Thread Tobias Pfeiffer
Bill, do the additional 100 nodes receive any tasks at all? (I don't know which cluster you use, but with Mesos you could check the client logs in the web interface.) You might want to try something like repartition(N) or repartition(N*2) (where N is the number of your nodes) after you receive your data.

Use Spark Streaming to update result whenever data come

2014-07-02 Thread Bill Jay
Hi all, I have a problem using Spark Streaming to accept input data and update a result. The input data comes from Kafka, and the output is a map that is updated with historical data every minute. My current method is to set the batch size to 1 minute and use foreachRDD to update
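
A minimal Scala sketch of the setup described here, assuming 1-minute batches from Kafka and a driver-side map holding the historical result; the connection details and the counting logic are placeholder assumptions, not taken from the thread:

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext._   // implicit pair-RDD operations
    import org.apache.spark.streaming.{Minutes, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import scala.collection.mutable

    val conf = new SparkConf().setAppName("StreamingUpdateSketch")
    val ssc = new StreamingContext(conf, Minutes(1))   // 1-minute batches

    // Historical result, held on the driver and merged once per batch.
    val history = mutable.Map.empty[String, Long].withDefaultValue(0L)

    val lines = KafkaUtils
      .createStream(ssc, "zk-host:2181", "consumer-group", Map("topic" -> 1))
      .map(_._2)

    lines.foreachRDD { rdd =>
      // Aggregate the batch on the cluster, then fold it into the driver-side map.
      val batchCounts = rdd.map(record => (record, 1L)).reduceByKey(_ + _).collect()
      for ((key, count) <- batchCounts) history(key) += count
    }

    ssc.start()
    ssc.awaitTermination()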