Tobias,

Your help with the problems I have run into has been invaluable. Thanks a lot!
Bill

On Wed, Jul 9, 2014 at 6:04 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> Bill,
>
> good to know you found your bottleneck. Unfortunately, I don't know how
> to solve this; until now, I have used Spark only with embarrassingly
> parallel operations such as map or filter. I hope someone else can
> provide more insight here.
>
> Tobias
>
>
> On Thu, Jul 10, 2014 at 9:57 AM, Bill Jay <bill.jaypeter...@gmail.com> wrote:
>
>> Hi Tobias,
>>
>> I have now done the repartition and run the program again, and I have
>> found a bottleneck in the whole program. In the streaming job, there is
>> a stage marked as "combineByKey at ShuffledDStream.scala:42" in the
>> Spark UI, and it is executed repeatedly. However, during some batches,
>> the number of executors allocated to this step is only 2, although I
>> used 300 workers and specified the partition number as 300. In those
>> cases the program is very slow, even though the data being processed
>> are not big.
>>
>> Do you know how to solve this issue?
>>
>> Thanks!
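For reference, a minimal sketch of the approach discussed in this thread,
repartitioning the Kafka input and passing an explicit partition count to
the key-based operation, written against the Spark 1.0 streaming API. The
Kafka connection details and the per-key counting logic are illustrative
assumptions, not Bill's actual job:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.kafka.KafkaUtils

    object RepartitionSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("RepartitionSketch")
        val ssc = new StreamingContext(conf, Seconds(60)) // 1-minute batches, as in the thread

        // Hypothetical Kafka connection details; substitute your own.
        val stream =
          KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))

        // Spread the received records across the cluster before any shuffle.
        val repartitioned = stream.repartition(300)

        // An explicit partition count on the key-based operation keeps the
        // shuffle stage (the "combineByKey at ShuffledDStream.scala" stage
        // in the UI) at 300 partitions instead of a small default.
        val counts = repartitioned
          .map { case (_, message) => (message, 1L) }
          .reduceByKey(_ + _, 300)

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

Whether the shuffle stage really runs with 300 partitions can be checked in
the same Spark UI stage view that exposed the bottleneck.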
>>
>>
>> On Wed, Jul 9, 2014 at 5:51 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>>
>>> Bill,
>>>
>>> I haven't worked with Yarn, but I would try adding a repartition()
>>> call after you receive your data from Kafka. I would be surprised if
>>> that didn't help.
>>>
>>>
>>> On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay <bill.jaypeter...@gmail.com> wrote:
>>>
>>>> Hi Tobias,
>>>>
>>>> I was using Spark 0.9 before, and the master I used was
>>>> yarn-standalone. In Spark 1.0, the master is either yarn-cluster or
>>>> yarn-client. I am not sure whether that is the reason why more
>>>> machines do not provide better scalability. What is the difference
>>>> between these two modes in terms of efficiency? Thanks!
>>>>
>>>>
>>>> On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>>>>
>>>>> Bill,
>>>>>
>>>>> do the additional 100 nodes receive any tasks at all? (I don't know
>>>>> which cluster manager you use, but with Mesos you could check the
>>>>> client logs in the web interface.) You might want to try something
>>>>> like repartition(N) or repartition(N*2) (with N the number of your
>>>>> nodes) after you receive your data.
>>>>>
>>>>> Tobias
>>>>>
>>>>>
>>>>> On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay <bill.jaypeter...@gmail.com> wrote:
>>>>>
>>>>>> Hi Tobias,
>>>>>>
>>>>>> Thanks for the suggestion. I have tried adding more nodes, going
>>>>>> from 300 to 400, but the running time did not improve.
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>>>>>>
>>>>>>> Bill,
>>>>>>>
>>>>>>> can't you just add more nodes in order to speed up the processing?
>>>>>>>
>>>>>>> Tobias
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay <bill.jaypeter...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I have a problem using Spark Streaming to accept input data and
>>>>>>>> update a result.
>>>>>>>>
>>>>>>>> The input data come from Kafka, and the output is a map that is
>>>>>>>> updated with historical data and reported every minute. My
>>>>>>>> current method is to set the batch size to 1 minute and use
>>>>>>>> foreachRDD to update this map, outputting the map at the end of
>>>>>>>> the foreachRDD function. However, the current issue is that the
>>>>>>>> processing cannot finish within one minute.
>>>>>>>>
>>>>>>>> I am thinking of updating the map whenever new data arrive
>>>>>>>> instead of doing the update when the whole RDD comes. Is there
>>>>>>>> any idea on how to achieve this with a better running time?
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Bill
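One standard way to realize the incremental update Bill describes in the
last message, not mentioned in the thread itself, is Spark Streaming's
updateStateByKey, which maintains per-key state across batches instead of
rebuilding the map inside foreachRDD. A minimal sketch follows; the socket
input and the per-key Long counters are illustrative assumptions:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    object RunningMapSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("RunningMapSketch")
        val ssc = new StreamingContext(conf, Seconds(60))
        ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") // required by updateStateByKey

        // Illustrative input: one (key, 1) pair per incoming line; in the
        // thread the data would come from Kafka instead.
        val pairs = ssc.socketTextStream("localhost", 9999).map(line => (line, 1L))

        // Fold each batch's new values into the running per-key state, so
        // the result map is maintained incrementally across batches rather
        // than rebuilt inside foreachRDD every minute.
        val runningMap = pairs.updateStateByKey[Long] { (newValues, current) =>
          Some(current.getOrElse(0L) + newValues.sum)
        }

        runningMap.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

Note that updateStateByKey needs a checkpoint directory and touches every
key in the state on each batch, so it trades some fixed per-batch cost for
the incremental updates asked about above.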