Tobias,

Your help with the problems I have run into has been invaluable. Thanks a lot!
Bill

On Wed, Jul 9, 2014 at 6:04 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> Bill,
>
> good to know you found your bottleneck. Unfortunately, I don't know how
> to solve this; until now, I have used Spark only with embarrassingly
> parallel operations such as map or filter. I hope someone else can
> provide more insight here.
>
> Tobias
>
>
> On Thu, Jul 10, 2014 at 9:57 AM, Bill Jay <bill.jaypeter...@gmail.com> wrote:
>
>> Hi Tobias,
>>
>> I have now done the repartition and run the program again, and I have
>> found a bottleneck in the whole program. In the streaming job, there is
>> a stage marked as "combineByKey at ShuffledDStream.scala:42" in the
>> Spark UI, and it is executed repeatedly. However, during some batches,
>> the number of executors allocated to this step is only 2, although I
>> used 300 workers and specified the partition number as 300. In those
>> cases the program is very slow, even though the data being processed
>> are not big.
>>
>> Do you know how to solve this issue?
>>
>> Thanks!
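For reference, a minimal sketch of the approach discussed in this thread,
repartitioning the Kafka input and passing an explicit partition count to
the key-based operation, written against the Spark 1.0 streaming API. The
Kafka connection details and the per-key counting logic are illustrative
assumptions, not Bill's actual job:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.kafka.KafkaUtils

    object RepartitionSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("RepartitionSketch")
        val ssc = new StreamingContext(conf, Seconds(60)) // 1-minute batches, as in the thread

        // Hypothetical Kafka connection details; substitute your own.
        val stream =
          KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))

        // Spread the received records across the cluster before any shuffle.
        val repartitioned = stream.repartition(300)

        // An explicit partition count on the key-based operation keeps the
        // shuffle stage (the "combineByKey at ShuffledDStream.scala" stage
        // in the UI) at 300 partitions instead of a small default.
        val counts = repartitioned
          .map { case (_, message) => (message, 1L) }
          .reduceByKey(_ + _, 300)

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

Whether the shuffle stage really runs with 300 partitions can be checked in
the same Spark UI stage view that exposed the bottleneck.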
>>
>>
>> On Wed, Jul 9, 2014 at 5:51 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>>
>>> Bill,
>>>
>>> I haven't worked with Yarn, but I would try adding a repartition()
>>> call after you receive your data from Kafka. I would be surprised if
>>> that didn't help.
>>>
>>>
>>> On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay <bill.jaypeter...@gmail.com> wrote:
>>>
>>>> Hi Tobias,
>>>>
>>>> I was using Spark 0.9 before, and the master I used was
>>>> yarn-standalone. In Spark 1.0, the master is either yarn-cluster or
>>>> yarn-client. I am not sure whether that is the reason why more
>>>> machines do not provide better scalability. What is the difference
>>>> between these two modes in terms of efficiency? Thanks!
>>>>
>>>>
>>>> On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>>>>
>>>>> Bill,
>>>>>
>>>>> do the additional 100 nodes receive any tasks at all? (I don't know
>>>>> which cluster manager you use, but with Mesos you could check the
>>>>> client logs in the web interface.) You might want to try something
>>>>> like repartition(N) or repartition(N*2) (with N the number of your
>>>>> nodes) after you receive your data.
>>>>>
>>>>> Tobias
>>>>>
>>>>>
>>>>> On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay <bill.jaypeter...@gmail.com> wrote:
>>>>>
>>>>>> Hi Tobias,
>>>>>>
>>>>>> Thanks for the suggestion. I have tried adding more nodes, going
>>>>>> from 300 to 400, but the running time did not improve.
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>>>>>>
>>>>>>> Bill,
>>>>>>>
>>>>>>> can't you just add more nodes in order to speed up the processing?
>>>>>>>
>>>>>>> Tobias
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay <bill.jaypeter...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I have a problem using Spark Streaming to accept input data and
>>>>>>>> update a result.
>>>>>>>>
>>>>>>>> The input data come from Kafka, and the output is a map that is
>>>>>>>> updated with historical data and reported every minute. My
>>>>>>>> current method is to set the batch size to 1 minute and use
>>>>>>>> foreachRDD to update this map, outputting the map at the end of
>>>>>>>> the foreachRDD function. However, the current issue is that the
>>>>>>>> processing cannot finish within one minute.
>>>>>>>>
>>>>>>>> I am thinking of updating the map whenever new data arrive
>>>>>>>> instead of doing the update when the whole RDD comes. Is there
>>>>>>>> any idea on how to achieve this with a better running time?
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Bill
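One standard way to realize the incremental update Bill describes in the
last message, not mentioned in the thread itself, is Spark Streaming's
updateStateByKey, which maintains per-key state across batches instead of
rebuilding the map inside foreachRDD. A minimal sketch follows; the socket
input and the per-key Long counters are illustrative assumptions:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    object RunningMapSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("RunningMapSketch")
        val ssc = new StreamingContext(conf, Seconds(60))
        ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") // required by updateStateByKey

        // Illustrative input: one (key, 1) pair per incoming line; in the
        // thread the data would come from Kafka instead.
        val pairs = ssc.socketTextStream("localhost", 9999).map(line => (line, 1L))

        // Fold each batch's new values into the running per-key state, so
        // the result map is maintained incrementally across batches rather
        // than rebuilt inside foreachRDD every minute.
        val runningMap = pairs.updateStateByKey[Long] { (newValues, current) =>
          Some(current.getOrElse(0L) + newValues.sum)
        }

        runningMap.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

Note that updateStateByKey needs a checkpoint directory and touches every
key in the state on each batch, so it trades some fixed per-batch cost for
the incremental updates asked about above.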