Hi Tobias,

I was using Spark 0.9 before, and the master I used was yarn-standalone. In
Spark 1.0, the master is either yarn-cluster or yarn-client. I am not sure
whether this change is the reason why adding more machines does not provide
better scalability. What is the difference between these two modes in terms
of efficiency? Thanks!
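
For reference, this is roughly how I submit the job now (the class and jar
names below are just placeholders):

  ./bin/spark-submit --class com.example.MyStreamingJob --master yarn-cluster myjob.jar
  ./bin/spark-submit --class com.example.MyStreamingJob --master yarn-client myjob.jar

As far as I understand, yarn-cluster runs the driver inside YARN while
yarn-client keeps the driver on the submitting machine, but I do not know
whether that alone explains the difference.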


On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:

> Bill,
>
> do the additional 100 nodes receive any tasks at all? (I don't know which
> cluster you use, but with Mesos you could check client logs in the web
> interface.) You might want to try something like repartition(N) or
> repartition(N*2) (with N the number of your nodes) after you receive your
> data.
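>
> For example, something along these lines (the stream name and node count
> are placeholders, just to illustrate the idea):
>
>   // spread each received batch across roughly 2 partitions per node
>   val repartitioned = kafkaStream.repartition(numNodes * 2)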
>
> Tobias
>
>
> On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay <bill.jaypeter...@gmail.com>
> wrote:
>
>> Hi Tobias,
>>
>> Thanks for the suggestion. I have tried adding more nodes, going from 300
>> to 400, but the running time did not improve.
>>
>>
>> On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>>
>>> Bill,
>>>
>>> can't you just add more nodes in order to speed up the processing?
>>>
>>> Tobias
>>>
>>>
>>> On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay <bill.jaypeter...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have a problem using Spark Streaming to accept input data and
>>>> update a result.
>>>>
>>>> The input data come from Kafka, and the output is a map that is updated
>>>> with historical data every minute. My current method is to set the batch
>>>> interval to 1 minute and use foreachRDD to update this map and output
>>>> the map at the end of the foreachRDD function. However, the issue is
>>>> that the processing cannot finish within one minute.
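>>>>
>>>> Roughly, the current structure looks like this (simplified; the Kafka
>>>> settings and the per-record logic are placeholders):
>>>>
>>>>   import org.apache.spark.SparkConf
>>>>   import org.apache.spark.streaming.{Minutes, StreamingContext}
>>>>   import org.apache.spark.streaming.kafka.KafkaUtils
>>>>
>>>>   val ssc = new StreamingContext(
>>>>     new SparkConf().setAppName("MapUpdate"), Minutes(1))
>>>>   val lines = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group",
>>>>     Map("my-topic" -> 1)).map(_._2)
>>>>
>>>>   var historical = Map.empty[String, Long]  // the map reported every minute
>>>>   lines.foreachRDD { rdd =>
>>>>     // aggregate the batch on the cluster, then merge into the map on the driver
>>>>     val counts = rdd.map(line => (line, 1L)).reduceByKey(_ + _).collect()
>>>>     counts.foreach { case (k, v) =>
>>>>       historical = historical.updated(k, historical.getOrElse(k, 0L) + v)
>>>>     }
>>>>     println(historical)
>>>>   }
>>>>
>>>>   ssc.start()
>>>>   ssc.awaitTermination()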
>>>>
>>>> I am thinking of updating the map whenever new data arrive instead of
>>>> doing the update when the whole RDD arrives. Is there any way to achieve
>>>> this with a better running time? Thanks!
>>>>
>>>> Bill
>>>>
>>>
>>>
>>
>
