Hi Mohit, please make sure you use the "Reply to all" button and include the mailing list, otherwise only I will get your message ;)
Regarding your question: Yes, that's also my understanding. You can
partition streaming RDDs only by time intervals, not by size, so
depending on your incoming rate they may vary. I do not know exactly
what the life cycle of the receiver is, but I don't think anything
actually happens when you create the DStream. My guess would be that
the receiver is allocated when you call StreamingContext#start().
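To make that concrete, here is a minimal, untested sketch; the comments
reflect my guess about the lifecycle, not documented behavior:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("ReceiverLifecycle").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(2)) // 2 s batch interval

// This should only register the input stream in the DStream graph;
// I don't think a connection to localhost:9999 is opened yet.
val lines = ssc.socketTextStream("localhost", 9999)
lines.count().print()

// My guess: only here is the receiver actually scheduled onto an
// executor and started, and data begins to flow.
ssc.start()
ssc.awaitTermination()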
Regards,
Jeff

2015-03-21 21:19 GMT+01:00 Mohit Anchlia <mohitanch...@gmail.com>:

> Could somebody help me understand the question I posted earlier?
>
> On Fri, Mar 20, 2015 at 9:44 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
>
>> Thanks for the pointer. Looking at the description below from the site,
>> it looks like in Spark the block size is not fixed; it is determined by
>> the block interval, and in fact within the same batch you could have
>> blocks of different sizes. Did I get that right?
>>
>> -------------
>> Another parameter that should be considered is the receiver's block
>> interval, which is determined by the configuration parameter
>> spark.streaming.blockInterval
>> <http://spark.apache.org/docs/1.2.1/configuration.html#spark-streaming>.
>> For most receivers, the received data is coalesced together into blocks
>> of data before storing inside Spark's memory. The number of blocks in
>> each batch determines the number of tasks that will be used to process
>> the received data in a map-like transformation. The number of tasks per
>> receiver per batch will be approximately (batch interval / block
>> interval). For example, a block interval of 200 ms will create 10 tasks
>> per 2-second batch. If the number of tasks is too low (that is, less
>> than the number of cores per machine), it will be inefficient, as not
>> all available cores will be used to process the data. To increase the
>> number of tasks for a given batch interval, reduce the block interval.
>> However, the recommended minimum value of the block interval is about
>> 50 ms, below which the task launching overheads may be a problem.
>> --------------
>>
>> Also, I am not clear about the data flow of the receiver. When the
>> client gets a handle to a Spark context and calls something like
>> "val lines = ssc.socketTextStream("localhost", 9999)", is this the
>> point at which the Spark master is contacted to determine which Spark
>> worker node the data is going to go to?
>>
>> On Fri, Mar 20, 2015 at 1:33 AM, Jeffrey Jedele <jeffrey.jed...@gmail.com> wrote:
>>
>>> Hi Mohit,
>>> it also depends on what the source for your streaming application is.
>>>
>>> If you use Kafka, you can easily partition topics and have multiple
>>> receivers on different machines.
>>>
>>> If you have something like an HTTP, socket, etc. stream, you probably
>>> can't do that. However, the Spark RDDs generated by your receiver will
>>> be partitioned and processed in a distributed manner like usual Spark
>>> RDDs. There are parameters to control that behavior (e.g.
>>> defaultParallelism and blockInterval).
>>>
>>> See here for more details:
>>> http://spark.apache.org/docs/1.2.1/streaming-programming-guide.html#performance-tuning
>>>
>>> Regards,
>>> Jeff
>>>
>>> 2015-03-20 8:02 GMT+01:00 Akhil Das <ak...@sigmoidanalytics.com>:
>>>
>>>> 1. If you are consuming data from Kafka or any other receiver-based
>>>> source, then you can start 1-2 receivers per worker (assuming you'll
>>>> have a minimum of 4 cores per worker).
>>>>
>>>> 2. If you have a single receiver, or it is a fileStream, then what
>>>> you can do to distribute the data across machines is a repartition.
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Thu, Mar 19, 2015 at 11:32 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
>>>>
>>>>> I am trying to understand how to load-balance the incoming data
>>>>> across multiple Spark Streaming workers. Could somebody help me
>>>>> understand how I can distribute my incoming data from various sources
>>>>> so that it goes to multiple Spark Streaming nodes? Is it done by the
>>>>> Spark client with the help of the Spark master, similar to a Hadoop
>>>>> client asking the namenode for the list of datanodes?
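PS: To tie the documentation excerpt and Akhil's repartition suggestion
together in code, here is a rough sketch; the 200 ms block interval and
the 8 partitions are made-up example values, not recommendations:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// With a 2 s batch interval and a 200 ms block interval, one receiver
// produces roughly 2000 / 200 = 10 blocks (i.e. ~10 tasks) per batch.
val conf = new SparkConf()
  .setAppName("SingleReceiverRepartition")
  .set("spark.streaming.blockInterval", "200") // in milliseconds

val ssc = new StreamingContext(conf, Seconds(2))
val lines = ssc.socketTextStream("localhost", 9999)

// A single receiver initially stores all blocks on one machine, so
// repartition the stream to spread the work across the cluster before
// the expensive part of the job.
val distributed = lines.repartition(8)
distributed.flatMap(_.split(" ")).countByValue().print()

ssc.start()
ssc.awaitTermination()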
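For the Kafka case with multiple receivers, the pattern I have seen is
to create several input streams and union them; the ZooKeeper quorum,
group id, topic name and receiver count below are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("MultiReceiverKafka")
val ssc = new StreamingContext(conf, Seconds(2))

val topics = Map("events" -> 1) // topic -> consumer threads per receiver
val numReceivers = 4            // e.g. 1-2 per worker, as Akhil suggests

// Each createStream call sets up its own receiver; Spark schedules the
// receivers across the available executors.
val kafkaStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", topics)
}

// Union the streams into one DStream so the rest of the job is unchanged.
val unified = ssc.union(kafkaStreams)
unified.map(_._2).count().print()

ssc.start()
ssc.awaitTermination()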