posting my question again :)

Thanks for the pointer. Looking at the description below from the site, it
looks like in Spark the block size is not fixed: it's determined by the block
interval, and in fact blocks within the same batch could have different
sizes. Did I get it right?

-------------
Another parameter that should be considered is the receiver's block
interval, which is determined by the configuration parameter
spark.streaming.blockInterval
(http://spark.apache.org/docs/1.2.1/configuration.html#spark-streaming).
For most receivers, the received data is coalesced together into blocks of
data before being stored inside Spark's memory. The number of blocks in each
batch determines the number of tasks that will be used to process the
received data in a map-like transformation. The number of tasks per receiver
per batch will be approximately (batch interval / block interval). For
example, a block interval of 200 ms will create 10 tasks per 2-second batch.
If the number of tasks is too low (that is, less than the number of cores
per machine), it will be inefficient because not all available cores will be
used to process the data. To increase the number of tasks for a given batch
interval, reduce the block interval. However, the recommended minimum value
of the block interval is about 50 ms, below which the task launching
overheads may be a problem.
--------------
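
To make the arithmetic concrete, here is a minimal sketch (Scala, against
the 1.2.x API) using the values from the quoted example; the app name is
just illustrative:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // In 1.2.x, spark.streaming.blockInterval is given in milliseconds.
    // 2000 ms batch / 200 ms block ~= 10 blocks, i.e. ~10 map tasks
    // per receiver per batch.
    val conf = new SparkConf()
      .setAppName("BlockIntervalExample")
      .set("spark.streaming.blockInterval", "200")

    val ssc = new StreamingContext(conf, Seconds(2)) // 2 s batch interval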


Also, I am not clear about the data flow of the receiver. When the client
gets a handle to a Spark streaming context and calls something like
"val lines = ssc.socketTextStream("localhost", 9999)", is this the point
when the Spark master is contacted to determine which Spark worker node the
data is going to go to?

On Sun, Mar 22, 2015 at 2:10 AM, Jeffrey Jedele <jeffrey.jed...@gmail.com>
wrote:

> Hi Mohit,
> please make sure you use the "Reply to all" button and include the mailing
> list, otherwise only I will get your message ;)
>
> Regarding your question:
> Yes, that's also my understanding. You can partition streaming RDDs only
> by time intervals, not by size. So depending on your incoming rate, their
> sizes may vary.
>
> I do not know exactly what the life cycle of the receiver is, but I don't
> think anything actually happens when you create the DStream. My guess would
> be that the receiver is allocated when you call StreamingContext#start().
>
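> A minimal sketch of that lifecycle, assuming the 1.2.x API (the app name
> is just illustrative):
>
>     import org.apache.spark.SparkConf
>     import org.apache.spark.streaming.{Seconds, StreamingContext}
>
>     val conf = new SparkConf().setAppName("ReceiverLifecycle")
>     val ssc = new StreamingContext(conf, Seconds(2))
>
>     // This only registers the DStream in the graph; no receiver runs yet.
>     val lines = ssc.socketTextStream("localhost", 9999)
>     lines.print()
>
>     // Only here is the receiver shipped to an executor as a
>     // long-running task.
>     ssc.start()
>     ssc.awaitTermination()
>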
> Regards,
> Jeff
>
> 2015-03-21 21:19 GMT+01:00 Mohit Anchlia <mohitanch...@gmail.com>:
>
>> Could somebody help me understand the question I posted earlier?
>>
>> On Fri, Mar 20, 2015 at 9:44 AM, Mohit Anchlia <mohitanch...@gmail.com>
>> wrote:
>>
>>>
>>> On Fri, Mar 20, 2015 at 1:33 AM, Jeffrey Jedele <
>>> jeffrey.jed...@gmail.com> wrote:
>>>
>>>> Hi Mohit,
>>>> it also depends on what the source for your streaming application is.
>>>>
>>>> If you use Kafka, you can easily partition topics and have multiple
>>>> receivers on different machines.
>>>>
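>>>> A rough sketch of that multiple-receiver pattern (Spark 1.2.x Kafka API;
>>>> the ZooKeeper address, group id, and topic name are made up):
>>>>
>>>>     import org.apache.spark.streaming.kafka.KafkaUtils
>>>>
>>>>     // Each createStream call allocates its own receiver; Spark spreads
>>>>     // them across the available executors.
>>>>     val streams = (1 to 4).map { _ =>
>>>>       KafkaUtils.createStream(ssc, "zk1:2181", "my-group",
>>>>         Map("my-topic" -> 1))
>>>>     }
>>>>     // Union the partial streams into a single DStream for processing.
>>>>     val unified = ssc.union(streams)
>>>>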
>>>> If you have something like an HTTP or socket stream, you probably can't
>>>> do that. However, the Spark RDDs generated by your receiver will still
>>>> be partitioned and processed in a distributed manner like usual Spark
>>>> RDDs. There are parameters to control that behavior (e.g.
>>>> defaultParallelism and blockInterval).
>>>>
>>>> See here for more details:
>>>>
>>>> http://spark.apache.org/docs/1.2.1/streaming-programming-guide.html#performance-tuning
>>>>
>>>> Regards,
>>>> Jeff
>>>>
>>>> 2015-03-20 8:02 GMT+01:00 Akhil Das <ak...@sigmoidanalytics.com>:
>>>>
>>>>> 1. If you are consuming data from Kafka or any other receiver-based
>>>>> source, then you can start 1-2 receivers per worker (assuming you'll
>>>>> have a minimum of 4 cores per worker).
>>>>>
>>>>> 2. If you have a single receiver or a fileStream, then what you can do
>>>>> to distribute the data across machines is a repartition.
>>>>>
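>>>>> Something like this, assuming lines is your single input DStream; the
>>>>> partition count is a made-up value you'd match to your total core count:
>>>>>
>>>>>     // Redistribute each batch's blocks across the cluster before the
>>>>>     // heavy processing starts.
>>>>>     val repartitioned = lines.repartition(8)
>>>>>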
>>>>> Thanks
>>>>> Best Regards
>>>>>
>>>>> On Thu, Mar 19, 2015 at 11:32 PM, Mohit Anchlia <
>>>>> mohitanch...@gmail.com> wrote:
>>>>>
>>>>>> I am trying to understand how to load balance the incoming data across
>>>>>> multiple Spark Streaming workers. Could somebody help me understand how
>>>>>> I can distribute my incoming data from various sources such that it
>>>>>> goes to multiple Spark Streaming nodes? Is it done by the Spark client
>>>>>> with the help of the Spark master, similar to a Hadoop client asking
>>>>>> the namenode for the list of datanodes?
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
