Hi Mohit,
It also depends on what the source of your streaming application is.

If you use Kafka, you can easily partition topics and have multiple
receivers on different machines.
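As a rough sketch with the receiver-based Kafka API (the ZooKeeper
quorum, consumer group, and topic name below are placeholders), you
create one input stream per receiver and union them:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("KafkaExample")
    val ssc = new StreamingContext(conf, Seconds(5))

    // One receiver per stream; Spark schedules them across executors.
    // "zk1:2181", "my-group" and "mytopic" are placeholders.
    val numReceivers = 4
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "my-group",
        Map("mytopic" -> 1))
    }

    // Union the per-receiver streams into a single DStream.
    val unified = ssc.union(streams)
    unified.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()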

If you have something like an HTTP or socket stream, you probably can't do
that. However, the RDDs generated by your receiver will still be
partitioned and processed in a distributed manner, like usual Spark RDDs.
There are parameters to control that behavior (e.g.
spark.default.parallelism and spark.streaming.blockInterval).
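For example (illustrative values only), both can be set via SparkConf:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("TunedStreaming")
      // Each receiver cuts a block every blockInterval ms; blocks become
      // the partitions of the batch RDD, so a 2000 ms batch with a 200 ms
      // block interval yields ~10 partitions per receiver.
      .set("spark.streaming.blockInterval", "200")  // milliseconds
      // Default partition count for shuffles and distributed operations.
      .set("spark.default.parallelism", "8")        // example value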

See here for more details:
http://spark.apache.org/docs/1.2.1/streaming-programming-guide.html#performance-tuning

Regards,
Jeff

2015-03-20 8:02 GMT+01:00 Akhil Das <ak...@sigmoidanalytics.com>:

> 1. If you are consuming data from Kafka or any other receiver-based
> source, you can start 1-2 receivers per worker (assuming you have at
> least 4 cores per worker).
>
> 2. If you have a single receiver or a fileStream, then what you can do
> to distribute the data across machines is a repartition.
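>
> A minimal sketch (host, port, and partition count are placeholders):
>
>     val lines = ssc.socketTextStream("somehost", 9999)
>     // Redistribute the single receiver's blocks across 8 partitions
>     // before further processing.
>     val distributed = lines.repartition(8)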
>
> Thanks
> Best Regards
>
> On Thu, Mar 19, 2015 at 11:32 PM, Mohit Anchlia <mohitanch...@gmail.com>
> wrote:
>
>> I am trying to understand how to load balance incoming data across
>> multiple Spark Streaming workers. Could somebody help me understand how
>> I can distribute my incoming data from various sources so that it goes
>> to multiple Spark Streaming nodes? Is it done by the Spark client with
>> the help of the Spark master, similar to a Hadoop client asking the
>> NameNode for the list of DataNodes?
>>
>
>
