Hi Mohit, please make sure you use the "Reply to all" button and include the mailing list, otherwise only I will get your message ;)
Regarding your question: Yes, that's also my understanding. You can
partition streaming RDDs only by time intervals, not by size, so
depending on your incoming rate they may vary. I do not know exactly
what the life cycle of the receiver is, but I don't think anything
actually happens when you create the DStream. My guess would be that
the receiver is allocated when you call StreamingContext#start().
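To make that concrete, here is a minimal, untested sketch; the comments
reflect my guess about the lifecycle, not documented behavior:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("ReceiverLifecycle").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(2)) // 2 s batch interval

// This should only register the input stream in the DStream graph;
// I don't think a connection to localhost:9999 is opened yet.
val lines = ssc.socketTextStream("localhost", 9999)
lines.count().print()

// My guess: only here is the receiver actually scheduled onto an
// executor and started, and data begins to flow.
ssc.start()
ssc.awaitTermination()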
Regards,
Jeff

2015-03-21 21:19 GMT+01:00 Mohit Anchlia <mohitanch...@gmail.com>:

> Could somebody help me understand the question I posted earlier?
>
> On Fri, Mar 20, 2015 at 9:44 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
>
>> Thanks for the pointer. Looking at the description below from the site,
>> it looks like in Spark the block size is not fixed; it is determined by
>> the block interval, and in fact within the same batch you could have
>> blocks of different sizes. Did I get that right?
>>
>> -------------
>> Another parameter that should be considered is the receiver's block
>> interval, which is determined by the configuration parameter
>> spark.streaming.blockInterval
>> <http://spark.apache.org/docs/1.2.1/configuration.html#spark-streaming>.
>> For most receivers, the received data is coalesced together into blocks
>> of data before storing inside Spark's memory. The number of blocks in
>> each batch determines the number of tasks that will be used to process
>> the received data in a map-like transformation. The number of tasks per
>> receiver per batch will be approximately (batch interval / block
>> interval). For example, a block interval of 200 ms will create 10 tasks
>> per 2-second batch. If the number of tasks is too low (that is, less
>> than the number of cores per machine), it will be inefficient, as not
>> all available cores will be used to process the data. To increase the
>> number of tasks for a given batch interval, reduce the block interval.
>> However, the recommended minimum value of the block interval is about
>> 50 ms, below which the task launching overheads may be a problem.
>> --------------
>>
>> Also, I am not clear about the data flow of the receiver. When the
>> client gets a handle to a Spark context and calls something like
>> "val lines = ssc.socketTextStream("localhost", 9999)", is this the
>> point at which the Spark master is contacted to determine which Spark
>> worker node the data is going to go to?
>>
>> On Fri, Mar 20, 2015 at 1:33 AM, Jeffrey Jedele <jeffrey.jed...@gmail.com> wrote:
>>
>>> Hi Mohit,
>>> it also depends on what the source for your streaming application is.
>>>
>>> If you use Kafka, you can easily partition topics and have multiple
>>> receivers on different machines.
>>>
>>> If you have something like an HTTP, socket, etc. stream, you probably
>>> can't do that. However, the Spark RDDs generated by your receiver will
>>> be partitioned and processed in a distributed manner like usual Spark
>>> RDDs. There are parameters to control that behavior (e.g.
>>> defaultParallelism and blockInterval).
>>>
>>> See here for more details:
>>> http://spark.apache.org/docs/1.2.1/streaming-programming-guide.html#performance-tuning
>>>
>>> Regards,
>>> Jeff
>>>
>>> 2015-03-20 8:02 GMT+01:00 Akhil Das <ak...@sigmoidanalytics.com>:
>>>
>>>> 1. If you are consuming data from Kafka or any other receiver-based
>>>> source, then you can start 1-2 receivers per worker (assuming you'll
>>>> have a minimum of 4 cores per worker).
>>>>
>>>> 2. If you have a single receiver, or it is a fileStream, then what
>>>> you can do to distribute the data across machines is a repartition.
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Thu, Mar 19, 2015 at 11:32 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
>>>>
>>>>> I am trying to understand how to load-balance the incoming data
>>>>> across multiple Spark Streaming workers. Could somebody help me
>>>>> understand how I can distribute my incoming data from various sources
>>>>> so that it goes to multiple Spark Streaming nodes? Is it done by the
>>>>> Spark client with the help of the Spark master, similar to a Hadoop
>>>>> client asking the namenode for the list of datanodes?
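PS: To tie the documentation excerpt and Akhil's repartition suggestion
together in code, here is a rough sketch; the 200 ms block interval and
the 8 partitions are made-up example values, not recommendations:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// With a 2 s batch interval and a 200 ms block interval, one receiver
// produces roughly 2000 / 200 = 10 blocks (i.e. ~10 tasks) per batch.
val conf = new SparkConf()
  .setAppName("SingleReceiverRepartition")
  .set("spark.streaming.blockInterval", "200") // in milliseconds

val ssc = new StreamingContext(conf, Seconds(2))
val lines = ssc.socketTextStream("localhost", 9999)

// A single receiver initially stores all blocks on one machine, so
// repartition the stream to spread the work across the cluster before
// the expensive part of the job.
val distributed = lines.repartition(8)
distributed.flatMap(_.split(" ")).countByValue().print()

ssc.start()
ssc.awaitTermination()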
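For the Kafka case with multiple receivers, the pattern I have seen is
to create several input streams and union them; the ZooKeeper quorum,
group id, topic name and receiver count below are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("MultiReceiverKafka")
val ssc = new StreamingContext(conf, Seconds(2))

val topics = Map("events" -> 1) // topic -> consumer threads per receiver
val numReceivers = 4            // e.g. 1-2 per worker, as Akhil suggests

// Each createStream call sets up its own receiver; Spark schedules the
// receivers across the available executors.
val kafkaStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", topics)
}

// Union the streams into one DStream so the rest of the job is unchanged.
val unified = ssc.union(kafkaStreams)
unified.map(_._2).count().print()

ssc.start()
ssc.awaitTermination()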