Re: Receiver and Parallelization

2015-09-25 Thread Adrian Tanase
Good catch, I was not aware of this setting. I’m wondering, though, whether it also triggers a shuffle, so that you’re not gated by the number of cores on one machine, or whether the data is still processed on the node where it was ingested. -adrian On 9/25/15, 5:27 PM, "Silvio Fiorito"

Receiver and Parallelization

2015-09-25 Thread nibiau
Hello, I use a custom receiver to receive JMS messages from MQ servers. I want to take advantage of a YARN cluster; my questions are: - Is it possible to have only one node receiving JMS messages and parallelize the RDD over all the cluster nodes? - Is it possible to parallelize also the

Re: Receiver and Parallelization

2015-09-25 Thread Adrian Tanase
1) Yes, just use .repartition on the inbound stream; this will shuffle the data across your whole cluster and process it in parallel as specified. 2) Yes, although I’m not sure how to do it for a totally custom receiver. Does this help as a starting point?
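
A minimal sketch of point 1, for illustration only. It assumes a DStream-like object exposing repartition(numPartitions) and map(f), as the PySpark DStream API does; the helper name process_in_parallel and the handle_record callback are hypothetical and not from this thread. The custom JMS receiver itself would live on the JVM side and is not shown.

```python
def process_in_parallel(stream, num_partitions, handle_record):
    """Shuffle each received batch across the cluster, then process it.

    stream         -- a DStream-like object (assumed to provide
                      .repartition(n) and .map(f))
    num_partitions -- how many partitions each batch is spread over
    handle_record  -- hypothetical per-record processing function
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # repartition() redistributes the records that arrived on the single
    # receiving node; the map() then runs on whichever executors hold
    # the resulting partitions.
    return stream.repartition(num_partitions).map(handle_record)
```

The repartition step costs a shuffle on every batch, so it is likely worth it only when per-record processing is expensive relative to moving the data.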

Re: Receiver and Parallelization

2015-09-25 Thread Silvio Fiorito
Two things you should look at are your batch duration and spark.streaming.blockInterval. Those two settings control how many partitions are generated for each RDD (batch) of the DStream when using a receiver (vs. the direct approach). So if you have a 2 second batch duration and the default blockInterval
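
The relationship between batch duration and block interval described above can be sketched as simple arithmetic. With a receiver, each block interval yields one block, and each block becomes one partition of the batch RDD. The function name is hypothetical; the 200 ms value is the default that the Spark configuration documentation gives for spark.streaming.blockInterval, and the sketch assumes a single receiver.

```python
def partitions_per_batch(batch_duration_ms, block_interval_ms=200):
    """Approximate partitions per batch RDD for one receiver.

    One block is cut every block_interval_ms, and every block received
    during a batch becomes one partition of that batch's RDD.
    """
    if batch_duration_ms <= 0 or block_interval_ms <= 0:
        raise ValueError("durations must be positive")
    return batch_duration_ms // block_interval_ms


# A 2 second batch with the assumed 200 ms default block interval.
print(partitions_per_batch(2000))  # prints 10
```

Lowering the block interval raises the partition count without a shuffle, whereas .repartition changes it after the data has been received.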