Good catch, I was not aware of this setting.
I’m wondering though if it also generates a shuffle or if the data is still
processed by the node on which it’s ingested - so that you’re not gated by the
number of cores on one machine.
-adrian
On 9/25/15, 5:27 PM, "Silvio Fiorito"
wrote:
Hello,
I used a custom receiver in order to receive JMS messages from MQ servers.
I want to benefit from the YARN cluster; my questions are:
- Is it possible to have only one node receiving JMS messages and parallelize
the RDD over all the cluster nodes?
- Is it possible to parallelize also the
1) Yes, just use .repartition on the inbound stream; this will shuffle the data
across your whole cluster and process it in parallel at whatever parallelism
you specify.
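To make the effect of .repartition concrete, here is a plain-Python sketch (not the Spark API) of the idea: blocks that were all received on a single node get redistributed across N partitions, so every executor gets a share of the work. Spark's shuffle partitions by hash rather than round-robin, but the resulting even spread is the point.

```python
# Illustration only: simulate how repartition(n) spreads blocks that
# were all ingested by a single receiver node across n partitions.
def repartition(blocks, n):
    """Round-robin redistribution of blocks into n partitions.
    (Spark's shuffle hashes keys instead, but the effect -- an even
    spread across the cluster -- is the same idea.)"""
    partitions = [[] for _ in range(n)]
    for i, block in enumerate(blocks):
        partitions[i % n].append(block)
    return partitions

# 10 blocks received on one node, spread over 4 partitions
parts = repartition([f"msg-{i}" for i in range(10)], 4)
```

In actual Spark code this would just be `stream.repartition(4)` on the inbound DStream before the processing stages.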
2) Yes, although I’m not sure how to do it for a totally custom receiver. Does
this help as a starting point?
One thing you should look at is your batch duration and
spark.streaming.blockInterval.
Those two settings control how many partitions are generated for each RDD
(batch) of the DStream when using a receiver (vs. the direct approach).
So if you have a 2 second batch duration and the default blockInterval