You can call .repartition on the DStream created by the Kafka direct consumer. You take the one-time hit of a shuffle per batch, but gain the ability to scale processing out beyond your number of Kafka partitions.
We’re doing this to scale up from 36 partitions per topic to 140 partitions (20 cores * 7 nodes), and it works great.

-adrian

From: varun sharma
Date: Thursday, October 29, 2015 at 8:27 AM
To: user
Subject: Need more tasks in KafkaDirectStream

Right now, there is a one-to-one correspondence between Kafka partitions and Spark partitions. I don't have a requirement for one-to-one semantics; I need more tasks to be generated in the job so that it can be parallelised and the batch can complete faster. In the previous receiver-based approach, the number of tasks created was independent of the Kafka partitions, and I need something like that. Is there any config available if I don't need one-to-one semantics? Is there any way I can repartition without incurring additional cost?

Thanks
VARUN SHARMA
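
[Editor's note: below is a minimal sketch of the repartition approach Adrian describes, assuming the Spark 1.x direct Kafka API (spark-streaming-kafka with StringDecoder); the broker address, topic name, partition counts, and the process() helper are illustrative placeholders, not from the original thread.]

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object KafkaDirectRepartition {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("KafkaDirectRepartition")
        val ssc = new StreamingContext(conf, Seconds(10))

        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
        val topics = Set("my-topic")

        // Direct stream: one Spark partition per Kafka partition (e.g. 36)
        val directStream = KafkaUtils
          .createDirectStream[String, String, StringDecoder, StringDecoder](
            ssc, kafkaParams, topics)

        // Shuffle once per batch to spread records across more tasks
        // (e.g. 140 = 20 cores * 7 nodes, as in the setup above)
        val repartitioned = directStream.map(_._2).repartition(140)

        repartitioned.foreachRDD { rdd =>
          // Downstream processing now runs with 140 tasks per batch
          rdd.foreach(record => process(record))
        }

        ssc.start()
        ssc.awaitTermination()
      }

      // Placeholder for application logic
      def process(record: String): Unit = ()
    }

One caveat worth noting: the shuffle breaks the one-to-one mapping with Kafka partitions, so if you track offsets manually via HasOffsetRanges, you must capture them from the original direct stream before calling repartition.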