I have a slightly different understanding.

The direct stream generates one RDD per batch; however, the number of
partitions in that RDD equals the number of partitions in the Kafka topic.
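To make the distinction concrete, here is a pure-Python sketch (no Spark
required) of how the direct approach plans a batch: it computes one offset
range per Kafka partition, and each range becomes one partition of that
batch's single RDD. The `OffsetRange` name mirrors Spark's API, but
`plan_batch` and the offset dictionaries are hypothetical illustrations,
not the real implementation.

```python
from collections import namedtuple

# For each Kafka partition, the direct stream tracks the span of offsets
# to consume in the current batch.
OffsetRange = namedtuple(
    "OffsetRange", ["topic", "partition", "from_offset", "until_offset"]
)

def plan_batch(topic, consumed_offsets, latest_offsets):
    """Return one OffsetRange per Kafka partition; each range maps to
    exactly one partition of the batch's RDD."""
    return [
        OffsetRange(topic, p, consumed_offsets[p], latest_offsets[p])
        for p in sorted(latest_offsets)
    ]

# A topic with 4 partitions yields an RDD with 4 partitions per batch,
# regardless of how many messages arrived in each partition.
consumed = {0: 100, 1: 250, 2: 100, 3: 0}
latest   = {0: 180, 1: 250, 2: 400, 3: 7}
ranges = plan_batch("events", consumed, latest)
print(len(ranges))  # 4: one RDD partition per Kafka partition
```

So there is always exactly one RDD per batch interval; it is the RDD's
partition count, not the RDD count, that tracks the topic's partition count.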

On Wed, Feb 17, 2016 at 12:18 PM, Cyril Scetbon <cyril.scet...@free.fr>
wrote:

> Hi guys,
>
> I'm making some tests with Spark and Kafka using a Python script. I use
> the second method that doesn't need any receiver (Direct Approach). It
> should adapt the number of RDDs to the number of partitions in the topic.
> I'm trying to verify this. What's the easiest way to do so? I also tried
> to co-locate Yarn, Spark and Kafka to check if RDDs are created depending
> on the leaders of partitions in a topic, and they are not. Can you confirm
> that RDDs are not created depending on the location of partitions and that
> co-locating Kafka with Spark is not a must-have or that Spark does not take
> advantage of it?
>
> As the parallelism is simplified (by creating as many RDDs as there are
> partitions) I suppose that the biggest part of the tuning is playing with
> Kafka partitions (not talking about network configuration or management of
> Spark resources)?
>
> Thank you
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards,
Ayan Guha
