I have a slightly different understanding: the direct stream generates one RDD per batch; however, the number of partitions in that RDD equals the number of partitions in the Kafka topic.
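One way to check this yourself is to print the partition count of each batch's RDD from inside `foreachRDD`. Below is a minimal sketch (the broker address and topic name are placeholders, and it assumes the `pyspark.streaming.kafka` module from Spark 1.3+ is on the path); with the direct approach, the printed count should match the topic's partition count.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="DirectKafkaPartitionCheck")
ssc = StreamingContext(sc, 10)  # 10-second batches

# Hypothetical broker and topic names -- replace with your own.
stream = KafkaUtils.createDirectStream(
    ssc, ["my-topic"], {"metadata.broker.list": "broker1:9092"})

def report(rdd):
    # With the direct approach, this should equal the number of
    # partitions in the Kafka topic.
    print("partitions in this batch's RDD: %d" % rdd.getNumPartitions())

stream.foreachRDD(report)
ssc.start()
ssc.awaitTermination()
```

Note this needs a running Kafka broker to actually produce output; it is only a sketch of where to hook in the check, not something runnable standalone.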
On Wed, Feb 17, 2016 at 12:18 PM, Cyril Scetbon <cyril.scet...@free.fr> wrote:
> Hi guys,
>
> I'm making some tests with Spark and Kafka using a Python script. I use
> the second method, which doesn't need any receiver (the Direct Approach).
> It should adapt the number of RDDs to the number of partitions in the
> topic, and I'm trying to verify that. What's the easiest way to verify it?
> I also tried co-locating YARN, Spark, and Kafka to check whether RDDs are
> created depending on the leaders of partitions in a topic, and they are
> not. Can you confirm that RDDs are not created depending on the location
> of partitions, and that co-locating Kafka with Spark is not a must-have,
> or that Spark does not take advantage of it?
>
> As the parallelism is simplified (by creating as many RDDs as there are
> partitions), I suppose the biggest part of the tuning is playing with
> Kafka partitions (not talking about network configuration or management
> of Spark resources)?
>
> Thank you
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org

--
Best Regards,
Ayan Guha