Hi, you can always use the RDD's properties, which already include the partition information.
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html

On Wed, Feb 17, 2016 at 2:36 PM, Cyril Scetbon <cyril.scet...@free.fr> wrote:

> Your understanding is the right one (having re-read the documentation).
> I'm still wondering how I can verify that 5 partitions have been created.
> My job reads from a Kafka topic that has 5 partitions and sends the data
> to E/S. I can see that when there is one task reading from Kafka, there
> are 5 tasks writing to E/S. So I'm supposing that the task reading from
> Kafka does it in parallel using 5 partitions, and that's why there are
> then 5 tasks writing to E/S. But I'm only supposing ...
>
> On Feb 16, 2016, at 21:12, ayan guha <guha.a...@gmail.com> wrote:
>
> I have a slightly different understanding.
>
> The direct stream generates 1 RDD per batch; however, the number of
> partitions in that RDD equals the number of partitions in the Kafka
> topic.
>
> On Wed, Feb 17, 2016 at 12:18 PM, Cyril Scetbon <cyril.scet...@free.fr>
> wrote:
>
>> Hi guys,
>>
>> I'm running some tests with Spark and Kafka using a Python script. I use
>> the second method, which doesn't need any receiver (the Direct
>> Approach). It should adapt the number of RDDs to the number of
>> partitions in the topic, and I'm trying to verify that. What's the
>> easiest way to verify it? I also tried to co-locate Yarn, Spark and
>> Kafka to check whether RDDs are created depending on the leaders of the
>> partitions in a topic, and they are not. Can you confirm that RDDs are
>> not created depending on the location of partitions, and that
>> co-locating Kafka with Spark is not a must-have, or that Spark does not
>> take advantage of it?
>>
>> As the parallelism is simplified (by creating as many RDDs as there are
>> partitions), I suppose that the biggest part of the tuning is playing
>> with Kafka partitions (not talking about network configuration or
>> management of Spark resources)?
>>
>> Thank you
>
> --
> Best Regards,
> Ayan Guha

--
Best Regards,
Ayan Guha
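To verify the behaviour discussed above, a sketch along these lines could log each batch RDD's partition count from the direct stream. The broker address, topic name, and batch interval are assumptions; the driver wiring is left in a `main()` you would call on a real cluster:

```python
def log_partition_count(time, rdd):
    # In the direct approach there is one RDD per batch; its partition
    # count should match the number of partitions of the Kafka topic
    # (5 in the example discussed in this thread).
    n = rdd.getNumPartitions()
    print("batch %s -> %d partitions" % (time, n))
    return n

def main():
    # Driver wiring: broker address, topic name, and batch interval are
    # assumptions -- adjust for your setup, then call main() on a cluster.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="direct-stream-partition-check")
    ssc = StreamingContext(sc, 10)  # 10-second batches (assumption)
    stream = KafkaUtils.createDirectStream(
        ssc, ["mytopic"], {"metadata.broker.list": "broker1:9092"})
    stream.foreachRDD(log_partition_count)
    ssc.start()
    ssc.awaitTermination()
```

The helper is plain Python, so it can be exercised without a cluster by passing any object exposing `getNumPartitions()`.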