Hi all, I'm running some tests with Spark Streaming and Kafka from a Python script. I use the second integration method, the receiver-less Direct Approach. It is supposed to create batch RDDs whose number of partitions matches the number of partitions in the Kafka topic, and I'm trying to verify that. What's the easiest way to verify it? I also tried co-locating YARN, Spark, and Kafka to check whether RDD partitions are placed according to the leaders of the topic's partitions, and they are not. Can you confirm that RDD partition placement does not depend on the location of the Kafka partitions, i.e. that co-locating Kafka with Spark is not a must-have, or at least that Spark does not take advantage of it?
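For what it's worth, here is a sketch of how I understand the verification could be done with the direct approach: each batch RDD from `KafkaUtils.createDirectStream` exposes `offsetRanges()`, one range per Kafka partition, so comparing that list against `rdd.getNumPartitions()` should show the 1:1 mapping. The broker address, topic name, and batch interval below are placeholder assumptions, not values from my setup:

```python
# Sketch: checking that the direct (receiver-less) Kafka stream yields
# one RDD partition per Kafka topic partition.
# "localhost:9092" and "my_topic" are hypothetical placeholders.

def summarize(offset_ranges):
    """Summarize one batch's offset ranges as (partition_ids, message_count).

    Each element is expected to look like Spark's OffsetRange, i.e. it has
    .partition, .fromOffset and .untilOffset attributes.
    """
    ids = sorted(o.partition for o in offset_ranges)
    count = sum(o.untilOffset - o.fromOffset for o in offset_ranges)
    return ids, count


def main():
    # Requires a live Spark + Kafka setup; call main() explicitly to run.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="direct-kafka-check")
    ssc = StreamingContext(sc, 5)  # 5-second batches (arbitrary choice)

    stream = KafkaUtils.createDirectStream(
        ssc, ["my_topic"], {"metadata.broker.list": "localhost:9092"})

    def report(rdd):
        # With the direct approach each RDD partition should map to
        # exactly one Kafka partition, so these two counts should match.
        ids, count = summarize(rdd.offsetRanges())
        print("Kafka partitions:", ids,
              "| RDD partitions:", rdd.getNumPartitions(),
              "| messages in batch:", count)

    stream.foreachRDD(report)
    ssc.start()
    ssc.awaitTermination()
```

Running this against a topic and repartitioning the topic between batches should make the change visible in the printed counts.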
Since the parallelism is determined directly (one RDD partition per Kafka partition), I suppose the biggest part of the tuning comes down to the number of Kafka partitions (leaving aside network configuration and the management of Spark resources)? Thank you