In general you cannot guarantee which node a given RDD partition will be
processed on.

The preferred location for a KafkaRDD partition is the Kafka leader for that
partition, if the brokers and Spark executors are deployed on the same
machines. If you want to override that behavior, the method to override is
getPreferredLocations.
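As a sketch of what that override can look like: the wrapper class below
(PinnedRDD and its hostsFor function are illustrative names, not part of
Spark's API) delegates everything to a parent RDD and only replaces the
location hint per partition index.

```scala
import scala.reflect.ClassTag

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper RDD that pins each partition to caller-chosen hosts.
// Partitioning and computation are delegated to the parent; only
// getPreferredLocations changes, and the scheduler treats it as a hint.
class PinnedRDD[T: ClassTag](parent: RDD[T], hostsFor: Int => Seq[String])
    extends RDD[T](parent) {

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    firstParent[T].iterator(split, context)

  override def getPreferredLocations(split: Partition): Seq[String] =
    hostsFor(split.index)
}
```

You could apply it to a direct stream with something like
stream.transform(rdd => new PinnedRDD(rdd, partitionToHosts)), where
partitionToHosts is your own mapping; whether a task actually runs on the
pinned host still depends on executor availability and spark.locality.wait.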

But even in that case, location preferences are just a scheduler hint; the
RDD can still be scheduled elsewhere. You can turn spark.locality.wait up
to a very high value to decrease the likelihood of that happening.
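For instance, a minimal sketch of raising that setting when building the
SparkConf (the 30s value and app name are arbitrary placeholders, not
recommendations; the default wait is 3s):

```scala
import org.apache.spark.SparkConf

// Raise the scheduler's locality wait so it holds a task longer for its
// preferred host before falling back to a non-local executor.
val conf = new SparkConf()
  .setAppName("kafka-locality-example") // illustrative app name
  .set("spark.locality.wait", "30s")    // default is 3s
```

The trade-off is that a long wait can leave cores idle when the preferred
host is busy, so this tunes likelihood, not a guarantee.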



On Thu, Aug 20, 2015 at 5:47 PM, nehalsyed <nehal_s...@cable.comcast.com>
wrote:

> I have data in a Kafka topic-partition and I am reading it from Spark like
> this:
>
> JavaPairReceiverInputDStream<String, String> directKafkaStream =
>     KafkaUtils.createDirectStream(streamingContext, [key class],
>         [value class], [key decoder class], [value decoder class],
>         [map of Kafka parameters], [set of topics to consume]);
>
> I want messages from a Kafka partition to always land on the same machine
> in the Spark RDD, so I can cache some decoration data locally and later
> reuse it with other messages (that belong to the same key). Can anyone
> tell me how I can achieve this? Thanks
> ------------------------------
> View this message in context: Kafka Spark Partition Mapping
> <http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-Partition-Mapping-tp24372.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>
