Dear Cody,
Thanks for your response. I am trying to do decoration, which means that when a 
message comes from Kafka (partitioned by key) into Spark, I want to add more 
fields/data to it.
How do people normally do this in Spark? If it were you, how would you decorate 
messages without hitting the database for every message?

Our current strategy is that decoration data comes from a local in-memory cache 
(Guava LoadingCache) and/or from a SQL DB if it is not in the cache. If we take 
this approach, we want the cached decoration data to be available locally to the 
RDDs most of the time.
Our Kafka and Spark run on separate machines, and that is why I want each Kafka 
partition to go to the same Spark RDD partition most of the time, so I can 
utilize the cached decoration data.
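
For concreteness, here is a minimal sketch of the cache-then-DB lookup described 
above. The class name, the JDBC URL, and the query are placeholders of mine, not 
something from this thread; it only illustrates the Guava LoadingCache pattern 
with a SQL fallback on a miss:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.concurrent.TimeUnit;

    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.CacheLoader;
    import com.google.common.cache.LoadingCache;

    // Hypothetical executor-local cache: decoration data is served from memory,
    // and only a cache miss falls through to the SQL DB.
    public final class DecorationCache {

        private static final LoadingCache<String, String> CACHE = CacheBuilder.newBuilder()
            .maximumSize(100_000)
            .expireAfterWrite(10, TimeUnit.MINUTES)
            .build(new CacheLoader<String, String>() {
                @Override
                public String load(String key) throws Exception {
                    // DB lookup happens only on a cache miss.
                    try (Connection conn = DriverManager.getConnection("jdbc:<your-db-url>");
                         PreparedStatement ps = conn.prepareStatement(
                             "SELECT decoration FROM decorations WHERE msg_key = ?")) {
                        ps.setString(1, key);
                        try (ResultSet rs = ps.executeQuery()) {
                            return rs.next() ? rs.getString(1) : "";
                        }
                    }
                }
            });

        private DecorationCache() {}

        public static String decorationFor(String key) throws Exception {
            return CACHE.get(key);
        }
    }

Each executor JVM holds its own copy of this static cache, so calling 
DecorationCache.decorationFor(key) from a mapPartitions on the stream only pays 
the DB cost when that executor has not seen the key before; that is why keeping 
a key's messages on the same executor matters to us.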

Do you think that if I create a JdbcRDD for the decoration data and join it with 
the JavaPairReceiverInputDStream, the result will always stay where the JdbcRDD lives?
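
For what it's worth, a join like that would typically be expressed through 
transform, along the lines of the sketch below; the names (messages, 
buildDecorationRdd) are placeholders of mine. One caveat: the join introduces a 
shuffle, so the result is not guaranteed to stay on the nodes where the JdbcRDD 
partitions live.

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import scala.Tuple2;

    // decorationRdd: keyed decoration data, e.g. built from a JdbcRDD and cached.
    JavaPairRDD<String, String> decorationRdd = buildDecorationRdd().cache();

    // messages: the Kafka stream, keyed the same way as decorationRdd.
    JavaPairDStream<String, Tuple2<String, String>> decorated =
        messages.transformToPair(rdd -> rdd.join(decorationRdd));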

Nehal

From: Cody Koeninger <c...@koeninger.org>
Date: Thursday, August 20, 2015 at 6:33 PM
To: Microsoft Office User <nehal_s...@cable.comcast.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Kafka Spark Partition Mapping

In general you cannot guarantee which node an RDD will be processed on.

The preferred location for a KafkaRDD partition is the Kafka leader for that 
partition, if they're deployed on the same machines. If you want to try to 
override that behavior, the method is getPreferredLocations.

But even in that case, location preferences are just a scheduler hint; the RDD 
can still be scheduled elsewhere. You can turn up spark.locality.wait to a very 
high value to decrease the likelihood.
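
As an illustration only, turning that knob is just a configuration setting; the 
app name and the 30-second value below are arbitrary choices for the sketch, not 
recommendations from this thread:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    // Wait longer for a task's preferred location before scheduling it elsewhere.
    // This lowers, but does not remove, the chance of a partition moving nodes.
    SparkConf conf = new SparkConf()
        .setAppName("kafka-decoration")
        .set("spark.locality.wait", "30000");   // milliseconds; default is 3000

    JavaStreamingContext streamingContext =
        new JavaStreamingContext(conf, Durations.seconds(10));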



On Thu, Aug 20, 2015 at 5:47 PM, nehalsyed <nehal_s...@cable.comcast.com> wrote:
I have data in a Kafka topic-partition and I am reading it from Spark like this:

    JavaPairReceiverInputDStream<String, String> directKafkaStream =
        KafkaUtils.createDirectStream(streamingContext,
            [key class], [value class], [key decoder class], [value decoder class],
            [map of Kafka parameters], [set of topics to consume]);

I want messages from a Kafka partition to always land on the same machine in the 
Spark RDD, so I can cache some decoration data locally and later reuse it with 
other messages (that belong to the same key). Can anyone tell me how I can 
achieve this? Thanks
