Hi, 

I have the following use case:

1. Reference data stored in an RDD that is persisted and partitioned using a
simple custom partitioner.
2. An input stream from Kafka that uses the same partitioning algorithm as the
reference data RDD - this partitioning is done on the Kafka side.

I am using Kafka direct streams, so the Kafka partitions map one-to-one to the
partitions in the Spark RDD. From testing and the documentation I can see that
Spark knows nothing about how the data has been partitioned in Kafka.
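
For concreteness, here is roughly what the reference data side looks like. This
is only a minimal sketch with placeholder names and data - CustomPartitioner
stands in for whatever scheme the Kafka producer actually uses:

import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Placeholder partitioner - assumed to match the algorithm used on the Kafka side.
class CustomPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int =
    ((key.hashCode % parts) + parts) % parts
}

object RefDataSetup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ref-data").setMaster("local[2]"))
    val partitioner = new CustomPartitioner(4)

    // Reference data, partitioned once with the custom partitioner and persisted.
    val refData = sc
      .parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3))
      .partitionBy(partitioner)
      .persist(StorageLevel.MEMORY_ONLY)

    // refData.partitioner is Some(partitioner), whereas a batch RDD coming off a
    // Kafka direct stream reports partitioner == None even though its partitions
    // line up with the Kafka partitions.
    println(refData.partitioner)
  }
}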

In my use case I need to join the reference data RDD and the input stream
RDD. Because I have manually ensured the incoming data from Kafka uses the
same partitioning algorithm, I know the data is already grouped correctly in
the input stream RDD in Spark, but I cannot do a join without a shuffle step
because Spark has no knowledge of how the data has been partitioned.

I can see two ways to do this (both sketched below).
1. Explicitly call partitionBy(customPartitioner) on the input stream RDD,
followed by a join. This shuffles the input stream RDD and then performs a
co-partitioned join.
2. Call join on the reference data RDD, passing in the input stream RDD.
Spark shuffles the input stream RDD under the hood and the join takes place,
preferring to run on nodes that have local access to the reference data RDD.
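
To make the two options concrete, this is roughly what I mean for a single
batch inside foreachRDD (streamRDD stands for the batch RDD from the direct
stream; the types and names are just illustrative):

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Called from something like:
//   stream.foreachRDD { streamRDD => joinOptions(refData, streamRDD, partitioner) }
def joinOptions(refData: RDD[(String, Int)],
                streamRDD: RDD[(String, String)],
                partitioner: Partitioner): Unit = {

  // Option 1: explicitly re-partition the batch, then join.
  // partitionBy shuffles streamRDD; the join itself is then co-partitioned.
  val joined1 = streamRDD.partitionBy(partitioner).join(refData)

  // Option 2: join directly on the reference data RDD. Spark shuffles
  // streamRDD under the hood (it has no partitioner) and keeps refData's
  // partitioner for the result.
  val joined2 = refData.join(streamRDD)

  joined1.count()
  joined2.count()
}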

Is there any difference between the two methods above, or will both cause the
same sequence of events to take place in Spark?
Is everything I have stated above correct?

Finally, is there any roadmap feature that would allow the user to push a
partitioner into an already created RDD without doing a shuffle? Spark would
trust that the data is set up correctly (as in the use case above) and simply
fill in the necessary metadata on the RDD partitions, e.g. check the first
entry in each partition to determine the partition number of the data.
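
To clarify the kind of thing I am asking for, I imagine it could be
approximated today with a one-to-one wrapper RDD that simply declares the
partitioner. This is purely my own hypothetical sketch, not an existing Spark
API, and it is only safe if the data in each partition really does match what
the declared partitioner would compute:

import scala.reflect.ClassTag

import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper: passes the parent's data through one-to-one but claims
// it is already partitioned by `part`, so joins against a co-partitioned RDD
// can skip the shuffle. The caller is responsible for the claim being true.
class AssumePartitionedRDD[K: ClassTag, V: ClassTag](
    prev: RDD[(K, V)],
    part: Partitioner)
  extends RDD[(K, V)](prev) {

  override val partitioner: Option[Partitioner] = Some(part)

  override protected def getPartitions: Array[Partition] =
    firstParent[(K, V)].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
    firstParent[(K, V)].iterator(split, context)
}

// Usage idea: new AssumePartitionedRDD(streamRDD, partitioner).join(refData)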

Thank you in advance for any help on this issue. 
Dave.  
