If you call a transformation on an RDD using the same partitioner as that RDD, no shuffle will occur. KafkaRDD doesn't have a partitioner, because there's no consistent partitioning scheme that works for all Kafka uses. You can wrap each KafkaRDD in an RDD with a custom partitioner that you write to match your Kafka partitioning scheme, and avoid a shuffle.
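A minimal sketch of such a custom partitioner. Spark's Partitioner contract is modeled locally here so the example compiles on its own; in real code you would extend org.apache.spark.Partitioner instead. The key-to-partition rule below (non-negative hashCode mod) is an assumption for illustration — the Kafka Java producer's default partitioner actually uses murmur2 over the serialized key bytes, so you must reproduce whatever scheme your producers really use.

```scala
// Local stand-in for org.apache.spark.Partitioner, so this sketch is
// self-contained. Replace with the real Spark class in actual code.
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

// Hypothetical partitioner: only correct if your producers assigned
// Kafka partitions as nonNegativeMod(key.hashCode, numPartitions).
// This is NOT Kafka's default (murmur2 over key bytes) — adjust to match
// your producers, or you'll silently read state for the wrong keys.
class KafkaMatchingPartitioner(val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw // keep the result non-negative
  }
}
```

Because KafkaRDD partition i already holds exactly the records of Kafka partition i, a wrapper RDD that declares this partitioner (overriding `partitioner` while delegating `getPartitions`/`compute` to the parent) moves no data — it only tells Spark the layout it already has.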
The danger there is that if you have any misbehaving producers, or you translate the partitioning wrongly, you'll get bad results. It's safer just to shuffle.

On Oct 31, 2016 04:31, "Andrii Biletskyi" <andrii.bilets...@yahoo.com.invalid> wrote:

Hi all,

I'm using the Spark Streaming mapWithState operation to do a stateful operation on my Kafka stream (though I think similar arguments would apply for any source). I'm trying to understand a way to control mapWithState's partitioning scheme.

My transformations are simple:
1) create a KafkaDStream
2) mapPartitions to get a key-value stream, where `key` corresponds to the Kafka message key
3) apply the mapWithState operation on the key-value stream; the state stream shares keys with the original stream, and the resulting stream doesn't change keys either

The problem is that, as I understand it, the mapWithState stream has a different partitioning scheme, and thus I see shuffles in the Spark Web UI. From the mapWithState implementation I see that mapWithState uses the Partitioner if one is specified, and otherwise partitions data with HashPartitioner(<default-parallelism-conf>).

The thing is that the original KafkaDStream has a specific partitioning scheme: Kafka partitions correspond to Spark RDD partitions.

Question: is there a way for the mapWithState stream to inherit its partitioning scheme from the original stream (i.e. correspond to Kafka partitions)?

Thanks,
Andrii