For a little background, the difference between this partitioner and the default one is that it breaks the deterministic mapping from key to partition. Instead, messages for a given key can end up in either of two partitions, which means a single consumer generally won't see all messages for a given key. Consumers would instead compute a partial aggregate for each key on the partitions they consume and write those partials to a separate topic. For example, if you are writing log messages to a "logs" topic with the hostname as the key, you could use this partitioning strategy to compute message counts for each host in each partition and write them to a "log-counts" topic. A consumer of the "log-counts" topic would then compute the total for each host by combining the two intermediate aggregates. The benefit is that you generally get better load balancing across partitions than with the default partitioner. (Please correct me if my understanding is incorrect, Gianmarco.)
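For concreteness, here is a rough sketch of what such a "two choices" partitioner might look like against the pluggable Partitioner interface from KAFKA-2091. The class name, the salting trick used to derive a second hash, and the local per-partition counters are my own illustrative assumptions, not the actual KAFKA-2092 implementation:

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Hypothetical "two choices" partitioner: each key hashes to two candidate
// partitions, and the producer picks whichever candidate it has sent fewer
// messages to so far. Keys therefore span at most two partitions, so
// consumers must emit per-partition partial aggregates and combine them
// downstream. Illustration only, not the KAFKA-2092 patch itself.
public class TwoChoicesPartitioner implements Partitioner {

    // Local count of messages this producer has sent to each partition.
    // Note: not thread-safe; a real implementation would need concurrent counters.
    private final Map<Integer, Long> localCounts = new HashMap<>();

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null)
            return 0; // a real partitioner would round-robin null keys

        // Two candidate partitions from two different hashes of the same key.
        int first = (Utils.murmur2(keyBytes) & 0x7fffffff) % numPartitions;
        int second = (Utils.murmur2(salt(keyBytes)) & 0x7fffffff) % numPartitions;

        // Send to whichever candidate this producer has used less so far.
        int choice = localCounts.getOrDefault(first, 0L) <= localCounts.getOrDefault(second, 0L)
                ? first : second;
        localCounts.merge(choice, 1L, Long::sum);
        return choice;
    }

    // Cheap way to derive a second, roughly independent hash: perturb the key bytes.
    private static byte[] salt(byte[] keyBytes) {
        byte[] salted = keyBytes.clone();
        salted[0] ^= 0x5A;
        return salted;
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }

    @Override
    public void close() {
    }
}

In the logging example, a consumer of "logs" would keep a running count per host for the partitions it owns and periodically produce those counts to "log-counts" keyed by host; a downstream consumer of "log-counts" then sums the (at most) two partials per host to get the true total.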
So I think the question is whether this is a useful primitive for Kafka to provide out of the box. I was a little concerned that this use case is too esoteric for a core feature, but it may make more sense in the context of KIP-28, which would provide some higher-level processing capabilities (though it doesn't seem like the KStream abstraction would provide a direct way to leverage this partitioner without custom logic).

Thanks,
Jason

On Wed, Jul 22, 2015 at 12:14 AM, Gianmarco De Francisci Morales <
g...@apache.org> wrote:

> Hello folks,
>
> I'd like to ask the community for its opinion on the partitioning
> functions in Kafka.
>
> With KAFKA-2091 <https://issues.apache.org/jira/browse/KAFKA-2091>
> integrated, we are now able to have custom partitioners in the producer.
> The question now becomes *which* partitioners should ship with Kafka?
> This issue arose in the context of KAFKA-2092
> <https://issues.apache.org/jira/browse/KAFKA-2092>, which implements a
> specific load-balanced partitioning. This partitioner, however, assumes
> some stages of processing on top of it to make proper use of the data,
> i.e., it envisions Kafka as a substrate for stream processing, and not
> only as the I/O component.
> Is this a direction that Kafka wants to go towards? Or is this a role
> better left to the internal communication systems of other stream
> processing engines (e.g., Storm)?
> And if the answer is the latter, how would something such as Samza (which
> relies mostly on Kafka as its communication substrate) be able to
> implement advanced partitioning schemes?
>
> Cheers,
> --
> Gianmarco