If you are used to map-reduce patterns, this sounds like a perfectly natural way to process streams of data.
Call the first consumer "map-combine-log", the topic "shuffle-log" and the
second consumer "reduce-log" :) I like that a lot. It works well for either
"embarrassingly parallel" cases, or "so much data that more parallelism is
worth the extra overhead" cases. I personally don't care if it's in
core-Kafka, KIP-28 or a github project elsewhere, but I find it useful and
non-esoteric.

On Mon, Jul 27, 2015 at 12:51 PM, Jason Gustafson <ja...@confluent.io> wrote:

> For a little background, the difference between this partitioner and the
> default one is that it breaks the deterministic mapping from key to
> partition. Instead, messages for a given key can end up in either of two
> partitions. This means that the consumer generally won't see all messages
> for a given key. Instead, the consumer would compute an aggregate for each
> key on the partitions it consumes and write them to a separate topic. For
> example, if you are writing log messages to a "logs" topic with the
> hostname as the key, you could use this partitioning strategy to compute
> message counts for each host in each partition and write them to a
> "log-counts" topic. Then a consumer of the "log-counts" topic would compute
> total aggregates based on the two intermediate aggregates. The benefit is
> that you are generally going to get better load balancing across partitions
> than if you used the default partitioner. (Please correct me if my
> understanding is incorrect, Gianmarco)
>
> So I think the question is whether this is a useful primitive for Kafka to
> provide out of the box? I was a little concerned that this use case is a
> little esoteric for a core feature, but it may make more sense in the
> context of KIP-28, which would provide some higher-level processing
> capabilities (though it doesn't seem like the KStream abstraction would
> provide a direct way to leverage this partitioner without custom logic).
>
> Thanks,
> Jason
>
>
> On Wed, Jul 22, 2015 at 12:14 AM, Gianmarco De Francisci Morales <
> g...@apache.org> wrote:
>
>> Hello folks,
>>
>> I'd like to ask the community about its opinion on the partitioning
>> functions in Kafka.
>>
>> With KAFKA-2091 <https://issues.apache.org/jira/browse/KAFKA-2091>
>> integrated, we are now able to have custom partitioners in the producer.
>> The question now becomes *which* partitioners should ship with Kafka?
>> This issue arose in the context of KAFKA-2092
>> <https://issues.apache.org/jira/browse/KAFKA-2092>, which implements a
>> specific load-balanced partitioning. This partitioner, however, assumes
>> some stages of processing on top of it to make proper use of the data,
>> i.e., it envisions Kafka as a substrate for stream processing, and not
>> only as the I/O component.
>> Is this a direction that Kafka wants to go in? Or is this a role better
>> left to the internal communication systems of other stream processing
>> engines (e.g., Storm)?
>> And if the answer is the latter, how would something such as Samza (which
>> relies mostly on Kafka as its communication substrate) be able to
>> implement advanced partitioning schemes?
>>
>> Cheers,
>> --
>> Gianmarco
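For readers following along: the scheme Jason describes (each key hashed to two
candidate partitions, the producer picking the less loaded one, and a
downstream "reduce" consumer merging the at-most-two partial aggregates per
key) can be sketched roughly as below. This is a minimal standalone
illustration, not Kafka's actual Partitioner API nor the KAFKA-2092 patch; the
class names, the second hash function, and the load-tracking strategy are all
made up for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Rough sketch of "two choices" partitioning plus two-stage aggregation.
// Illustrative only; not Kafka's Partitioner interface or KAFKA-2092 itself.
public class TwoChoices {

    static class Partitioner {
        final long[] load; // messages routed to each partition by this producer

        Partitioner(int numPartitions) { load = new long[numPartitions]; }

        // Map the key to two candidate partitions via two hashes and pick the
        // currently less-loaded one, so a hot key is split across two partitions.
        int partition(String key) {
            int n = load.length;
            int c1 = Math.floorMod(key.hashCode(), n);
            int c2 = Math.floorMod(key.hashCode() ^ 0x5bd1e995, n); // second hash (made up)
            int p = (load[c1] <= load[c2]) ? c1 : c2;
            load[p]++;
            return p;
        }
    }

    public static void main(String[] args) {
        Partitioner partitioner = new Partitioner(4);

        // Stage 1 ("map-combine-log"): each partition's consumer keeps a
        // partial count per key, as in the hostname/log-counts example.
        Map<Integer, Map<String, Long>> partials = new HashMap<>();
        String[] hosts = {"host-a", "host-a", "host-b", "host-a", "host-c", "host-b"};
        for (String host : hosts) {
            int p = partitioner.partition(host);
            partials.computeIfAbsent(p, k -> new HashMap<>())
                    .merge(host, 1L, Long::sum);
        }

        // Stage 2 ("reduce-log"): a consumer of the intermediate topic merges
        // the partial aggregates (at most two per key) into totals.
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> partial : partials.values())
            partial.forEach((host, count) -> totals.merge(host, count, Long::sum));

        System.out.println(new TreeMap<>(totals)); // {host-a=3, host-b=2, host-c=1}
    }
}
```

Note that the merge in stage 2 works only for decomposable aggregates (counts,
sums, min/max); that restriction is part of why the thread treats this as a
stream-processing primitive rather than a general-purpose default partitioner.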