For a little background, the difference between this partitioner and the
default one is that it breaks the deterministic mapping from key to
partition. Instead, messages for a given key can end up in either of two
partitions. This means that the consumer generally won't see all messages
for a given key. Instead, the consumer would compute an aggregate for each
key on the partitions it consumes and write them to a separate topic. For
example, if you are writing log messages to a "logs" topic with the
hostname as the key, you could use this partitioning strategy to compute
message counts for each host in each partition and write them to a
"log-counts" topic. Then a consumer of the "log-counts" topic would compute
total aggregates based on the two intermediate aggregates. The benefit is
that you are generally going to get better load balancing across partitions
than if you used the default partitioner. (Please correct me if my
understanding is incorrect, Gianmarco)
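
To make the idea concrete, here is a rough sketch of what such a "two
choices" partitioner could look like against the pluggable Partitioner
interface from KAFKA-2091. To be clear, this is not the KAFKA-2092 patch;
the class name, the salted second hash, and the purely local per-partition
counts below are just stand-ins for illustration:

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.utils.Utils;

// Illustrative sketch only (null keys and error handling omitted): each key
// hashes to two candidate partitions, and we send to whichever of the two
// this producer instance has sent fewer messages to so far.
public class TwoChoicesPartitioner implements Partitioner {

    // Messages sent per partition by this producer (a purely local signal).
    private final Map<Integer, AtomicLong> counts = new ConcurrentHashMap<>();

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int numPartitions = partitions.size();

        // Two independent hashes of the key give the two candidate partitions.
        byte[] salted = new byte[keyBytes.length + 1];
        System.arraycopy(keyBytes, 0, salted, 0, keyBytes.length);
        salted[keyBytes.length] = 0x01;
        int p1 = (Utils.murmur2(keyBytes) & 0x7fffffff) % numPartitions;
        int p2 = (Utils.murmur2(salted) & 0x7fffffff) % numPartitions;

        // Pick the less loaded of the two candidates and record the choice.
        int chosen = load(p1) <= load(p2) ? p1 : p2;
        counts.computeIfAbsent(chosen, p -> new AtomicLong()).incrementAndGet();
        return chosen;
    }

    private long load(int partition) {
        AtomicLong c = counts.get(partition);
        return c == null ? 0L : c.get();
    }

    @Override
    public void configure(Map<String, ?> configs) {}

    @Override
    public void close() {}
}

In the logs example, the consumer of each partition would count messages
per hostname and publish those partial counts to "log-counts", and a
downstream consumer would sum the (at most two) partial counts per host to
get the final totals.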

So I think the question is whether this is a useful primitive for Kafka to
provide out of the box. I was a little concerned that this use case is
somewhat esoteric for a core feature, but it may make more sense in the
context of KIP-28 which would provide some higher-level processing
capabilities (though it doesn't seem like the KStream abstraction would
provide a direct way to leverage this partitioner without custom logic).

Thanks,
Jason


On Wed, Jul 22, 2015 at 12:14 AM, Gianmarco De Francisci Morales <
g...@apache.org> wrote:

> Hello folks,
>
> I'd like to ask the community about its opinion on the partitioning
> functions in Kafka.
>
> With KAFKA-2091 <https://issues.apache.org/jira/browse/KAFKA-2091>
> integrated we are now able to have custom partitioners in the producer.
> The question now becomes *which* partitioners should ship with Kafka?
> This issue arose in the context of KAFKA-2092
> <https://issues.apache.org/jira/browse/KAFKA-2092>, which implements a
> specific load-balanced partitioning. This partitioner however assumes some
> stages of processing on top of it to make proper use of the data, i.e., it
> envisions Kafka as a substrate for stream processing, and not only as the
> I/O component.
> Is this a direction that Kafka wants to go towards? Or is this a role
> better left to the internal communication systems of other stream
> processing engines (e.g., Storm)?
> And if the answer is the latter, how would something such as Samza (which
> relies mostly on Kafka as its communication substrate) be able to implement
> advanced partitioning schemes?
>
> Cheers,
> --
> Gianmarco
>