[ https://issues.apache.org/jira/browse/KAFKA-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588420#comment-14588420 ]

Jay Kreps commented on KAFKA-2092:
----------------------------------

Hmm, not sure I follow this. My understanding of the power-of-two-choices 
technique is that you choose the less loaded of two options. However, this 
doesn't really do that, right, since it doesn't use the actual load? Instead 
it looks at the total number of messages sent over the life of a single 
producer. In general the lifecycle of that producer is unpredictable and there 
would be many of them, so it is hard to infer much from that count, right?
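
To make the concern concrete, here is a minimal sketch of what a 
producer-local two-choices partitioner might look like. It is a hypothetical 
illustration, not the code in KAFKA-2092-v1.patch: the class name, the seeds, 
and the hash mixing are all assumptions. The point is that the only "load" it 
can see is its own send history.

```java
import java.util.Arrays;

// Hypothetical illustration only; this is NOT the code in KAFKA-2092-v1.patch.
class TwoChoicesPartitioner {
    // Messages sent per partition by THIS producer instance since it started.
    // Assumption: this local count is the only "load" signal available.
    private final long[] localCounts;

    TwoChoicesPartitioner(int numPartitions) {
        this.localCounts = new long[numPartitions];
    }

    // Derive a candidate partition from the key and a seed (assumed hash mixing).
    static int candidate(String key, int seed, int numPartitions) {
        int h = 31 * key.hashCode() + seed;
        h ^= (h >>> 16);
        return Math.floorMod(h, numPartitions);
    }

    // Pick whichever of the key's two candidates this producer has used less.
    int partition(String key) {
        int a = candidate(key, 17, localCounts.length);
        int b = candidate(key, 9176, localCounts.length);
        int chosen = localCounts[a] <= localCounts[b] ? a : b;
        localCounts[chosen]++;
        return chosen;
    }

    long localCount(int partition) {
        return localCounts[partition];
    }

    public static void main(String[] args) {
        TwoChoicesPartitioner longLived = new TwoChoicesPartitioner(8);
        for (int i = 0; i < 1000; i++) {
            longLived.partition("key-" + (i % 50));
        }
        // A freshly started producer has no history, so its view of the "load"
        // is all zeros regardless of what other producers already sent.
        TwoChoicesPartitioner restarted = new TwoChoicesPartitioner(8);
        System.out.println(Arrays.toString(longLived.localCounts));
        System.out.println(Arrays.toString(restarted.localCounts));
    }
}
```

Two producers running this side by side would each balance their own counts, 
but neither would know what the other (or the broker) has actually seen.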

Doesn't this also break the key=>partition mapping being deterministic, which 
means you need to have a second stage of processing to combine results, and 
the results need to be combinable (e.g. if you are doing word count, a single 
word would be sent to two counters, so the final count requires a new stage to 
combine both of these)?
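
As a rough sketch of the extra stage the word count example would need, 
assuming each first-stage counter emits a map of partial counts (the class and 
method names here are hypothetical):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical second stage for the word count example.
class PartialCountMerge {
    // Sum the partial per-word counts produced by each first-stage counter.
    // This only works because addition is associative and commutative.
    static Map<String, Long> merge(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            for (Map.Entry<String, Long> e : partial.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // The same word ("kafka") was counted on two different partitions.
        Map<String, Long> counterA = new HashMap<>();
        counterA.put("kafka", 3L);
        counterA.put("samza", 1L);
        Map<String, Long> counterB = new HashMap<>();
        counterB.put("kafka", 2L);
        System.out.println(merge(Arrays.asList(counterA, counterB)));
    }
}
```

An aggregate that cannot be merged this way (an exact median, say) would not 
fit this pattern without further work.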

Also, if the skew comes from the distribution of keys then this doesn't help 
too much, right, because the heavy-hitter keys will now just be split over two 
partitions instead of one.
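
A back-of-the-envelope version of that point, as a small sketch (the class 
and parameter names are made up for illustration): if one key carries a 
fraction p of all messages and can go to d partitions, the busiest partition 
still gets at least p/d of the traffic, and never less than the fair share 
1/n.

```java
// Hypothetical helper for the arithmetic above.
class HotKeyBound {
    // Lower bound on the busiest partition's share of total traffic when a
    // single key carrying fraction p of all messages is spread over d of the
    // numPartitions partitions.
    static double minMaxShare(double p, int d, int numPartitions) {
        return Math.max(p / d, 1.0 / numPartitions);
    }
}
```

For example, with p = 0.5 and 100 partitions the bound is 0.5 with one 
partition per key and 0.25 with two, against a fair share of 0.01.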

Maybe you can explain a bit more?

> New partitioning for better load balancing
> ------------------------------------------
>
>                 Key: KAFKA-2092
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2092
>             Project: Kafka
>          Issue Type: Improvement
>          Components: producer 
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Jun Rao
>         Attachments: KAFKA-2092-v1.patch
>
>
> We have recently studied the problem of load balancing in distributed stream 
> processing systems such as Samza [1].
> In particular, we focused on what happens when the key distribution of the 
> stream is skewed when using key grouping.
> We developed a new stream partitioning scheme (which we call Partial Key 
> Grouping). It achieves better load balancing than hashing while being more 
> scalable than round robin in terms of memory.
> In the paper we show a number of mining algorithms that are easy to implement 
> with partial key grouping, and whose performance can benefit from it. We 
> think that it might also be useful for a larger class of algorithms.
> PKG has already been integrated in Storm [2], and I would like to be able to 
> use it in Samza as well. As far as I understand, Kafka producers are the ones 
> that decide how to partition the stream (or Kafka topic).
> I do not have experience with Kafka; however, partial key grouping is very 
> easy to implement: it requires just a few lines of code in Java when 
> implemented as a custom grouping in Storm [3].
> I believe it should be very easy to integrate.
> For all these reasons, I believe it will be a nice addition to Kafka/Samza. 
> If the community thinks it's a good idea, I will be happy to offer support in 
> the porting.
> References:
> [1] https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
> [2] https://issues.apache.org/jira/browse/STORM-632
> [3] https://github.com/gdfm/partial-key-grouping



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
