Gianmarco De Francisci Morales created KAFKA-2092:
-----------------------------------------------------

             Summary: New partitioning for better load balancing
                 Key: KAFKA-2092
                 URL: https://issues.apache.org/jira/browse/KAFKA-2092
             Project: Kafka
          Issue Type: Improvement
          Components: producer 
            Reporter: Gianmarco De Francisci Morales
            Assignee: Jun Rao


We have recently studied the problem of load balancing in distributed stream 
processing systems such as Samza [1].
In particular, we focused on what happens under key grouping when the key 
distribution of the stream is skewed.
We developed a new stream partitioning scheme called Partial Key Grouping 
(PKG). It achieves better load balancing than hashing while requiring less 
memory than round robin, so it scales better.

In the paper we show a number of mining algorithms that are easy to implement 
with partial key grouping, and whose performance can benefit from it. We think 
that it might also be useful for a larger class of algorithms.

Partial key grouping has already been integrated in Storm [2], and I would 
like to be able to use it in Samza as well. As far as I understand, Kafka 
producers are the ones that decide how to partition the stream (or Kafka 
topic). Even after doing a bit of reading, I am still not sure whether this 
request belongs here or on the Samza dev list. My first guess is Kafka.

I do not have experience with Kafka; however, partial key grouping is very 
easy to implement: it requires just a few lines of Java when implemented as a 
custom grouping in Storm [3].
I believe it should be very easy to integrate.
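
As a rough illustration, the idea described in [1] can be sketched in Java as 
follows. This is a minimal sketch under stated assumptions, not the code from 
[3] and not Kafka's partitioner interface: the class name PartialKeyGrouper, 
the seeded hash function, and the seed constants are all hypothetical. Each 
key is hashed with two different functions to obtain two candidate partitions, 
and the message goes to whichever candidate this producer has sent fewer 
messages to so far.

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical sketch of partial key grouping.
 * Not tied to any real Kafka or Storm interface.
 */
final class PartialKeyGrouper {
    // Arbitrary, distinct seeds so the two hash functions disagree (assumption).
    private static final int SEED_A = 0x811C9DC5;
    private static final int SEED_B = 0x9747B28C;

    // Messages this producer has sent to each partition (a local load estimate).
    private final long[] sentCounts;

    PartialKeyGrouper(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.sentCounts = new long[numPartitions];
    }

    /** Returns the partition for this key and records the send locally. */
    int partition(byte[] key) {
        int n = sentCounts.length;
        int first = Math.floorMod(hash(key, SEED_A), n);
        int second = Math.floorMod(hash(key, SEED_B), n);
        // Pick the less loaded of the two candidates; ties go to the first.
        int chosen = sentCounts[first] <= sentCounts[second] ? first : second;
        sentCounts[chosen]++;
        return chosen;
    }

    /** Local count of messages sent to the given partition. */
    long loadOf(int partition) {
        return sentCounts[partition];
    }

    // Simple seeded FNV-1a-style hash, used here only as a stand-in for a
    // stronger hash function.
    private static int hash(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    public static void main(String[] args) {
        PartialKeyGrouper grouper = new PartialKeyGrouper(8);
        byte[] hotKey = "hot-key".getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < 10; i++) {
            System.out.println(grouper.partition(hotKey));
        }
    }
}
```

Because a key can land on at most two partitions, a downstream consumer has to 
merge at most two partial results per key, which is where the memory saving 
over round robin comes from.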

For all these reasons, I believe it would be a nice addition to Kafka/Samza. 
If the community thinks it's a good idea, I will be happy to help with the 
port.

References:
[1] https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
[2] https://issues.apache.org/jira/browse/STORM-632
[3] https://github.com/gdfm/partial-key-grouping


