[ 
https://issues.apache.org/jira/browse/KAFKA-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634927#comment-14634927
 ] 

Gianmarco De Francisci Morales edited comment on KAFKA-2092 at 7/21/15 10:42 AM:
---------------------------------------------------------------------------------

[~hachikuji] thanks for your comment.
Indeed, this kind of partitioning is more elaborate than simple round robin or 
hashing.
However, if Kafka aims to be the substrate for stream processing (and not only 
the equivalent of a file system for I/O), then I think it should keep an eye on 
performance and allow for more flexible partitioning schemes for more advanced 
use cases.

PKG is pretty straightforward, so I don't think it would be misplaced in Kafka.
That said, if we wanted to go the Samza route, it is not clear to me how this 
could go into Samza without support for it in Kafka.
My understanding is that the partitioning schemes available in Samza are directly 
those supported by Kafka.
Could you elaborate on how we could integrate this in Samza?
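
For reference, here is a minimal sketch of the PKG idea (it is not the attached 
patch). Each key is hashed with two different functions, and the message goes to 
whichever of the two candidate partitions this producer has sent fewer messages 
to so far. The class name, the seeds, and the mixing function are assumptions 
made for illustration only. None of them come from Kafka's API or from 
KAFKA-2092-v2.patch.

```java
/**
 * Illustrative sketch of Partial Key Grouping (hypothetical class, not Kafka code).
 * Each key maps to two candidate partitions; the locally less loaded one is chosen.
 */
final class PartialKeyGrouping {
    // Arbitrary seeds (assumption) used to derive two different hash functions.
    private static final int SEED_A = 0x9747b28c;
    private static final int SEED_B = 0x1b873593;

    // Local estimate of how many messages this producer sent to each partition.
    private final long[] sentCounts;

    PartialKeyGrouping(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.sentCounts = new long[numPartitions];
    }

    /** Returns the partition for this key and records the send locally. */
    int partition(Object key) {
        int first = candidate(key, SEED_A);
        int second = candidate(key, SEED_B);
        // Pick the candidate with the smaller local load; ties go to the first.
        int chosen = sentCounts[first] <= sentCounts[second] ? first : second;
        sentCounts[chosen]++;
        return chosen;
    }

    /** Copy of the local load estimates, for inspection. */
    long[] loads() {
        return sentCounts.clone();
    }

    private int candidate(Object key, int seed) {
        // floorMod keeps the index non-negative even for negative hash values.
        return Math.floorMod(mix(key.hashCode() ^ seed), sentCounts.length);
    }

    // Integer bit mixer (the 32-bit finalizer used by MurmurHash3).
    private static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }
}
```

If only one hot key is sent and its two candidates differ, successive calls to 
partition() alternate between the two partitions, so the key's load is split in 
two instead of landing on a single partition. The only state kept is one counter 
per partition, independent of the number of keys.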



> New partitioning for better load balancing
> ------------------------------------------
>
>                 Key: KAFKA-2092
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2092
>             Project: Kafka
>          Issue Type: Improvement
>          Components: producer 
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Jun Rao
>         Attachments: KAFKA-2092-v1.patch, KAFKA-2092-v2.patch
>
>
> We have recently studied the problem of load balancing in distributed stream 
> processing systems such as Samza [1].
> In particular, we focused on what happens under key grouping when the key 
> distribution of the stream is skewed.
> We developed a new stream partitioning scheme (which we call Partial Key 
> Grouping). It achieves better load balancing than hashing while being more 
> scalable than round robin in terms of memory.
> In the paper we show a number of mining algorithms that are easy to implement 
> with partial key grouping, and whose performance can benefit from it. We 
> think that it might also be useful for a larger class of algorithms.
> PKG has already been integrated in Storm [2], and I would like to be able to 
> use it in Samza as well. As far as I understand, Kafka producers are the ones 
> that decide how to partition the stream (or Kafka topic).
> I do not have experience with Kafka; however, partial key grouping is very 
> easy to implement: it requires just a few lines of code in Java when 
> implemented as a custom grouping in Storm [3].
> I believe it should be very easy to integrate.
> For all these reasons, I believe it will be a nice addition to Kafka/Samza. 
> If the community thinks it's a good idea, I will be happy to offer support in 
> the porting.
> References:
> [1] 
> https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
> [2] https://issues.apache.org/jira/browse/STORM-632
> [3] https://github.com/gdfm/partial-key-grouping



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
