Re: [DISCUSS] New partitioning for better load balancing

Gianmarco De Francisci Morales Tue, 07 Apr 2015 01:52:14 -0700

Hi Guozhang,

Thanks for your comments.


1) Yes, ordering cannot be guaranteed in PKG. In general, algorithms that
use PKG should compute commutative and associative functions of the input.
If you need strict ordering (i.e., the function is not commutative) within
a partition, use KG.

2) I am not sure I understand the issue. PKG does not deal with inter-topic
load balancing. Topic A and topic B are completely independent in our
framework.
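
To illustrate point 1, below is a minimal sketch (in Python, for brevity) of
how PKG routes keys and why the downstream function has to be commutative and
associative. It follows the idea from the paper: each key is hashed to two
candidate partitions, and the message goes to whichever of the two the sender
currently estimates to be less loaded. The names (PartialKeyGrouper,
word_count) and the MD5-based seeded hash are assumptions made for this
illustration only; this is not the Storm implementation and not a Kafka API.

```python
import hashlib
from collections import Counter
from typing import Iterable, List, Tuple


def _hash(key: str, seed: int, num_partitions: int) -> int:
    """Deterministic seeded hash of a key onto a partition index (illustrative)."""
    digest = hashlib.md5(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class PartialKeyGrouper:
    """Hypothetical sender-side router: picks the less loaded of two candidates."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        # Local load estimate: how many messages THIS sender routed to each partition.
        self.load: List[int] = [0] * num_partitions

    def candidates(self, key: str) -> Tuple[int, int]:
        # Two independent hash functions give the two choices for a key.
        return (_hash(key, 1, self.num_partitions),
                _hash(key, 2, self.num_partitions))

    def partition(self, key: str) -> int:
        first, second = self.candidates(key)
        chosen = first if self.load[first] <= self.load[second] else second
        self.load[chosen] += 1
        return chosen


def word_count(stream: Iterable[str], num_partitions: int = 4):
    """Counts words with PKG; each partition only holds partial counts."""
    grouper = PartialKeyGrouper(num_partitions)
    partial = [Counter() for _ in range(num_partitions)]
    for word in stream:
        partial[grouper.partition(word)][word] += 1
    # Merge step: addition is commutative and associative, so it does not
    # matter which partition saw which occurrence, or in what order.
    total = Counter()
    for counts in partial:
        total.update(counts)
    return total, grouper.load
```

Since a key can land on either of its two partitions, neither partition sees
the full, ordered sequence for that key. Counting still works because the
partial results can be summed in any order; a function that depends on the
order of messages within a key would need KG instead.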

Cheers,

--
Gianmarco

On 7 April 2015 at 02:56, Guozhang Wang <[email protected]> wrote:

> Gianmarco,
>
> I browsed through your paper (congrats on the ICDE publication BTW!), and
> here are some questions / comments on the algorithm:
>
> 1. One motivation for enabling key-based partitioning in Kafka is to achieve
> per-key ordering, i.e. with all messages with the same key sent to the same
> partition, their ordering is preserved. However, "key-splitting" seems to
> break this guarantee, since messages with the same key may now be sent to
> 2 (or generally speaking many) partitions.
>
> 2. As for the local load estimation, there is a second mapping from
> partitions (workers in your paper) to broker hosts besides the mapping from
> keys to partitions, and not all broker hosts maintain each of the
> partitions. For example, there are 4 brokers, and broker-1/2 each takes one
> of the two partitions of topic A, while broker-3/4 each takes one of the
> two partitions of topic B, etc.
>
> I am wondering if those two issues can be resolved with the PKG framework?
>
> Guozhang
>
> On Sun, Apr 5, 2015 at 12:19 AM, Gianmarco De Francisci Morales <
> [email protected]> wrote:
>
> > Hi Jay,
> >
> > Thanks, that sounds like a necessary step. I guess I expected something like
> > that to be already there, at least internally.
> > I created KAFKA-2092 to track the PKG integration.
> >
> > Cheers,
> >
> > --
> > Gianmarco
> >
> > On 4 April 2015 at 23:50, Jay Kreps <[email protected]> wrote:
> >
> > > Hey guys,
> > >
> > > > I think the first step here would be to expose a partitioner interface for
> > > > the new producer that would make it easy to plug in these different
> > > strategies. I filed a JIRA for this:
> > > https://issues.apache.org/jira/browse/KAFKA-2091
> > >
> > > -Jay
> > >
> > > On Fri, Apr 3, 2015 at 9:36 AM, Harsha <[email protected]> wrote:
> > >
> > >> Gianmarco,
> > >>                  I am coming from storm community. I think PKG is a
> very
> > >> interesting and we can provide an implementation of Partitioner for
> PKG.
> > >> Can you open a JIRA for this.
> > >>
> > >> --
> > >> Harsha
> > >> Sent with Airmail
> > >>
> > >> On April 3, 2015 at 4:49:15 AM, Gianmarco De Francisci Morales (
> > >> [email protected]) wrote:
> > >>
> > >> Hi,
> > >>
> > >> We have recently studied the problem of load balancing in distributed
> > >> stream processing systems such as Samza [1].
> > > >> In particular, we focused on what happens when the key distribution of the
> > > >> stream is skewed when using key grouping.
> > > >> We developed a new stream partitioning scheme (which we call Partial Key
> > > >> Grouping). It achieves better load balancing than hashing while being more
> > > >> scalable than round robin in terms of memory.
> > >>
> > >> In the paper we show a number of mining algorithms that are easy to
> > > >> implement with partial key grouping, and whose performance can benefit from
> > > >> it. We think that it might also be useful for a larger class of algorithms.
> > >>
> > > >> PKG has already been integrated in Storm [2], and I would like to be able
> > > >> to use it in Samza as well. As far as I understand, Kafka producers are the
> > > >> ones that decide how to partition the stream (or Kafka topic). Even after
> > > >> doing a bit of reading, I am still not sure if I should be writing this
> > > >> email here or on the Samza dev list. Anyway, my first guess is Kafka.
> > >>
> > > >> I do not have experience with Kafka; however, partial key grouping is very
> > > >> easy to implement: it requires just a few lines of code in Java when
> > > >> implemented as a custom grouping in Storm [3].
> > >> I believe it should be very easy to integrate.
> > >>
> > > >> For all these reasons, I believe it will be a nice addition to Kafka/Samza.
> > > >> If the community thinks it's a good idea, I will be happy to offer support
> > > >> in the porting.
> > >>
> > >> References:
> > >> [1]
> > >>
> > >>
> >
> https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
> > >> [2] https://issues.apache.org/jira/browse/STORM-632
> > >> [3] https://github.com/gdfm/partial-key-grouping
> > >> --
> > >> Gianmarco
> > >>
> > >
> > >
> >
>
>
>
> --
> -- Guozhang
>
