Re: [DISCUSS] New partitioning for better load balancing

2015-04-07 Thread Guozhang Wang
I see, thanks for the clarification.

Guozhang


Re: [DISCUSS] New partitioning for better load balancing

2015-04-07 Thread Gianmarco De Francisci Morales
Hi Guozhang,

Thanks for your comments.

1) Yes, ordering cannot be guaranteed in PKG. In general, algorithms that
use PKG should compute commutative and associative functions of the input.
If you need strict ordering (i.e., the function is not commutative) within
a partition, use KG (key grouping).

2) I am not sure I understand the issue. PKG does not deal with inter-topic
load balancing. Topic A and topic B are completely independent in our
framework.

Cheers,

--
Gianmarco
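
To make point 1 concrete, here is a minimal Java sketch of a commutative and
associative aggregation: a per-key count whose partial results, computed
independently on two partitions, can be merged in either order. The class and
method names are hypothetical and chosen only for illustration; they do not
come from the paper or from any Kafka/Samza API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: per-key counting over partial streams. */
final class PartialCounts {
    /** Counts the keys seen by the consumer of one partition. */
    static Map<String, Long> count(List<String> keys) {
        Map<String, Long> counts = new HashMap<>();
        for (String k : keys) {
            counts.merge(k, 1L, Long::sum);
        }
        return counts;
    }

    /**
     * Combines two partial results. Addition is commutative and
     * associative, so the merge order does not change the outcome.
     */
    static Map<String, Long> merge(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> out = new HashMap<>(a);
        b.forEach((k, v) -> out.merge(k, v, Long::sum));
        return out;
    }
}
```

A function such as "last value seen per key" would not fit this pattern,
because its result depends on arrival order across the two partitions.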


Re: [DISCUSS] New partitioning for better load balancing

2015-04-06 Thread Guozhang Wang
Gianmarco,

I browsed through your paper (congrats on the ICDE publication BTW!), and
here are some questions / comments on the algorithm:

1. One motivation for enabling key-based partitioning in Kafka is to achieve
per-key ordering, i.e. when all messages with the same key are sent to the
same partition, their ordering is preserved. However, "key-splitting" seems
to break this guarantee, since messages with the same key may now be sent to
2 (or generally speaking many) partitions.

2. As for the local load estimation, there is a second mapping from
partitions (workers in your paper) to broker hosts besides the mapping from
keys to partitions, and not every broker host maintains every partition.
For example, there are 4 brokers, and broker-1/2 each takes one
of the two partitions of topic A, while broker-3/4 each takes one of the
two partitions of topic B, etc.

I am wondering if those two issues can be resolved with the PKG framework?

Guozhang

-- 
-- Guozhang


Re: [DISCUSS] New partitioning for better load balancing

2015-04-05 Thread Gianmarco De Francisci Morales
Hi Jay,

Thanks, that sounds like a necessary step. I guess I expected something like
that to be already there, at least internally.
I created KAFKA-2092 to track the PKG integration.

Cheers,

--
Gianmarco



Re: [DISCUSS] New partitioning for better load balancing

2015-04-04 Thread Jay Kreps
Hey guys,

I think the first step here would be to expose a partitioner interface for
the new producer that would make it easy to plug in these different
strategies. I filed a JIRA for this:
https://issues.apache.org/jira/browse/KAFKA-2091

-Jay
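
As a rough illustration of what a pluggable partitioner could look like, here
is a minimal Java sketch. The interface `KeyPartitioner` and both
implementations are hypothetical names invented for this example; they are
not the actual Kafka producer API, and the real interface tracked in
KAFKA-2091 may look quite different.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical strategy interface; NOT the actual Kafka producer API. */
interface KeyPartitioner {
    /** Returns a partition index in the range [0, numPartitions). */
    int partition(byte[] key, int numPartitions);
}

/** Key grouping: the same key always maps to the same partition. */
final class HashPartitioner implements KeyPartitioner {
    @Override
    public int partition(byte[] key, int numPartitions) {
        // floorMod keeps the result non-negative for negative hash codes.
        return Math.floorMod(Arrays.hashCode(key), numPartitions);
    }
}

/** Round robin: ignores the key and cycles through the partitions. */
final class RoundRobinPartitioner implements KeyPartitioner {
    private final AtomicInteger next = new AtomicInteger();

    @Override
    public int partition(byte[] key, int numPartitions) {
        return Math.floorMod(next.getAndIncrement(), numPartitions);
    }
}
```

With an interface of this shape, a strategy such as PKG would be one more
implementation that a producer could be configured to use.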



Re: [DISCUSS] New partitioning for better load balancing

2015-04-03 Thread Harsha
Gianmarco,
I am coming from the Storm community. I think PKG is very interesting,
and we can provide an implementation of Partitioner for PKG. Can
you open a JIRA for this?

-- 
Harsha
Sent with Airmail


[DISCUSS] New partitioning for better load balancing

2015-04-03 Thread Gianmarco De Francisci Morales
Hi,

We have recently studied the problem of load balancing in distributed
stream processing systems such as Samza [1].
In particular, we focused on what happens under key grouping when the key
distribution of the stream is skewed.
We developed a new stream partitioning scheme (which we call Partial Key
Grouping). It achieves better load balancing than hashing while being more
scalable than round robin in terms of memory.

In the paper we show a number of mining algorithms that are easy to
implement with partial key grouping, and whose performance can benefit from
it. We think that it might also be useful for a larger class of algorithms.

PKG has already been integrated in Storm [2], and I would like to be able
to use it in Samza as well. As far as I understand, Kafka producers are the
ones that decide how to partition the stream (or Kafka topic). Even after
doing a bit of reading, I am still not sure if I should be writing this
email here or on the Samza dev list. Anyway, my first guess is Kafka.

I do not have experience with Kafka; however, partial key grouping is very
easy to implement: it requires just a few lines of code in Java when
implemented as a custom grouping in Storm [3].
I believe it should be very easy to integrate.

For all these reasons, I believe it will be a nice addition to Kafka/Samza.
If the community thinks it's a good idea, I will be happy to offer support
in the porting.

References:
[1] https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
[2] https://issues.apache.org/jira/browse/STORM-632
[3] https://github.com/gdfm/partial-key-grouping
--
Gianmarco
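
For a concrete picture of the idea, the following is a minimal Java sketch of
partial key grouping as described in [1]: each key is hashed with two
different functions, and the message goes to whichever of the two candidate
partitions this sender has used less so far (a purely local load estimate).
The class name, the seeded FNV-1a hash, and the seed values are assumptions
made for illustration; this is not the code from [3].

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of partial key grouping; not the code from [3]. */
final class PartialKeyGrouper {
    // Number of messages this sender has routed to each partition.
    private final long[] localLoad;

    PartialKeyGrouper(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.localLoad = new long[numPartitions];
    }

    /** Seeded FNV-1a; stands in for any pair of independent hash functions. */
    private static int hash(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    /** The two candidate partitions for a key (they may coincide). */
    int[] candidates(String key) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        int n = localLoad.length;
        return new int[] {
            Math.floorMod(hash(bytes, 17), n),
            Math.floorMod(hash(bytes, 31), n)
        };
    }

    /** Picks the less loaded candidate and records the message. */
    int choose(String key) {
        int[] c = candidates(key);
        int chosen = localLoad[c[0]] <= localLoad[c[1]] ? c[0] : c[1];
        localLoad[chosen]++;
        return chosen;
    }

    /** Copy of the per-partition counts kept by this sender. */
    long[] loads() {
        return localLoad.clone();
    }
}
```

Because a key can land on either of its two candidates, a hot key is split
across at most two partitions, which is why downstream consumers need to
merge partial results per key.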