Re: quick question about new consumer api
We plan to have a working prototype ready by the end of September.

Guozhang

On Mon, Jul 7, 2014 at 11:05 AM, Jason Rosenberg wrote:
> Great, that's reassuring!
>
> What's the time frame for having a more or less stable version to try out?
>
> Jason
Re: quick question about new consumer api
Great, that's reassuring!

What's the time frame for having a more or less stable version to try out?

Jason

On Mon, Jul 7, 2014 at 12:59 PM, Guozhang Wang wrote:
> I see your point now. The old consumer does have hard-coded "round-robin-per-topic" logic, which has this issue. In the new consumer, we will make the assignment logic customizable so that people can specify whichever rebalance algorithm they like.
>
> I will also soon send out a new consumer design summary email for more comments. Feel free to share any more thoughts you have about the new consumer design.
>
> Guozhang
Re: quick question about new consumer api
I see your point now. The old consumer does have hard-coded "round-robin-per-topic" logic, which has this issue. In the new consumer, we will make the assignment logic customizable so that people can specify whichever rebalance algorithm they like.

I will also soon send out a new consumer design summary email for more comments. Feel free to share any more thoughts you have about the new consumer design.

Guozhang

On Mon, Jul 7, 2014 at 8:44 AM, Jason Rosenberg wrote:
> Guozhang,
>
> I'm not suggesting we parallelize within a partition.
>
> The problem with the current high-level consumer is that if you use a regex to select multiple topics, and then have multiple consumers in the same group, usually the first consumer will 'own' all the topics, and no amount of subsequent rebalancing will allow other consumers in the group to own some of the topics. Rebalancing does allow other consumers to own multiple partitions, but if a topic has only 1 partition, the first consumer to initialize will get all the work.
>
> So, I'm wondering if the new api will be better about rebalancing the work at the partition level, rather than the topic level.
>
> Jason
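As an illustration of the customizable assignment logic mentioned in this message, here is a minimal Python sketch of a pluggable strategy. It is not the Kafka API: the names `Assignor`, `round_robin_across_topics`, and `rebalance`, and the topic and consumer names, are hypothetical and chosen only for this example. The strategy shown deals out the flattened list of (topic, partition) pairs, so single-partition topics are spread across the group.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for this sketch; none of these names come from Kafka.
TopicPartition = Tuple[str, int]
Assignment = Dict[str, List[TopicPartition]]
Assignor = Callable[[Dict[str, int], List[str]], Assignment]


def round_robin_across_topics(topics: Dict[str, int], consumers: List[str]) -> Assignment:
    """Deal out every (topic, partition) pair in turn, ignoring topic boundaries."""
    members = sorted(consumers)
    assignment: Assignment = {member: [] for member in members}
    # Flatten all topics into one ordered list of partitions.
    all_partitions = [(t, p) for t in sorted(topics) for p in range(topics[t])]
    for i, tp in enumerate(all_partitions):
        assignment[members[i % len(members)]].append(tp)
    return assignment


def rebalance(topics: Dict[str, int], consumers: List[str], assignor: Assignor) -> Assignment:
    """The group applies whichever strategy was plugged in."""
    return assignor(topics, consumers)


# Three single-partition topics (topic name -> partition count), three consumers.
balanced = rebalance(
    {"logs": 1, "metrics": 1, "events": 1},
    ["c1", "c2", "c3"],
    round_robin_across_topics,
)
print(balanced)
```

Because the strategy is just a function argument here, a different algorithm could be swapped in without touching `rebalance`; how the real consumer exposes this is not specified in the thread.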
Re: quick question about new consumer api
Guozhang,

I'm not suggesting we parallelize within a partition.

The problem with the current high-level consumer is that if you use a regex to select multiple topics, and then have multiple consumers in the same group, usually the first consumer will 'own' all the topics, and no amount of subsequent rebalancing will allow other consumers in the group to own some of the topics. Rebalancing does allow other consumers to own multiple partitions, but if a topic has only 1 partition, the first consumer to initialize will get all the work.

So, I'm wondering if the new api will be better about rebalancing the work at the partition level, rather than the topic level.

Jason

On Mon, Jul 7, 2014 at 11:17 AM, Guozhang Wang wrote:
> Hi Jason,
>
> In the new design, consumption is still at per-partition granularity. The main rationale is ordering: within a partition we want to preserve ordering, so that a message B produced after message A will also be consumed and processed after message A. Producers can use keys to make sure messages in the same ordering group land in the same partition. To do this, we have to ensure each partition is consumed by only a single client at a time. On the other hand, if one wants to increase the number of consumers beyond the number of partitions, one can always use the topic tool to dynamically add more partitions to the topic.
>
> Do you have a specific scenario in mind that would require single-partition topics?
>
> Guozhang
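To make the skew Jason describes concrete, the following Python sketch models a per-topic assignment in a deliberately simplified way. It is an assumption-laden toy, not the old consumer's actual code: the function name `assign_per_topic` and the topic and consumer names are hypothetical. Each topic's partitions are dealt out independently, always starting from the first consumer in sorted order.

```python
from typing import Dict, List, Tuple


def assign_per_topic(topics: Dict[str, int], consumers: List[str]) -> Dict[str, List[Tuple[str, int]]]:
    """Simplified model: every topic is assigned on its own, starting at the first consumer."""
    members = sorted(consumers)
    owned: Dict[str, List[Tuple[str, int]]] = {member: [] for member in members}
    for topic in sorted(topics):
        for partition in range(topics[topic]):
            # The index restarts at 0 for every topic, so partition 0 of
            # each topic always goes to the same consumer.
            owner = members[partition % len(members)]
            owned[owner].append((topic, partition))
    return owned


# Three single-partition topics (topic name -> partition count), three consumers.
skewed = assign_per_topic({"logs": 1, "metrics": 1, "events": 1}, ["c1", "c2", "c3"])
print(skewed)
# -> {'c1': [('events', 0), ('logs', 0), ('metrics', 0)], 'c2': [], 'c3': []}
```

In this model `c2` and `c3` stay idle no matter how often the assignment is recomputed, which matches the behaviour reported for single-partition topics.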
Re: quick question about new consumer api
Hi Jason,

In the new design, consumption is still at per-partition granularity. The main rationale is ordering: within a partition we want to preserve ordering, so that a message B produced after message A will also be consumed and processed after message A. Producers can use keys to make sure messages in the same ordering group land in the same partition. To do this, we have to ensure each partition is consumed by only a single client at a time. On the other hand, if one wants to increase the number of consumers beyond the number of partitions, one can always use the topic tool to dynamically add more partitions to the topic.

Do you have a specific scenario in mind that would require single-partition topics?

Guozhang

On Mon, Jul 7, 2014 at 7:43 AM, Jason Rosenberg wrote:
> I've been looking at the new consumer api outlined here:
>
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design
>
> One issue in the current high-level consumer is that it does not do a good job of distributing a set of topics between multiple consumers, unless each topic has multiple partitions. This has always seemed strange to me, since at the end of the day, even for single-partition topics, the basic unit of consumption is still the partition (so you'd expect rebalancing to try to distribute partitions evenly, regardless of the topic).
>
> It's not clearly spelled out in the new consumer api wiki, so I'll just ask: will this issue be addressed in the new api? I think I've asked this before, but I wanted to check again, and I am not seeing this explicitly addressed in the design.
>
> Thanks
>
> Jason
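The keyed-ordering point in Guozhang's reply can be sketched in a few lines of Python. This is a toy model under stated assumptions: `partition_for` and `produce` are hypothetical helpers, and `zlib.crc32` is only a stand-in for whatever hash a real producer's partitioner uses. The idea it shows is that messages sharing a key always map to the same partition, so a single consumer reading that partition sees them in production order.

```python
import zlib
from typing import List, Tuple


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(messages: List[Tuple[str, str]], num_partitions: int) -> List[List[Tuple[str, str]]]:
    """Append each (key, value) message to the log of the partition its key maps to."""
    logs: List[List[Tuple[str, str]]] = [[] for _ in range(num_partitions)]
    for key, value in messages:
        logs[partition_for(key, num_partitions)].append((key, value))
    return logs


# "A" is produced before "B" for the same key, with another key interleaved.
messages = [("user-1", "A"), ("user-2", "X"), ("user-1", "B")]
logs = produce(messages, num_partitions=4)

# One consumer owns the partition for "user-1" and reads it front to back.
owned_partition = partition_for("user-1", 4)
values_user1 = [value for key, value in logs[owned_partition] if key == "user-1"]
print(values_user1)  # -> ['A', 'B']
```

With a hash-modulo scheme like this one, changing the partition count can move a key to a different partition, which is worth keeping in mind when adding partitions to a topic whose consumers rely on per-key ordering.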
quick question about new consumer api
I've been looking at the new consumer api outlined here:

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design

One issue in the current high-level consumer is that it does not do a good job of distributing a set of topics between multiple consumers, unless each topic has multiple partitions. This has always seemed strange to me, since at the end of the day, even for single-partition topics, the basic unit of consumption is still the partition (so you'd expect rebalancing to try to distribute partitions evenly, regardless of the topic).

It's not clearly spelled out in the new consumer api wiki, so I'll just ask: will this issue be addressed in the new api? I think I've asked this before, but I wanted to check again, and I am not seeing this explicitly addressed in the design.

Thanks

Jason