Re: Review Request 34450: Fix KAFKA-2017; rebased

Guozhang Wang Wed, 20 May 2015 10:44:59 -0700


> On May 20, 2015, 5:15 p.m., Onur Karaman wrote:
> > I only did a brief skim. This optimization tries to switch consumers over 
> > to a new coordinator without a rebalance. From my understanding, the 
> > consumers would detect a coordinator failure, discover the new coordinator 
> > to work with, and try heartbeating that new coordinator without a 
> > rebalance.
> > 
> > So it seems to me that putting the logic in handleJoinGroup isn't right, as 
> > the rebalance is what we're trying to avoid. The code should be in 
> > handleHeartbeat. It should look up ZK for the group info, add it to 
> > CoordinatorMetadata, and start up a DelayedHeartbeat for every consumer of 
> > that group.
> > 
> > **More importantly: given that this is just an optimization, and we haven't 
> > even seen the performance hit without this, I think KAFKA-2017 should be 
> > very low priority.**
> > 
> > The following are higher priority:
> > 1. Getting the consumer to properly handle error codes of the join group 
> > and heartbeat responses.
> > 2. Getting the consumer to detect coordinator failures and switch over to 
> > another coordinator (my KAFKA-1334 patch just had the coordinator detect 
> > consumer failures). A nice benefit of completing this first is that if we 
> > decide that the rebalances on coordinator failover are an actual issue, 
> > this would greatly facilitate testing any coordinator failover logic. Right 
> > now, it's unclear how this rb's logic can be tested.
> 
> Onur Karaman wrote:
>     I added a ticket for 2: 
> [KAFKA-2208](https://issues.apache.org/jira/browse/KAFKA-2208)


Thanks for the prompt response, Onur!

1. I agree with you about the priority, and I also agree that the logic for 
fetching the group metadata should be in heartbeat handling rather than in join.

2. Regarding consumer-side failure detection: the consumer would not use 
heartbeat expiration to detect a coordinator failure. It would only mark the 
current coordinator as dead upon disconnection or on receiving an error code.
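
To make point 2 concrete, here is a rough sketch of the consumer-side 
bookkeeping I have in mind. It is only an illustration under assumptions: the 
class name CoordinatorTracker, its methods, and the error-code names are all 
hypothetical and are not taken from the actual consumer code or protocol 
definitions.

```java
import java.util.Optional;

// Hypothetical sketch: tracks which broker the consumer currently treats as
// its coordinator. None of these names come from the real Kafka client.
final class CoordinatorTracker {

    // Illustrative error codes (assumed for this sketch only).
    enum ResponseError { NONE, COORDINATOR_NOT_AVAILABLE, NOT_COORDINATOR_FOR_GROUP, OTHER }

    // null means "coordinator unknown or dead; rediscover before the next request".
    private String coordinator;

    void onCoordinatorDiscovered(String brokerId) {
        coordinator = brokerId;
    }

    Optional<String> coordinator() {
        return Optional.ofNullable(coordinator);
    }

    // A disconnection from the current coordinator marks it dead.
    // Disconnections from other brokers are ignored.
    void onDisconnect(String brokerId) {
        if (brokerId.equals(coordinator)) {
            coordinator = null;
        }
    }

    // Only coordinator-related error codes mark the coordinator dead.
    void onResponse(ResponseError error) {
        if (error == ResponseError.COORDINATOR_NOT_AVAILABLE
                || error == ResponseError.NOT_COORDINATOR_FOR_GROUP) {
            coordinator = null;
        }
    }

    // Deliberately a no-op: heartbeat expiration alone is not treated as
    // evidence that the coordinator has failed.
    void onHeartbeatTimeout() {
    }
}
```

With this shape, a consumer that finds coordinator() empty would go back to 
coordinator discovery, while a missed heartbeat by itself changes nothing.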


- Guozhang


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34450/#review84539
-----------------------------------------------------------


On May 20, 2015, 4:13 p.m., Guozhang Wang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34450/
> -----------------------------------------------------------
> 
> (Updated May 20, 2015, 4:13 p.m.)
> 
> 
> Review request for kafka.
> 
> 
> Bugs: KAFKA-2017
>     https://issues.apache.org/jira/browse/KAFKA-2017
> 
> 
> Repository: kafka
> 
> 
> Description
> -------
> 
> 1. Upon receiving join-group, if the group metadata cannot be found in the 
> local cache, try to read it from ZK.
> 2. Upon completing a rebalance, update ZK with the new group registry, or 
> delete the registry if the group becomes empty.
> 
> 
> Diffs
> -----
> 
>   core/src/main/scala/kafka/coordinator/ConsumerCoordinator.scala 
> af06ad45cdc46ac3bc27898ebc1a5bd5b1c7b19e 
>   core/src/main/scala/kafka/coordinator/ConsumerGroupMetadata.scala 
> 47bdfa7cc86fd4e841e2b1d6bfd40f1508e643bd 
>   core/src/main/scala/kafka/coordinator/CoordinatorMetadata.scala 
> c39e6de34ee531c6dfa9107b830752bd7f8fbe59 
>   core/src/main/scala/kafka/utils/ZkUtils.scala 
> 2618dd39b925b979ad6e4c0abd5c6eaafb3db5d5 
> 
> Diff: https://reviews.apache.org/r/34450/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Guozhang Wang
> 
>
