[ https://issues.apache.org/jira/browse/KAFKA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980895#comment-14980895 ]
Jiangjie Qin commented on KAFKA-2017:
-------------------------------------
[~guozhang] I think the approach works. The schedule might be a little tight,
though.
We may also need to enforce that the session timeout is greater than the
request timeout. If the coordinator has a hard failure, the consumers will keep
sending HeartbeatRequest to the failed coordinator but won't receive any
HeartbeatResponse. This continues until the request timeout expires. So if the
session timeout is smaller than the request timeout (which is the current
setting), consumers might be kicked out of the group and still have issues
committing offsets.
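To make the constraint concrete, here is a minimal sketch in Python of the kind of check I have in mind. The function name, the parameter names, and the choice of a strict comparison are my assumptions for illustration; this is not existing Kafka code.

```python
def validate_timeouts(session_timeout_ms: int, request_timeout_ms: int) -> None:
    """Hypothetical config check: the session timeout must exceed the request timeout.

    Rationale from the comment above: a heartbeat sent to a dead coordinator
    only fails after the request timeout, so a shorter session timeout would
    expire before the consumer even notices the coordinator is gone.
    """
    if session_timeout_ms <= request_timeout_ms:
        raise ValueError(
            f"session timeout ({session_timeout_ms} ms) must be greater than "
            f"request timeout ({request_timeout_ms} ms)"
        )


# Usage: a valid combination passes silently, an invalid one is rejected.
validate_timeouts(session_timeout_ms=40_000, request_timeout_ms=30_000)
```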
Just to make sure we have considered all the alternatives: in terms of (1), my
original understanding is that it effectively persists group information in
the consumers themselves. The idea is that when the coordinator fails over, the
consumers will eventually talk to the new coordinator through some kind of
request, so the new coordinator just needs to silently collect the information
from the consumers. If the coordinator receives a Heartbeat or OffsetCommit
from an unknown group id or unknown consumer, it infers that the group is in a
stable state. We simply accept the request if the group is unknown and record
the consumer id, group id and generation id. For subsequent requests from
consumers, as long as the generation id matches, the coordinator just adds them
to the group. (That makes the consumer id essentially less useful, but this is
a problem we already have today, i.e. the user will always receive either
UnknownConsumerIdException or IllegalGenerationIdException.) We might need to
think a bit more about what happens if the new coordinator receives a
JoinGroupRequest or SyncGroupRequest as the first request from an unknown group
or consumer. I am not sure whether this would work, but it might be an option.
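A rough sketch of that lazy recovery, in Python for brevity. All class, method and return-value names here are hypothetical, and the handling of a generation mismatch is my assumption; it only illustrates the accept-and-record idea, not the actual coordinator code.

```python
from dataclasses import dataclass, field


@dataclass
class GroupState:
    """Group information rebuilt from consumer requests (hypothetical shape)."""
    generation_id: int
    members: set = field(default_factory=set)


class RecoveringCoordinator:
    """New coordinator that starts empty and learns groups from incoming requests."""

    def __init__(self):
        self.groups = {}  # group id -> GroupState

    def on_heartbeat_or_commit(self, group_id, consumer_id, generation_id):
        group = self.groups.get(group_id)
        if group is None:
            # Unknown group: assume it is stable and record what the consumer reports.
            self.groups[group_id] = GroupState(generation_id, {consumer_id})
            return "OK"
        if generation_id == group.generation_id:
            # Known group, matching generation: just add the consumer.
            group.members.add(consumer_id)
            return "OK"
        # Assumption: a mismatching generation is rejected as it is today.
        return "ILLEGAL_GENERATION"
```

For example, two consumers reporting generation 7 for the same group would both be accepted and recorded as members, while a third reporting generation 6 would be rejected.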
The caveat is that if the coordinator and a consumer fail at the same time,
no rebalance will be triggered by the new coordinator, because the new
coordinator depends on the consumers' periodic requests to recover group
information. Also, describe group won't work, because the assignment
information is not available unless we let the consumers send metadata again.
> Persist Coordinator State for Coordinator Failover
> --------------------------------------------------
>
> Key: KAFKA-2017
> URL: https://issues.apache.org/jira/browse/KAFKA-2017
> Project: Kafka
> Issue Type: Sub-task
> Components: consumer
> Affects Versions: 0.9.0.0
> Reporter: Onur Karaman
> Assignee: Guozhang Wang
> Priority: Blocker
> Fix For: 0.9.0.0
>
> Attachments: KAFKA-2017.patch, KAFKA-2017_2015-05-20_09:13:39.patch,
> KAFKA-2017_2015-05-21_19:02:47.patch
>
>
> When a coordinator fails, the group membership protocol tries to fail over to
> a new coordinator without forcing all the consumers to rejoin their groups.
> This is possible if the coordinator persists its state so that the state can be
> transferred during coordinator failover. This state consists of most of the
> information in GroupRegistry and ConsumerRegistry.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)