[ https://issues.apache.org/jira/browse/KAFKA-8438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856999#comment-16856999 ]
Richard Yu commented on KAFKA-8438: ----------------------------------- Sure [~bchen225242] will look into it and see what else I could think of. > Add API to allow user to define end behavior of consumer failure > ---------------------------------------------------------------- > > Key: KAFKA-8438 > URL: https://issues.apache.org/jira/browse/KAFKA-8438 > Project: Kafka > Issue Type: New Feature > Components: consumer > Reporter: Richard Yu > Priority: Major > Labels: needs-dicussion, needs-kip > > Recently, in a concerted effort to make Kafka's rebalances less painful, > various approaches has been used to reduce the number of and impact of > rebalances. Often, the trigger of a rebalance is a failure of some sort or a > thrown exception during processing, in which case, the workload will be > redistributed among surviving threads. Working to reduce rebalances due to > random consumer crashes, a recent change to Kafka internals had been made > (which introduces the concept of static membership) that prevents a rebalance > from occurring within {{session.timeout.ms}} in the hope that the consumer > thread which crashed would recover in that time interval and rejoin the group. > However, in some cases, some consumer threads would permanently go down or > remain dead for long periods of time. In these scenarios, users of Kafka > would possibly not be aware of such a crash until hours later after it > happened which forces Kafka users to manually start a new KafkaConsumer > process a considerable period of time after the failure had occurred. That is > where the addition of a callback such as {{onConsumerFailure}} would help. > There are multiple use cases for this callback (which is defined by the > user). {{onConsumerFailure}} is called when a particular consumer thread goes > under for some specified time interval (i.e. a config called > {{acceptable.consumer.failure.timeout.ms}}). When called, this method could > be used to log a consumer failure or should the user wish it, create a new > thread which would then rejoin the consumer group (which could also include > the required {{group.instance.id}} so that a rebalance wouldn't be > re-triggered –- we would need to think about that). > Should the old thread recover and attempt to rejoin the consumer group (with > the substitute thread being part of the group), the old thread will be denied > access and an exception would be thrown (to indicate that another process has > already taken its place). > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)