[jira] [Commented] (KAFKA-9102) Increase default zk session timeout and max lag
[ https://issues.apache.org/jira/browse/KAFKA-9102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960264#comment-16960264 ] ASF GitHub Bot commented on KAFKA-9102: --- hachikuji commented on pull request #7596: KAFKA-9102; Increase default zk session timeout and replica max lag [KIP-537] URL: https://github.com/apache/kafka/pull/7596 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Increase default zk session timeout and max lag > --- > > Key: KAFKA-9102 > URL: https://issues.apache.org/jira/browse/KAFKA-9102 > Project: Kafka > Issue Type: Improvement >Reporter: Jason Gustafson >Assignee: Jason Gustafson >Priority: Major > Fix For: 2.5.0 > > > This tracks the implementation of KIP-537: > https://cwiki.apache.org/confluence/display/KAFKA/KIP-537%3A+Increase+default+zookeeper+session+timeout -- This message was sent by Atlassian Jira (v8.3.4#803005)
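The defaults changed by KIP-537 can be pinned explicitly in the broker's server.properties; the values below are the ones the KIP proposes (ZooKeeper session timeout raised from 6 s to 18 s, replica max lag from 10 s to 30 s), shown here for illustration:

```properties
# KIP-537 raises these broker defaults; setting them explicitly
# pins the behavior regardless of the broker version in use.
zookeeper.session.timeout.ms=18000
replica.lag.time.max.ms=30000
```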
[jira] [Resolved] (KAFKA-9102) Increase default zk session timeout and max lag
[ https://issues.apache.org/jira/browse/KAFKA-9102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Gustafson resolved KAFKA-9102. Resolution: Fixed > Increase default zk session timeout and max lag > --- > > Key: KAFKA-9102 > URL: https://issues.apache.org/jira/browse/KAFKA-9102 > Project: Kafka > Issue Type: Improvement >Reporter: Jason Gustafson >Assignee: Jason Gustafson >Priority: Major > Fix For: 2.5.0 > > > This tracks the implementation of KIP-537: > https://cwiki.apache.org/confluence/display/KAFKA/KIP-537%3A+Increase+default+zookeeper+session+timeout -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9106) metrics exposed via JMX should be configurable
[ https://issues.apache.org/jira/browse/KAFKA-9106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xavier Léauté updated KAFKA-9106: - Description: Kafka exposes a very large number of metrics, all of which are always visible in JMX by default. On large clusters with many partitions, this may result in tens of thousands of MBeans being registered, which can lead to timeouts with some popular monitoring agents that rely on listing JMX metrics via RMI. Making the set of JMX-visible metrics configurable would allow operators to decide on the set of critical metrics to collect and work around limitations of JMX in those cases. corresponding KIP https://cwiki.apache.org/confluence/display/KAFKA/KIP-544%3A+Make+metrics+exposed+via+JMX+configurable was: Kafka exposes a very large number of metrics, all of which are always visible in JMX by default. On large clusters with many partitions, this may result in tens of thousands of mbeans to be registered, which can lead to timeouts with some popular monitoring agents that rely on listing JMX metrics via RMI. Making the set of JMX-visible metrics configurable would allow operators to decide on the set of critical metrics to collect and workaround limitation of JMX in those cases. > metrics exposed via JMX should be configurable > - > > Key: KAFKA-9106 > URL: https://issues.apache.org/jira/browse/KAFKA-9106 > Project: Kafka > Issue Type: Improvement > Components: metrics >Reporter: Xavier Léauté >Assignee: Xavier Léauté >Priority: Major > > Kafka exposes a very large number of metrics, all of which are always visible > in JMX by default. On large clusters with many partitions, this may result in > tens of thousands of MBeans being registered, which can lead to timeouts with > some popular monitoring agents that rely on listing JMX metrics via RMI. 
> Making the set of JMX-visible metrics configurable would allow operators to > decide on the set of critical metrics to collect and work around limitations of > JMX in those cases. > corresponding KIP > https://cwiki.apache.org/confluence/display/KAFKA/KIP-544%3A+Make+metrics+exposed+via+JMX+configurable -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9106) metrics exposed via JMX should be configurable
Xavier Léauté created KAFKA-9106: Summary: metrics exposed via JMX should be configurable Key: KAFKA-9106 URL: https://issues.apache.org/jira/browse/KAFKA-9106 Project: Kafka Issue Type: Improvement Components: metrics Reporter: Xavier Léauté Assignee: Xavier Léauté Kafka exposes a very large number of metrics, all of which are always visible in JMX by default. On large clusters with many partitions, this may result in tens of thousands of MBeans being registered, which can lead to timeouts with some popular monitoring agents that rely on listing JMX metrics via RMI. Making the set of JMX-visible metrics configurable would allow operators to decide on the set of critical metrics to collect and work around limitations of JMX in those cases. -- This message was sent by Atlassian Jira (v8.3.4#803005)
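The approach behind KIP-544 can be sketched language-agnostically as an include/exclude filter applied to metric names before the corresponding MBeans are registered. The config names and pattern semantics below are illustrative, not the final Kafka configuration:

```python
import re

def make_jmx_filter(include=".*", exclude=""):
    """Return a predicate deciding whether a metric/MBean name is exposed.

    A name is visible if it fully matches the include pattern and does
    not match the exclude pattern (an empty exclude matches nothing).
    """
    inc = re.compile(include)
    exc = re.compile(exclude) if exclude else None

    def visible(name):
        if not inc.fullmatch(name):
            return False
        return exc is None or not exc.fullmatch(name)

    return visible

# Expose only broker topic metrics, but drop per-partition ones
# (hypothetical MBean name patterns, for illustration).
visible = make_jmx_filter(
    include=r"kafka\.server:type=BrokerTopicMetrics,.*",
    exclude=r".*,partition=.*",
)
```

With a filter like this, an operator keeps the handful of critical metrics visible while the tens of thousands of per-partition MBeans never reach the JMX registry, which is exactly the timeout scenario the issue describes.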
[jira] [Commented] (KAFKA-9105) Truncate producer state when incrementing log start offset
[ https://issues.apache.org/jira/browse/KAFKA-9105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960145#comment-16960145 ] ASF GitHub Bot commented on KAFKA-9105: --- hachikuji commented on pull request #7599: KAFKA-9105: Add back truncateHead method to ProducerStateManager URL: https://github.com/apache/kafka/pull/7599 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Truncate producer state when incrementing log start offset > -- > > Key: KAFKA-9105 > URL: https://issues.apache.org/jira/browse/KAFKA-9105 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Bob Barrett >Assignee: Bob Barrett >Priority: Blocker > > As part of the fix for KAFKA-7190, we removed the > ProducerStateManager.truncateHead method as part of the change to retain > producer state for longer. This removed some needed producer state management > (such as removing unreplicated transactions) when incrementing the log start > offset. We need to add this functionality back in. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-9105) Truncate producer state when incrementing log start offset
[ https://issues.apache.org/jira/browse/KAFKA-9105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Gustafson resolved KAFKA-9105. Resolution: Fixed > Truncate producer state when incrementing log start offset > -- > > Key: KAFKA-9105 > URL: https://issues.apache.org/jira/browse/KAFKA-9105 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Bob Barrett >Assignee: Bob Barrett >Priority: Blocker > > As part of the fix for KAFKA-7190, we removed the > ProducerStateManager.truncateHead method as part of the change to retain > producer state for longer. This removed some needed producer state management > (such as removing unreplicated transactions) when incrementing the log start > offset. We need to add this functionality back in. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-8968) Refactor Task-level Metrics
[ https://issues.apache.org/jira/browse/KAFKA-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Bejeck updated KAFKA-8968: --- Affects Version/s: 2.5.0 > Refactor Task-level Metrics > --- > > Key: KAFKA-8968 > URL: https://issues.apache.org/jira/browse/KAFKA-8968 > Project: Kafka > Issue Type: Improvement > Components: streams >Affects Versions: 2.5.0 >Reporter: Bruno Cadonna >Assignee: Bruno Cadonna >Priority: Major > > Refactor task-level metrics as proposed in KIP-444. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-8968) Refactor Task-level Metrics
[ https://issues.apache.org/jira/browse/KAFKA-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Bejeck resolved KAFKA-8968. Resolution: Fixed > Refactor Task-level Metrics > --- > > Key: KAFKA-8968 > URL: https://issues.apache.org/jira/browse/KAFKA-8968 > Project: Kafka > Issue Type: Improvement > Components: streams >Affects Versions: 2.5.0 >Reporter: Bruno Cadonna >Assignee: Bruno Cadonna >Priority: Major > Fix For: 2.5.0 > > > Refactor task-level metrics as proposed in KIP-444. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-8968) Refactor Task-level Metrics
[ https://issues.apache.org/jira/browse/KAFKA-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Bejeck updated KAFKA-8968: --- Fix Version/s: 2.5.0 > Refactor Task-level Metrics > --- > > Key: KAFKA-8968 > URL: https://issues.apache.org/jira/browse/KAFKA-8968 > Project: Kafka > Issue Type: Improvement > Components: streams >Affects Versions: 2.5.0 >Reporter: Bruno Cadonna >Assignee: Bruno Cadonna >Priority: Major > Fix For: 2.5.0 > > > Refactor task-level metrics as proposed in KIP-444. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-8968) Refactor Task-level Metrics
[ https://issues.apache.org/jira/browse/KAFKA-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960078#comment-16960078 ] ASF GitHub Bot commented on KAFKA-8968: --- bbejeck commented on pull request #7566: KAFKA-8968: Refactor task-level metrics URL: https://github.com/apache/kafka/pull/7566 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Refactor Task-level Metrics > --- > > Key: KAFKA-8968 > URL: https://issues.apache.org/jira/browse/KAFKA-8968 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Bruno Cadonna >Assignee: Bruno Cadonna >Priority: Major > > Refactor task-level metrics as proposed in KIP-444. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-6453) Reconsider timestamp propagation semantics
[ https://issues.apache.org/jira/browse/KAFKA-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959956#comment-16959956 ] Victoria Bialas edited comment on KAFKA-6453 at 10/25/19 5:58 PM: -- Helpful links: * Related KIP-258: [https://cwiki.apache.org/confluence/display/KAFKA/KIP-258%3A+Allow+to+Store+Record+Timestamps+in+RocksDB] * KIP-251 (upon which this new work is based): [https://cwiki.apache.org/confluence/display/KAFKA/KIP-251%3A+Allow+timestamp+manipulation+in+Processor+API] * Docs in Confluent pertinent to WindowStore: [Windowing|#windowing] was (Author: orangesnap): Helpful links: * Related KIP-258: [https://cwiki.apache.org/confluence/display/KAFKA/KIP-258%3A+Allow+to+Store+Record+Timestamps+in+RocksDB] * Earlier KIP (upon which this new work is based) KIP-251: [https://cwiki.apache.org/confluence/display/KAFKA/KIP-251%3A+Allow+timestamp+manipulation+in+Processor+API] * Docs in Confluent pertinent to WindowStore: [Windowing|#windowing] > Reconsider timestamp propagation semantics > -- > > Key: KAFKA-6453 > URL: https://issues.apache.org/jira/browse/KAFKA-6453 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Matthias J. Sax >Assignee: Victoria Bialas >Priority: Major > Labels: needs-kip > > At the moment, Kafka Streams only has a defined "contract" for timestamp propagation > at the Processor API level: all processors within a sub-topology see the > timestamp of the input topic record, and this timestamp will be used for all > result records when writing them to a topic, too. > The DSL currently inherits this "contract". > From a DSL point of view, it would be desirable to provide a different > contract to the user. 
To allow this, we need to do the following: > - extend the Processor API to allow manipulating timestamps (i.e., a Processor can > set a new timestamp for downstream records) > - define a DSL "contract" for timestamp propagation for each DSL operator > - document the DSL "contract" > - implement the DSL "contract" using the new/extended Processor API -- This message was sent by Atlassian Jira (v8.3.4#803005)
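The contract described above, and the proposed extension, can be illustrated with a toy processor model (the class and method names here are hypothetical, not the actual Streams API): by default a processor forwards its input record's timestamp unchanged, and the extension lets a processor set a new timestamp for downstream records.

```python
class Record:
    def __init__(self, value, timestamp):
        self.value = value
        self.timestamp = timestamp

class Processor:
    """Toy model of the current Processor API contract."""

    def process(self, record):
        # Default contract: the input record's timestamp is propagated
        # unchanged to every result record.
        return Record(self.transform(record.value), record.timestamp)

    def transform(self, value):
        return value

class StampingProcessor(Processor):
    """The proposed extension: a processor may set a new timestamp
    for downstream records instead of inheriting the input's."""

    def __init__(self, new_timestamp):
        self.new_timestamp = new_timestamp

    def process(self, record):
        out = super().process(record)
        out.timestamp = self.new_timestamp
        return out
```

Defining the DSL "contract" then amounts to specifying, per operator, which of these two behaviors (inherit vs. override) applies.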
[jira] [Comment Edited] (KAFKA-6453) Reconsider timestamp propagation semantics
[ https://issues.apache.org/jira/browse/KAFKA-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959956#comment-16959956 ] Victoria Bialas edited comment on KAFKA-6453 at 10/25/19 5:57 PM: -- Helpful links: * Related KIP: [KIP-258|[https://cwiki.apache.org/confluence/display/KAFKA/KIP-258%3A+Allow+to+Store+Record+Timestamps+in+RocksDB]] * Earlier KIP (upon which this new work is based): [KIP-251]([https://cwiki.apache.org/confluence/display/KAFKA/KIP-251%3A+Allow+timestamp+manipulation+in+Processor+API]) * Docs in Confluent pertinent to WindowStore: [Windowing|#windowing]] was (Author: orangesnap): Helpful links: * Related KIP: [KIP-258|[https://cwiki.apache.org/confluence/display/KAFKA/KIP-258%3A+Allow+to+Store+Record+Timestamps+in+RocksDB]] * Earlier KIP (upon which this new work is based): [KIP-251]([https://cwiki.apache.org/confluence/display/KAFKA/KIP-251%3A+Allow+timestamp+manipulation+in+Processor+API]) * Docs in Confluent pertinent to WindowStore: [Windowing|[https://docs.confluent.io/current/streams/concepts.html#windowing]] > Reconsider timestamp propagation semantics > -- > > Key: KAFKA-6453 > URL: https://issues.apache.org/jira/browse/KAFKA-6453 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Matthias J. Sax >Assignee: Victoria Bialas >Priority: Major > Labels: needs-kip > > At the moment, Kafka Streams only has a defined "contract" for timestamp propagation > at the Processor API level: all processors within a sub-topology see the > timestamp of the input topic record, and this timestamp will be used for all > result records when writing them to a topic, too. > The DSL currently inherits this "contract". > From a DSL point of view, it would be desirable to provide a different > contract to the user. 
To allow this, we need to do the following: > - extend the Processor API to allow manipulating timestamps (i.e., a Processor can > set a new timestamp for downstream records) > - define a DSL "contract" for timestamp propagation for each DSL operator > - document the DSL "contract" > - implement the DSL "contract" using the new/extended Processor API -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-6453) Reconsider timestamp propagation semantics
[ https://issues.apache.org/jira/browse/KAFKA-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959956#comment-16959956 ] Victoria Bialas edited comment on KAFKA-6453 at 10/25/19 5:57 PM: -- Helpful links: * Related KIP-258: [https://cwiki.apache.org/confluence/display/KAFKA/KIP-258%3A+Allow+to+Store+Record+Timestamps+in+RocksDB] * Earlier KIP (upon which this new work is based) KIP-251: [https://cwiki.apache.org/confluence/display/KAFKA/KIP-251%3A+Allow+timestamp+manipulation+in+Processor+API] * Docs in Confluent pertinent to WindowStore: [Windowing|#windowing] was (Author: orangesnap): Helpful links: * Related KIP: [KIP-258|[https://cwiki.apache.org/confluence/display/KAFKA/KIP-258%3A+Allow+to+Store+Record+Timestamps+in+RocksDB]] * Earlier KIP (upon which this new work is based): [KIP-251]([https://cwiki.apache.org/confluence/display/KAFKA/KIP-251%3A+Allow+timestamp+manipulation+in+Processor+API]) * Docs in Confluent pertinent to WindowStore: [Windowing|#windowing]] > Reconsider timestamp propagation semantics > -- > > Key: KAFKA-6453 > URL: https://issues.apache.org/jira/browse/KAFKA-6453 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Matthias J. Sax >Assignee: Victoria Bialas >Priority: Major > Labels: needs-kip > > At the moment, Kafka Streams only has a defined "contract" for timestamp propagation > at the Processor API level: all processors within a sub-topology see the > timestamp of the input topic record, and this timestamp will be used for all > result records when writing them to a topic, too. > The DSL currently inherits this "contract". > From a DSL point of view, it would be desirable to provide a different > contract to the user. 
To allow this, we need to do the following: > - extend the Processor API to allow manipulating timestamps (i.e., a Processor can > set a new timestamp for downstream records) > - define a DSL "contract" for timestamp propagation for each DSL operator > - document the DSL "contract" > - implement the DSL "contract" using the new/extended Processor API -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-6453) Reconsider timestamp propagation semantics
[ https://issues.apache.org/jira/browse/KAFKA-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959956#comment-16959956 ] Victoria Bialas commented on KAFKA-6453: Helpful links: * Related KIP: [KIP-258|[https://cwiki.apache.org/confluence/display/KAFKA/KIP-258%3A+Allow+to+Store+Record+Timestamps+in+RocksDB]] * Earlier KIP (upon which this new work is based): [KIP-251]([https://cwiki.apache.org/confluence/display/KAFKA/KIP-251%3A+Allow+timestamp+manipulation+in+Processor+API]) * Docs in Confluent pertinent to WindowStore: [Windowing|[https://docs.confluent.io/current/streams/concepts.html#windowing]] > Reconsider timestamp propagation semantics > -- > > Key: KAFKA-6453 > URL: https://issues.apache.org/jira/browse/KAFKA-6453 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Matthias J. Sax >Assignee: Victoria Bialas >Priority: Major > Labels: needs-kip > > At the moment, Kafka Streams only has a defined "contract" for timestamp propagation > at the Processor API level: all processors within a sub-topology see the > timestamp of the input topic record, and this timestamp will be used for all > result records when writing them to a topic, too. > The DSL currently inherits this "contract". > From a DSL point of view, it would be desirable to provide a different > contract to the user. To allow this, we need to do the following: > - extend the Processor API to allow manipulating timestamps (i.e., a Processor can > set a new timestamp for downstream records) > - define a DSL "contract" for timestamp propagation for each DSL operator > - document the DSL "contract" > - implement the DSL "contract" using the new/extended Processor API -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching
[ https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959954#comment-16959954 ] Andrew Olson commented on KAFKA-8950: - Ok, sounds good. Thanks. > KafkaConsumer stops fetching > > > Key: KAFKA-8950 > URL: https://issues.apache.org/jira/browse/KAFKA-8950 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 2.3.0 > Environment: linux >Reporter: Will James >Priority: Major > Fix For: 2.4.0, 2.3.1 > > > We have a KafkaConsumer consuming from a single partition with > enable.auto.commit set to true. > Very occasionally, the consumer goes into a broken state. It returns no > records from the broker with every poll, and from most of the Kafka metrics > in the consumer it looks like it is fully caught up to the end of the log. > We see that we are long polling for the max poll timeout, and that there is > zero lag. In addition, we see that the heartbeat rate stays unchanged from > before the issue begins (so the consumer stays a part of the consumer group). > In addition, from looking at the __consumer_offsets topic, it is possible to > see that the consumer is committing the same offset on the auto commit > interval, however, the offset does not move, and the lag from the broker's > perspective continues to increase. > The issue is only resolved by restarting our application (which restarts the > KafkaConsumer instance). > From a heap dump of an application in this state, I can see that the Fetcher > is in a state where it believes there are nodesWithPendingFetchRequests. > However, I can see the state of the fetch latency sensor, specifically, the > fetch rate, and see that the samples were not updated for a long period of > time (actually, precisely the amount of time that the problem in our > application was occurring, around 50 hours - we have alerting on other > metrics but not the fetch rate, so we didn't notice the problem until a > customer complained). 
> In this example, the consumer was processing around 40 messages per second, > with an average size of about 10kb, although most of the other examples of > this have happened with higher volume (250 messages / second, around 23kb per > message on average). > I have spent some time investigating the issue on our end, and will continue > to do so as time allows, however I wanted to raise this as an issue because > it may be affecting other people. > Please let me know if you have any questions or need additional information. > I doubt I can provide heap dumps unfortunately, but I can provide further > information as needed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
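A notable detail in the report above is that the fetch-rate sensor stopped updating for exactly the duration of the outage, while heartbeats and offset commits continued. That suggests a simple external watchdog: alert when the consumer's last observed fetch is older than some threshold. The sketch below is hypothetical plumbing, not part of the Kafka client; a real deployment would feed it from the client's fetch-rate metric:

```python
import time

class FetchWatchdog:
    """Flags a consumer whose fetches have stalled even though it
    still heartbeats and commits the same offset (the symptom above)."""

    def __init__(self, max_silence_s, clock=time.monotonic):
        self.max_silence_s = max_silence_s
        self.clock = clock  # injectable for testing
        self.last_fetch = clock()

    def record_fetch(self):
        # Call whenever a fetch response with data is observed.
        self.last_fetch = self.clock()

    def stalled(self):
        return self.clock() - self.last_fetch > self.max_silence_s
```

Alerting on fetch silence rather than lag alone would have caught this failure in minutes instead of the ~50 hours described.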
[jira] [Updated] (KAFKA-9105) Truncate producer state when incrementing log start offset
[ https://issues.apache.org/jira/browse/KAFKA-9105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bob Barrett updated KAFKA-9105: --- Description: As part of the fix for KAFKA-7190, we removed the ProducerStateManager.truncateHead method as part of the change to retain producer state for longer. This removed some needed producer state management (such as removing unreplicated transactions) when incrementing the log start offset. We need to add this functionality back in. (was: In github.com/apache/kafka/commit/c49775b, we removed the ProducerStateManager.truncateHead method as part of the change to retain producer state for longer. This removed some needed producer state management (such as removing unreplicated transactions) when incrementing the log start offset. We need to add this functionality back in.) > Truncate producer state when incrementing log start offset > -- > > Key: KAFKA-9105 > URL: https://issues.apache.org/jira/browse/KAFKA-9105 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Bob Barrett >Assignee: Bob Barrett >Priority: Blocker > > As part of the fix for KAFKA-7190, we removed the > ProducerStateManager.truncateHead method as part of the change to retain > producer state for longer. This removed some needed producer state management > (such as removing unreplicated transactions) when incrementing the log start > offset. We need to add this functionality back in. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9105) Truncate producer state when incrementing log start offset
Bob Barrett created KAFKA-9105: -- Summary: Truncate producer state when incrementing log start offset Key: KAFKA-9105 URL: https://issues.apache.org/jira/browse/KAFKA-9105 Project: Kafka Issue Type: Bug Affects Versions: 2.4.0 Reporter: Bob Barrett Assignee: Bob Barrett In github.com/apache/kafka/commit/c49775b, we removed the ProducerStateManager.truncateHead method as part of the change to retain producer state for longer. This removed some needed producer state management (such as removing unreplicated transactions) when incrementing the log start offset. We need to add this functionality back in. -- This message was sent by Atlassian Jira (v8.3.4#803005)
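The removed behavior can be modeled with a toy producer-state map (illustrative only, not the real ProducerStateManager): when the log start offset advances, entries whose last written offset now precedes it are evicted, which is also where cleanup such as removing unreplicated transactions hooked in.

```python
class ToyProducerStateManager:
    """Toy model: maps producer_id -> last offset written by that producer."""

    def __init__(self):
        self.producers = {}

    def update(self, producer_id, offset):
        self.producers[producer_id] = offset

    def truncate_head(self, new_log_start_offset):
        # Drop state for producers whose last offset precedes the new
        # log start offset (analogous to the removed truncateHead).
        self.producers = {
            pid: off
            for pid, off in self.producers.items()
            if off >= new_log_start_offset
        }
```

Skipping this step, as the regression did, leaves stale producer entries alive past the log start offset.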
[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching
[ https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959918#comment-16959918 ] Tom Lee commented on KAFKA-8950: Yeah sounds very similar. If 2.3.1 doesn't help when it ships, I'd open a new ticket & attach a heap dump and a stack trace next time it gets stuck. > KafkaConsumer stops fetching > > > Key: KAFKA-8950 > URL: https://issues.apache.org/jira/browse/KAFKA-8950 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 2.3.0 > Environment: linux >Reporter: Will James >Priority: Major > Fix For: 2.4.0, 2.3.1 > > > We have a KafkaConsumer consuming from a single partition with > enable.auto.commit set to true. > Very occasionally, the consumer goes into a broken state. It returns no > records from the broker with every poll, and from most of the Kafka metrics > in the consumer it looks like it is fully caught up to the end of the log. > We see that we are long polling for the max poll timeout, and that there is > zero lag. In addition, we see that the heartbeat rate stays unchanged from > before the issue begins (so the consumer stays a part of the consumer group). > In addition, from looking at the __consumer_offsets topic, it is possible to > see that the consumer is committing the same offset on the auto commit > interval, however, the offset does not move, and the lag from the broker's > perspective continues to increase. > The issue is only resolved by restarting our application (which restarts the > KafkaConsumer instance). > From a heap dump of an application in this state, I can see that the Fetcher > is in a state where it believes there are nodesWithPendingFetchRequests. 
> However, I can see the state of the fetch latency sensor, specifically, the > fetch rate, and see that the samples were not updated for a long period of > time (actually, precisely the amount of time that the problem in our > application was occurring, around 50 hours - we have alerting on other > metrics but not the fetch rate, so we didn't notice the problem until a > customer complained). > In this example, the consumer was processing around 40 messages per second, > with an average size of about 10kb, although most of the other examples of > this have happened with higher volume (250 messages / second, around 23kb per > message on average). > I have spent some time investigating the issue on our end, and will continue > to do so as time allows, however I wanted to raise this as an issue because > it may be affecting other people. > Please let me know if you have any questions or need additional information. > I doubt I can provide heap dumps unfortunately, but I can provide further > information as needed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching
[ https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959913#comment-16959913 ] tony mancill commented on KAFKA-8950: - [~noslowerdna] That sounds exactly like the issue we encountered (I work with [~thomaslee]). We rebuilt the 2.3.0 tag + Will's patch and are looking forward to 2.3.1. > KafkaConsumer stops fetching > > > Key: KAFKA-8950 > URL: https://issues.apache.org/jira/browse/KAFKA-8950 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 2.3.0 > Environment: linux >Reporter: Will James >Priority: Major > Fix For: 2.4.0, 2.3.1 > > > We have a KafkaConsumer consuming from a single partition with > enable.auto.commit set to true. > Very occasionally, the consumer goes into a broken state. It returns no > records from the broker with every poll, and from most of the Kafka metrics > in the consumer it looks like it is fully caught up to the end of the log. > We see that we are long polling for the max poll timeout, and that there is > zero lag. In addition, we see that the heartbeat rate stays unchanged from > before the issue begins (so the consumer stays a part of the consumer group). > In addition, from looking at the __consumer_offsets topic, it is possible to > see that the consumer is committing the same offset on the auto commit > interval, however, the offset does not move, and the lag from the broker's > perspective continues to increase. > The issue is only resolved by restarting our application (which restarts the > KafkaConsumer instance). > From a heap dump of an application in this state, I can see that the Fetcher > is in a state where it believes there are nodesWithPendingFetchRequests. 
> However, I can see the state of the fetch latency sensor, specifically, the > fetch rate, and see that the samples were not updated for a long period of > time (actually, precisely the amount of time that the problem in our > application was occurring, around 50 hours - we have alerting on other > metrics but not the fetch rate, so we didn't notice the problem until a > customer complained). > In this example, the consumer was processing around 40 messages per second, > with an average size of about 10kb, although most of the other examples of > this have happened with higher volume (250 messages / second, around 23kb per > message on average). > I have spent some time investigating the issue on our end, and will continue > to do so as time allows, however I wanted to raise this as an issue because > it may be affecting other people. > Please let me know if you have any questions or need additional information. > I doubt I can provide heap dumps unfortunately, but I can provide further > information as needed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9098) Name Repartition Filter, Source, and Sink Processors
[ https://issues.apache.org/jira/browse/KAFKA-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959900#comment-16959900 ] ASF GitHub Bot commented on KAFKA-9098: --- bbejeck commented on pull request #7598: KAFKA-9098: When users name repartition topic, use the name for the repartition filter, source and sink node. URL: https://github.com/apache/kafka/pull/7598 When users specify a name for a repartition topic, we should use the same name for the repartition filter, source, and sink nodes. With the addition of KIP-307, if users go to the effort of naming every node in the topology, having processor nodes with generated names is inconsistent behavior. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Name Repartition Filter, Source, and Sink Processors > > > Key: KAFKA-9098 > URL: https://issues.apache.org/jira/browse/KAFKA-9098 > Project: Kafka > Issue Type: Improvement > Components: streams >Affects Versions: 2.2.0, 2.3.0 >Reporter: Bill Bejeck >Assignee: Bill Bejeck >Priority: Major > > When users provide a name for repartition topics, we should use the same name as > the base for the filter, source, and sink operators. 
While this does not > break a topology, users providing names for all processors in a DSL topology > may find the generated names for the repartition topic's filter, source, and > sink operators inconsistent with the naming approach. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching
[ https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959891#comment-16959891 ] Andrew Olson commented on KAFKA-8950: - Thank you for the detailed and helpful response, [~thomaslee]. The symptom we see is that a random partition occasionally (maybe once every month or two) gets "stuck" and accumulates lag indefinitely while all other partitions continue to be consumed normally, with the only solution being to restart the consumer. Our consumer groups are relatively large (about 20 to 80 members) with each member reading from about 10 to 25 partitions usually, some high volume (up to 5k messages per second) and some low. We don't auto-commit offsets. > KafkaConsumer stops fetching > > > Key: KAFKA-8950 > URL: https://issues.apache.org/jira/browse/KAFKA-8950 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 2.3.0 > Environment: linux >Reporter: Will James >Priority: Major > Fix For: 2.4.0, 2.3.1 > > > We have a KafkaConsumer consuming from a single partition with > enable.auto.commit set to true. > Very occasionally, the consumer goes into a broken state. It returns no > records from the broker with every poll, and from most of the Kafka metrics > in the consumer it looks like it is fully caught up to the end of the log. > We see that we are long polling for the max poll timeout, and that there is > zero lag. In addition, we see that the heartbeat rate stays unchanged from > before the issue begins (so the consumer stays a part of the consumer group). > In addition, from looking at the __consumer_offsets topic, it is possible to > see that the consumer is committing the same offset on the auto commit > interval, however, the offset does not move, and the lag from the broker's > perspective continues to increase. > The issue is only resolved by restarting our application (which restarts the > KafkaConsumer instance). 
> From a heap dump of an application in this state, I can see that the Fetcher > is in a state where it believes there are nodesWithPendingFetchRequests. > However, I can see the state of the fetch latency sensor, specifically, the > fetch rate, and see that the samples were not updated for a long period of > time (actually, precisely the amount of time that the problem in our > application was occurring, around 50 hours - we have alerting on other > metrics but not the fetch rate, so we didn't notice the problem until a > customer complained). > In this example, the consumer was processing around 40 messages per second, > with an average size of about 10kb, although most of the other examples of > this have happened with higher volume (250 messages / second, around 23kb per > message on average). > I have spent some time investigating the issue on our end, and will continue > to do so as time allows, however I wanted to raise this as an issue because > it may be affecting other people. > Please let me know if you have any questions or need additional information. > I doubt I can provide heap dumps unfortunately, but I can provide further > information as needed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
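The failure signature described above (the same offset committed over and over while broker-side lag keeps growing) can be caught externally by comparing the committed offset against the log end offset over successive checks. Below is a minimal watchdog sketch of that idea; the class name and threshold are hypothetical, and this is not Kafka code:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Watchdog sketch: flags a partition whose committed offset has not moved
 * across N successive checks even though the log end offset kept growing.
 */
public class StuckPartitionDetector {
    private final int checksBeforeStuck; // hypothetical threshold
    private final Map<String, Long> lastCommitted = new HashMap<>();
    private final Map<String, Integer> stalledChecks = new HashMap<>();

    public StuckPartitionDetector(int checksBeforeStuck) {
        this.checksBeforeStuck = checksBeforeStuck;
    }

    /** Returns true once the partition has been stalled-with-lag for the threshold. */
    public boolean check(String topicPartition, long committedOffset, long logEndOffset) {
        Long previous = lastCommitted.put(topicPartition, committedOffset);
        boolean stalled = previous != null
                && previous == committedOffset     // commit is not advancing...
                && logEndOffset > committedOffset; // ...but there is lag to consume
        int count = stalled ? stalledChecks.merge(topicPartition, 1, Integer::sum) : 0;
        if (!stalled) {
            stalledChecks.put(topicPartition, 0); // progress resets the counter
        }
        return count >= checksBeforeStuck;
    }
}
```

Feeding it the committed and log-end offsets per partition (e.g. as reported by `kafka-consumer-groups.sh --describe`) once a minute would have flagged the stuck partition well before a customer complained, independently of the fetch-rate metric.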
[jira] [Commented] (KAFKA-9069) Flaky Test AdminClientIntegrationTest.testCreatePartitions
[ https://issues.apache.org/jira/browse/KAFKA-9069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959850#comment-16959850 ] Bruno Cadonna commented on KAFKA-9069: -- https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/26176/testReport/junit/kafka.api/SslAdminClientIntegrationTest/testCreatePartitions/ > Flaky Test AdminClientIntegrationTest.testCreatePartitions > -- > > Key: KAFKA-9069 > URL: https://issues.apache.org/jira/browse/KAFKA-9069 > Project: Kafka > Issue Type: Bug > Components: admin, core, unit tests >Reporter: Matthias J. Sax >Priority: Major > Labels: flaky-test > > [https://builds.apache.org/job/kafka-pr-jdk11-scala2.13/2792/testReport/junit/kafka.api/AdminClientIntegrationTest/testCreatePartitions/] > {quote}java.lang.AssertionError: validateOnly expected:<3> but was:<1> at > org.junit.Assert.fail(Assert.java:89) at > org.junit.Assert.failNotEquals(Assert.java:835) at > org.junit.Assert.assertEquals(Assert.java:647) at > kafka.api.AdminClientIntegrationTest.$anonfun$testCreatePartitions$5(AdminClientIntegrationTest.scala:651) > at > kafka.api.AdminClientIntegrationTest.$anonfun$testCreatePartitions$5$adapted(AdminClientIntegrationTest.scala:601) > at scala.collection.immutable.List.foreach(List.scala:305) at > kafka.api.AdminClientIntegrationTest.testCreatePartitions(AdminClientIntegrationTest.scala:601){quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-7447) Consumer offsets lost during leadership rebalance after bringing node back from clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959834#comment-16959834 ] Valentin Florea commented on KAFKA-7447: I also ran into something similar while restarting the brokers in the cluster one by one (in order to add a new user). Initially we thought it was this bug, but it turned out to be a misconfiguration of our cluster: our offsets.topic.replication.factor setting was set to 1 for some reason. We modified it to 3 and now we should be fine. This was quite helpful: http://javierholguera.com/2018/06/13/kafka-defaults-that-you-should-re-consider-i/ > Consumer offsets lost during leadership rebalance after bringing node back > from clean shutdown > -- > > Key: KAFKA-7447 > URL: https://issues.apache.org/jira/browse/KAFKA-7447 > Project: Kafka > Issue Type: Bug >Affects Versions: 1.1.1, 2.0.0 >Reporter: Ben Isaacs >Priority: Major > > *Summary:* > * When 1 of my 3 brokers is cleanly shut down, consumption and production > continues as normal due to replication. (Consumers are rebalanced to the > replicas, and producers are rebalanced to the remaining brokers). However, > when the cleanly-shut-down broker comes back, after about 10 minutes, a > flurry of production errors occur and my consumers suddenly go back in time 2 > weeks, causing a long outage (12 hours+) as all messages are replayed on some > topics. > * The hypothesis is that the auto-leadership-rebalance is happening too > quickly after the downed broker returns, before it has had a chance to become > fully synchronised on all partitions. In particular, it seems that having > consumer offsets ahead of the most recent data on the topic that consumer was > following causes the consumer to be reset to 0. > *Expected:* > * bringing a node back from a clean shut down does not cause any consumers > to reset to 0. 
> *Actual:* > * I experience approximately 12 hours of partial outage triggered at the > point that auto leadership rebalance occurs, after a cleanly shut down node > returns. > *Workaround:* > * disable auto leadership rebalance entirely. > * manually rebalance it from time to time when all nodes and all partitions > are fully replicated. > *My Setup:* > * Kafka deployment with 3 brokers and 2 topics. > * Replication factor is 3, for all topics. > * min.isr is 2, for all topics. > * Zookeeper deployment with 3 instances. > * In the region of 10 to 15 consumers, with 2 user topics (and, of course, > the system topics such as consumer offsets). Consumer offsets has the > standard 50 partitions. The user topics have about 3000 partitions in total. > * Offset retention time of 7 days, and topic retention time of 14 days. > * Input rate ~1000 messages/sec. > * Deployment happens to be on Google compute engine. > *Related Stack Overflow Post:* > https://stackoverflow.com/questions/52367825/apache-kafka-loses-some-consumer-offsets-when-when-i-bounce-a-broker > It was suggested I open a ticket by "Muir" who says they have also > experienced this. > *Transcription of logs, showing the problem:* > Below, you can see chronologically sorted, interleaved, logs from the 3 > brokers. prod-kafka-2 is the node which was cleanly shut down and then > restarted. I filtered the messages only to those regarding > __consumer_offsets-29 because it's just too much to paste, otherwise. > ||Broker host||Broker ID|| > |prod-kafka-1|0| > |prod-kafka-2|1 (this one was restarted)| > |prod-kafka-3|2| > prod-kafka-2: (just starting up) > {code} > [2018-09-17 09:21:46,246] WARN [ReplicaFetcher replicaId=1, leaderId=2, > fetcherId=0] Based on follower's leader epoch, leader replied with an unknown > offset in __consumer_offsets-29. The initial fetch offset 0 will be used for > truncation. 
(kafka.server.ReplicaFetcherThread) > {code} > prod-kafka-3: (sees replica1 come back) > {code} > [2018-09-17 09:22:02,027] INFO [Partition __consumer_offsets-29 broker=2] > Expanding ISR from 0,2 to 0,2,1 (kafka.cluster.Partition) > {code} > prod-kafka-2: > {code} > [2018-09-17 09:22:33,892] INFO [GroupMetadataManager brokerId=1] Scheduling > unloading of offsets and group metadata from __consumer_offsets-29 > (kafka.coordinator.group.GroupMetadataManager) > [2018-09-17 09:22:33,902] INFO [GroupMetadataManager brokerId=1] Finished > unloading __consumer_offsets-29. Removed 0 cached offsets and 0 cached > groups. (kafka.coordinator.group.GroupMetadataManager) > [2018-09-17 09:24:03,287] INFO [ReplicaFetcherManager on broker 1] Removed > fetcher for partitions __consumer_offsets-29 > (kafka.server.ReplicaFetcherManager) > [2018-09-17 09:24:03,287] INFO [Partition __consumer_offsets-29 broker=1]
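The workaround and the configuration fix mentioned in this thread map to a handful of broker settings. A sketch of the relevant `server.properties` entries, using the values reported in this thread rather than universal recommendations:

```properties
# Ensure the offsets topic is replicated (one outage above was traced to
# offsets.topic.replication.factor being left at 1).
offsets.topic.replication.factor=3
min.insync.replicas=2

# Workaround described above: disable automatic preferred-leader rebalancing
# and trigger it manually once all partitions are fully replicated.
auto.leader.rebalance.enable=false
```

Note that `offsets.topic.replication.factor` only applies when `__consumer_offsets` is first created; an existing under-replicated offsets topic must be reassigned explicitly.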
[jira] [Issue Comment Deleted] (KAFKA-7447) Consumer offsets lost during leadership rebalance after bringing node back from clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Valentin Florea updated KAFKA-7447: --- Comment: was deleted (was: We're running in production Kafka 2.0.0 and just ran this morning on the same issue. We had to restart the cluster in order to add a new user (not a great mechanism of adding users also). Really serious problem and cost us 4 hours of fixing to bring all systems up to date from a data-consistency standpoint. Any ETA on a fix for this?)
[jira] [Commented] (KAFKA-7447) Consumer offsets lost during leadership rebalance after bringing node back from clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959825#comment-16959825 ] Ismael Juma commented on KAFKA-7447: [~timvanlaer] you can try the 2.2.2 RC or go straight to the recently announced 2.3.1. :)
[jira] [Comment Edited] (KAFKA-7447) Consumer offsets lost during leadership rebalance after bringing node back from clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959803#comment-16959803 ] Tim Van Laer edited comment on KAFKA-7447 at 10/25/19 2:46 PM: --- During offset loading in the recovering broker, following exception was thrown, I don't know if that can cause the issue. {code:java} [2019-10-25 05:09:08,151] INFO [GroupMetadataManager brokerId=2] Scheduling loading of offsets and group metadata from __consumer_offsets-0 (kafka.coordinator.group.GroupMetadataManager) ... [2019-10-25 05:09:12,972] ERROR [GroupMetadataManager brokerId=2] Error loading offsets from __consumer_offsets-0 (kafka.coordinator.group.GroupMetadataManager) java.util.NoSuchElementException: key not found: redacted-kafkastreams-application-4cbd9546-7b7b-4436-942f-26abae8aeb19-StreamThread-1-consumer-d994fd76-b31e-44ad-a27a-6f43ff0e9e78 at scala.collection.MapLike.default(MapLike.scala:235) at scala.collection.MapLike.default$(MapLike.scala:234) at scala.collection.AbstractMap.default(Map.scala:63) at scala.collection.mutable.HashMap.apply(HashMap.scala:69) at kafka.coordinator.group.GroupMetadata.get(GroupMetadata.scala:203) at kafka.coordinator.group.GroupCoordinator.$anonfun$tryCompleteHeartbeat$1(GroupCoordinator.scala:927) at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251) at kafka.coordinator.group.GroupMetadata.inLock(GroupMetadata.scala:198) at kafka.coordinator.group.GroupCoordinator.tryCompleteHeartbeat(GroupCoordinator.scala:920) at kafka.coordinator.group.DelayedHeartbeat.tryComplete(DelayedHeartbeat.scala:34) at kafka.server.DelayedOperation.maybeTryComplete(DelayedOperation.scala:121) at kafka.server.DelayedOperationPurgatory$Watchers.tryCompleteWatched(DelayedOperation.scala:388) at kafka.server.DelayedOperationPurgatory.checkAndComplete(DelayedOperation.scala:294) at 
kafka.coordinator.group.GroupCoordinator.completeAndScheduleNextExpiration(GroupCoordinator.scala:737) at kafka.coordinator.group.GroupCoordinator.completeAndScheduleNextHeartbeatExpiration(GroupCoordinator.scala:730) at kafka.coordinator.group.GroupCoordinator.$anonfun$onGroupLoaded$3(GroupCoordinator.scala:677) at kafka.coordinator.group.GroupCoordinator.$anonfun$onGroupLoaded$3$adapted(GroupCoordinator.scala:677) at scala.collection.immutable.List.foreach(List.scala:392) at kafka.coordinator.group.GroupCoordinator.$anonfun$onGroupLoaded$1(GroupCoordinator.scala:677) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251) at kafka.coordinator.group.GroupMetadata.inLock(GroupMetadata.scala:198) at kafka.coordinator.group.GroupCoordinator.onGroupLoaded(GroupCoordinator.scala:670) at kafka.coordinator.group.GroupCoordinator.$anonfun$handleGroupImmigration$1(GroupCoordinator.scala:682) at kafka.coordinator.group.GroupCoordinator.$anonfun$handleGroupImmigration$1$adapted(GroupCoordinator.scala:682) at kafka.coordinator.group.GroupMetadataManager.$anonfun$doLoadGroupsAndOffsets$23(GroupMetadataManager.scala:646) at kafka.coordinator.group.GroupMetadataManager.$anonfun$doLoadGroupsAndOffsets$23$adapted(GroupMetadataManager.scala:641) at scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) at kafka.coordinator.group.GroupMetadataManager.doLoadGroupsAndOffsets(GroupMetadataManager.scala:641) at kafka.coordinator.group.GroupMetadataManager.loadGroupsAndOffsets(GroupMetadataManager.scala:500) at 
kafka.coordinator.group.GroupMetadataManager.$anonfun$scheduleLoadGroupAndOffsets$2(GroupMetadataManager.scala:491) at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114) at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code}
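The trace fails in `GroupMetadata.get`, which goes through Scala's `HashMap.apply`; that method throws `NoSuchElementException` when the key (here, the redacted consumer member id) is absent. A minimal Java illustration of this failure mode and a defensive alternative; this is illustrative only, not the actual Kafka code:

```java
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Optional;

public class MemberLookup {
    // Mimics Scala's HashMap.apply: throws when the key is missing,
    // which is what surfaces as "key not found: ..." in the trace above.
    static String strictGet(Map<String, String> members, String memberId) {
        String value = members.get(memberId);
        if (value == null) {
            throw new NoSuchElementException("key not found: " + memberId);
        }
        return value;
    }

    // Defensive variant: forces the caller to handle the absent-member case.
    static Optional<String> safeGet(Map<String, String> members, String memberId) {
        return Optional.ofNullable(members.get(memberId));
    }
}
```

A heartbeat completion that raced with group unloading could hit the strict path with a member id that is no longer in the map; the `Optional`-style lookup makes that case explicit instead of aborting offset loading.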
[jira] [Commented] (KAFKA-7447) Consumer offsets lost during leadership rebalance after bringing node back from clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959824#comment-16959824 ] Tim Van Laer commented on KAFKA-7447: - Thanks [~ijuma] for looking into this, well appreciated! I hit KAFKA-8896 at the same moment :) Let's upgrade to 2.2.2 first!
[jira] [Commented] (KAFKA-7447) Consumer offsets lost during leadership rebalance after bringing node back from clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959823#comment-16959823 ] Ismael Juma commented on KAFKA-7447: [~bchen225242], [~guozhang], [~hachikuji] Thoughts?
[jira] [Comment Edited] (KAFKA-7447) Consumer offsets lost during leadership rebalance after bringing node back from clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959803#comment-16959803 ] Tim Van Laer edited comment on KAFKA-7447 at 10/25/19 2:45 PM: --- During offset loading in the recovering broker, the following exception was thrown; I don't know whether it can cause the issue. {code:java} [2019-10-25 05:09:08,151] INFO [GroupMetadataManager brokerId=2] Scheduling loading of offsets and group metadata from __consumer_offsets-0 (kafka.coordinator.group.GroupMetadataManager) ... [2019-10-25 05:09:12,972] ERROR [GroupMetadataManager brokerId=2] Error loading offsets from __consumer_offsets-0 (kafka.coordinator.group.GroupMetadataManager) java.util.NoSuchElementException: key not found: redacted-kafkastreams-application-4cbd9546-7b7b-4436-942f-26abae8aeb19-StreamThread-1-consumer-d994fd76-b31e-44ad-a27a-6f43ff0e9e78 at scala.collection.MapLike.default(MapLike.scala:235) at scala.collection.MapLike.default$(MapLike.scala:234) at scala.collection.AbstractMap.default(Map.scala:63) at scala.collection.mutable.HashMap.apply(HashMap.scala:69) at kafka.coordinator.group.GroupMetadata.get(GroupMetadata.scala:203) at kafka.coordinator.group.GroupCoordinator.$anonfun$tryCompleteHeartbeat$1(GroupCoordinator.scala:927) at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251) at kafka.coordinator.group.GroupMetadata.inLock(GroupMetadata.scala:198) at kafka.coordinator.group.GroupCoordinator.tryCompleteHeartbeat(GroupCoordinator.scala:920) at kafka.coordinator.group.DelayedHeartbeat.tryComplete(DelayedHeartbeat.scala:34) at kafka.server.DelayedOperation.maybeTryComplete(DelayedOperation.scala:121) at kafka.server.DelayedOperationPurgatory$Watchers.tryCompleteWatched(DelayedOperation.scala:388) at kafka.server.DelayedOperationPurgatory.checkAndComplete(DelayedOperation.scala:294) at 
kafka.coordinator.group.GroupCoordinator.completeAndScheduleNextExpiration(GroupCoordinator.scala:737) at kafka.coordinator.group.GroupCoordinator.completeAndScheduleNextHeartbeatExpiration(GroupCoordinator.scala:730) at kafka.coordinator.group.GroupCoordinator.$anonfun$onGroupLoaded$3(GroupCoordinator.scala:677) at kafka.coordinator.group.GroupCoordinator.$anonfun$onGroupLoaded$3$adapted(GroupCoordinator.scala:677) at scala.collection.immutable.List.foreach(List.scala:392) at kafka.coordinator.group.GroupCoordinator.$anonfun$onGroupLoaded$1(GroupCoordinator.scala:677) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251) at kafka.coordinator.group.GroupMetadata.inLock(GroupMetadata.scala:198) at kafka.coordinator.group.GroupCoordinator.onGroupLoaded(GroupCoordinator.scala:670) at kafka.coordinator.group.GroupCoordinator.$anonfun$handleGroupImmigration$1(GroupCoordinator.scala:682) at kafka.coordinator.group.GroupCoordinator.$anonfun$handleGroupImmigration$1$adapted(GroupCoordinator.scala:682) at kafka.coordinator.group.GroupMetadataManager.$anonfun$doLoadGroupsAndOffsets$23(GroupMetadataManager.scala:646) at kafka.coordinator.group.GroupMetadataManager.$anonfun$doLoadGroupsAndOffsets$23$adapted(GroupMetadataManager.scala:641) at scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) at kafka.coordinator.group.GroupMetadataManager.doLoadGroupsAndOffsets(GroupMetadataManager.scala:641) at kafka.coordinator.group.GroupMetadataManager.loadGroupsAndOffsets(GroupMetadataManager.scala:500) at 
kafka.coordinator.group.GroupMetadataManager.$anonfun$scheduleLoadGroupAndOffsets$2(GroupMetadataManager.scala:491) at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114) at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} This very much looks like KAFKA-8896, which is fixed in 2.2.2. I'll give that a try first. was (Author: timvanlaer): During offset loading in the recovering broker, following exception
[jira] [Commented] (KAFKA-7447) Consumer offsets lost during leadership rebalance after bringing node back from clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959822#comment-16959822 ] Ismael Juma commented on KAFKA-7447: Thanks for the information. That looks similar to KAFKA-8896. > Consumer offsets lost during leadership rebalance after bringing node back > from clean shutdown > -- > > Key: KAFKA-7447 > URL: https://issues.apache.org/jira/browse/KAFKA-7447 > Project: Kafka > Issue Type: Bug >Affects Versions: 1.1.1, 2.0.0 >Reporter: Ben Isaacs >Priority: Major > > *Summary:* > * When 1 of my 3 brokers is cleanly shut down, consumption and production > continues as normal due to replication. (Consumers are rebalanced to the > replicas, and producers are rebalanced to the remaining brokers). However, > when the cleanly-shut-down broker comes back, after about 10 minutes, a > flurry of production errors occur and my consumers suddenly go back in time 2 > weeks, causing a long outage (12 hours+) as all messages are replayed on some > topics. > * The hypothesis is that the auto-leadership-rebalance is happening too > quickly after the downed broker returns, before it has had a chance to become > fully synchronised on all partitions. In particular, it seems that having > consumer offsets ahead of the most recent data on the topic that consumer was > following causes the consumer to be reset to 0. > *Expected:* > * bringing a node back from a clean shut down does not cause any consumers > to reset to 0. > *Actual:* > * I experience approximately 12 hours of partial outage triggered at the > point that auto leadership rebalance occurs, after a cleanly shut down node > returns. > *Workaround:* > * disable auto leadership rebalance entirely. > * manually rebalance it from time to time when all nodes and all partitions > are fully replicated. > *My Setup:* > * Kafka deployment with 3 brokers and 2 topics. > * Replication factor is 3, for all topics. > * min.isr is 2, for all topics. 
> * Zookeeper deployment with 3 instances. > * In the region of 10 to 15 consumers, with 2 user topics (and, of course, > the system topics such as consumer offsets). Consumer offsets has the > standard 50 partitions. The user topics have about 3000 partitions in total. > * Offset retention time of 7 days, and topic retention time of 14 days. > * Input rate ~1000 messages/sec. > * Deployment happens to be on Google compute engine. > *Related Stack Overflow Post:* > https://stackoverflow.com/questions/52367825/apache-kafka-loses-some-consumer-offsets-when-when-i-bounce-a-broker > It was suggested I open a ticket by "Muir" who says they have also > experienced this. > *Transcription of logs, showing the problem:* > Below, you can see chronologically sorted, interleaved, logs from the 3 > brokers. prod-kafka-2 is the node which was cleanly shut down and then > restarted. I filtered the messages only to those regarding > __consumer_offsets-29 because it's just too much to paste, otherwise. > ||Broker host||Broker ID|| > |prod-kafka-1|0| > |prod-kafka-2|1 (this one was restarted)| > |prod-kafka-3|2| > prod-kafka-2: (just starting up) > {code} > [2018-09-17 09:21:46,246] WARN [ReplicaFetcher replicaId=1, leaderId=2, > fetcherId=0] Based on follower's leader epoch, leader replied with an unknown > offset in __consumer_offsets-29. The initial fetch offset 0 will be used for > truncation. (kafka.server.ReplicaFetcherThread) > {code} > prod-kafka-3: (sees replica1 come back) > {code} > [2018-09-17 09:22:02,027] INFO [Partition __consumer_offsets-29 broker=2] > Expanding ISR from 0,2 to 0,2,1 (kafka.cluster.Partition) > {code} > prod-kafka-2: > {code} > [2018-09-17 09:22:33,892] INFO [GroupMetadataManager brokerId=1] Scheduling > unloading of offsets and group metadata from __consumer_offsets-29 > (kafka.coordinator.group.GroupMetadataManager) > [2018-09-17 09:22:33,902] INFO [GroupMetadataManager brokerId=1] Finished > unloading __consumer_offsets-29. 
Removed 0 cached offsets and 0 cached > groups. (kafka.coordinator.group.GroupMetadataManager) > [2018-09-17 09:24:03,287] INFO [ReplicaFetcherManager on broker 1] Removed > fetcher for partitions __consumer_offsets-29 > (kafka.server.ReplicaFetcherManager) > [2018-09-17 09:24:03,287] INFO [Partition __consumer_offsets-29 broker=1] > __consumer_offsets-29 starts at Leader Epoch 78 from offset 0. Previous > Leader Epoch was: 77 (kafka.cluster.Partition) > [2018-09-17 09:24:03,287] INFO [GroupMetadataManager brokerId=1] Scheduling > loading of offsets and group metadata from __consumer_offsets-29 > (kafka.coordinator.group.GroupMetadataManager) > [2018-09-17 09:24:03,288] INFO [GroupMetadataManager brokerId=1] Finished > loading offsets and gr
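The workaround in the KAFKA-7447 report (disable automatic leader rebalancing, rebalance manually once replicas are in sync) corresponds to broker configuration. A minimal server.properties sketch, with retention values inferred from the reporter's setup (7-day offsets, 14-day topics); these are illustrations of the described deployment, not tuning advice:

```properties
# Reporter's workaround: stop the controller from automatically moving
# leadership back to preferred replicas after a restarted broker returns
auto.leader.rebalance.enable=false
# Retention figures from the report (assumptions, not recommendations):
# 7 days of committed offsets, 14 days of topic data
offsets.retention.minutes=10080
log.retention.hours=336
```

With automatic rebalancing disabled, preferred leadership can be restored manually (e.g. via the kafka-preferred-replica-election tool of that era) once all partitions are fully replicated.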
[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching
[ https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959813#comment-16959813 ] Tom Lee commented on KAFKA-8950: This specific bug was new in 2.3.0, yep. IIRC the map that would get into an inconsistent state and cause all the trouble was introduced in 2.3.0. Looking at 0.10.2.2 specifically, I think the code as implemented would have been fine. If it's helpful, we've also had various issues with earlier versions of the client libs too but more often than not the issues were mitigated by config, upgrading client libs or system-level tuning: sysctls like tcp_retries2 & tcp_syn_retries are often set too high, misconfigured NICs can be an issue because of packet loss, stuff like that. Request timeouts for consumers in ~0.10 were extremely high because of some sort of coupling with group rebalances and iirc this didn't get fixed until 2.0.0. See https://cwiki.apache.org/confluence/display/KAFKA/KIP-266%3A+Fix+consumer+indefinite+blocking+behavior. Producers had similar but different issues with certain configuration options. Some of this was more difficult to work around directly without the 2.3.x upgrade. Not to say there are no more issues, but a custom build of 2.3.0 with Will's patch has been solid for us so far. By comparison, "vanilla" 2.3.0 would cause us trouble maybe once or twice a day. > KafkaConsumer stops fetching > > > Key: KAFKA-8950 > URL: https://issues.apache.org/jira/browse/KAFKA-8950 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 2.3.0 > Environment: linux >Reporter: Will James >Priority: Major > Fix For: 2.4.0, 2.3.1 > > > We have a KafkaConsumer consuming from a single partition with > enable.auto.commit set to true. > Very occasionally, the consumer goes into a broken state. 
It returns no > records from the broker with every poll, and from most of the Kafka metrics > in the consumer it looks like it is fully caught up to the end of the log. > We see that we are long polling for the max poll timeout, and that there is > zero lag. In addition, we see that the heartbeat rate stays unchanged from > before the issue begins (so the consumer stays a part of the consumer group). > In addition, from looking at the __consumer_offsets topic, it is possible to > see that the consumer is committing the same offset on the auto commit > interval, however, the offset does not move, and the lag from the broker's > perspective continues to increase. > The issue is only resolved by restarting our application (which restarts the > KafkaConsumer instance). > From a heap dump of an application in this state, I can see that the Fetcher > is in a state where it believes there are nodesWithPendingFetchRequests. > However, I can see the state of the fetch latency sensor, specifically, the > fetch rate, and see that the samples were not updated for a long period of > time (actually, precisely the amount of time that the problem in our > application was occurring, around 50 hours - we have alerting on other > metrics but not the fetch rate, so we didn't notice the problem until a > customer complained). > In this example, the consumer was processing around 40 messages per second, > with an average size of about 10kb, although most of the other examples of > this have happened with higher volume (250 messages / second, around 23kb per > message on average). > I have spent some time investigating the issue on our end, and will continue > to do so as time allows, however I wanted to raise this as an issue because > it may be affecting other people. > Please let me know if you have any questions or need additional information. > I doubt I can provide heap dumps unfortunately, but I can provide further > information as needed. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
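The system-level tuning Tom Lee mentions above (tcp_retries2, tcp_syn_retries often set too high) can be checked directly on a Linux broker or client host. A hedged sketch; the lowered values are illustrative, not recommendations from this thread:

```shell
# Read the kernel's TCP retry settings via procfs
# (the same values `sysctl net.ipv4.tcp_retries2` reports)
cat /proc/sys/net/ipv4/tcp_retries2    # retransmits on established connections (default 15)
cat /proc/sys/net/ipv4/tcp_syn_retries # SYN retransmits for new connections (default 6)
# Lowering them makes dead peers fail faster; example values only:
#   sysctl -w net.ipv4.tcp_retries2=8
#   sysctl -w net.ipv4.tcp_syn_retries=4
```

At the default of 15, tcp_retries2 can keep a broken connection alive for well over ten minutes before the kernel gives up, which is why these sysctls interact badly with client request timeouts.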
[jira] [Commented] (KAFKA-7447) Consumer offsets lost during leadership rebalance after bringing node back from clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959803#comment-16959803 ] Tim Van Laer commented on KAFKA-7447: - During offset loading in the recovering broker, the following exception was thrown; I don't know whether it can cause the issue. {code:java} [2019-10-25 05:09:08,151] INFO [GroupMetadataManager brokerId=2] Scheduling loading of offsets and group metadata from __consumer_offsets-0 (kafka.coordinator.group.GroupMetadataManager) ... [2019-10-25 05:09:12,972] ERROR [GroupMetadataManager brokerId=2] Error loading offsets from __consumer_offsets-0 (kafka.coordinator.group.GroupMetadataManager) java.util.NoSuchElementException: key not found: redacted-kafkastreams-application-4cbd9546-7b7b-4436-942f-26abae8aeb19-StreamThread-1-consumer-d994fd76-b31e-44ad-a27a-6f43ff0e9e78 at scala.collection.MapLike.default(MapLike.scala:235) at scala.collection.MapLike.default$(MapLike.scala:234) at scala.collection.AbstractMap.default(Map.scala:63) at scala.collection.mutable.HashMap.apply(HashMap.scala:69) at kafka.coordinator.group.GroupMetadata.get(GroupMetadata.scala:203) at kafka.coordinator.group.GroupCoordinator.$anonfun$tryCompleteHeartbeat$1(GroupCoordinator.scala:927) at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251) at kafka.coordinator.group.GroupMetadata.inLock(GroupMetadata.scala:198) at kafka.coordinator.group.GroupCoordinator.tryCompleteHeartbeat(GroupCoordinator.scala:920) at kafka.coordinator.group.DelayedHeartbeat.tryComplete(DelayedHeartbeat.scala:34) at kafka.server.DelayedOperation.maybeTryComplete(DelayedOperation.scala:121) at kafka.server.DelayedOperationPurgatory$Watchers.tryCompleteWatched(DelayedOperation.scala:388) at kafka.server.DelayedOperationPurgatory.checkAndComplete(DelayedOperation.scala:294) at 
kafka.coordinator.group.GroupCoordinator.completeAndScheduleNextExpiration(GroupCoordinator.scala:737) at kafka.coordinator.group.GroupCoordinator.completeAndScheduleNextHeartbeatExpiration(GroupCoordinator.scala:730) at kafka.coordinator.group.GroupCoordinator.$anonfun$onGroupLoaded$3(GroupCoordinator.scala:677) at kafka.coordinator.group.GroupCoordinator.$anonfun$onGroupLoaded$3$adapted(GroupCoordinator.scala:677) at scala.collection.immutable.List.foreach(List.scala:392) at kafka.coordinator.group.GroupCoordinator.$anonfun$onGroupLoaded$1(GroupCoordinator.scala:677) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251) at kafka.coordinator.group.GroupMetadata.inLock(GroupMetadata.scala:198) at kafka.coordinator.group.GroupCoordinator.onGroupLoaded(GroupCoordinator.scala:670) at kafka.coordinator.group.GroupCoordinator.$anonfun$handleGroupImmigration$1(GroupCoordinator.scala:682) at kafka.coordinator.group.GroupCoordinator.$anonfun$handleGroupImmigration$1$adapted(GroupCoordinator.scala:682) at kafka.coordinator.group.GroupMetadataManager.$anonfun$doLoadGroupsAndOffsets$23(GroupMetadataManager.scala:646) at kafka.coordinator.group.GroupMetadataManager.$anonfun$doLoadGroupsAndOffsets$23$adapted(GroupMetadataManager.scala:641) at scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158) at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158) at kafka.coordinator.group.GroupMetadataManager.doLoadGroupsAndOffsets(GroupMetadataManager.scala:641) at kafka.coordinator.group.GroupMetadataManager.loadGroupsAndOffsets(GroupMetadataManager.scala:500) at 
kafka.coordinator.group.GroupMetadataManager.$anonfun$scheduleLoadGroupAndOffsets$2(GroupMetadataManager.scala:491) at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114) at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} > Consumer offsets lost during leadership rebalance after bringing node back > from clean shutdown > -- > > Key: KAFKA-7447 >
[jira] [Commented] (KAFKA-7447) Consumer offsets lost during leadership rebalance after bringing node back from clean shutdown
[ https://issues.apache.org/jira/browse/KAFKA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959789#comment-16959789 ] Tim Van Laer commented on KAFKA-7447: - Looks like we ran into a similar issue. One of our brokers lost connection with zookeeper, rejoined the cluster and continued its regular behaviour. Around that time, two consumer groups unexpectedly restarted processing their complete input topic (auto.offset.reset=earliest). Offsets of both consumer groups are stored at partition 0 of __consumer_offsets (which has 50 partitions). We have 115 consumer groups; none of the other groups was impacted (their offsets are all stored on other partitions). The broker with the zookeeper connection glitch is also the leader of partition 0 of the __consumer_offsets topic. We're running Kafka 2.2.1, inter.broker.protocol.version=2.2 (so KAFKA-8069 looks like a different issue to me). > Consumer offsets lost during leadership rebalance after bringing node back > from clean shutdown > -- > > Key: KAFKA-7447 > URL: https://issues.apache.org/jira/browse/KAFKA-7447 > Project: Kafka > Issue Type: Bug >Affects Versions: 1.1.1, 2.0.0 >Reporter: Ben Isaacs >Priority: Major > > *Summary:* > * When 1 of my 3 brokers is cleanly shut down, consumption and production > continues as normal due to replication. (Consumers are rebalanced to the > replicas, and producers are rebalanced to the remaining brokers). However, > when the cleanly-shut-down broker comes back, after about 10 minutes, a > flurry of production errors occur and my consumers suddenly go back in time 2 > weeks, causing a long outage (12 hours+) as all messages are replayed on some > topics. > * The hypothesis is that the auto-leadership-rebalance is happening too > quickly after the downed broker returns, before it has had a chance to become > fully synchronised on all partitions. 
In particular, it seems that having > consumer offsets ahead of the most recent data on the topic that consumer was > following causes the consumer to be reset to 0. > *Expected:* > * bringing a node back from a clean shut down does not cause any consumers > to reset to 0. > *Actual:* > * I experience approximately 12 hours of partial outage triggered at the > point that auto leadership rebalance occurs, after a cleanly shut down node > returns. > *Workaround:* > * disable auto leadership rebalance entirely. > * manually rebalance it from time to time when all nodes and all partitions > are fully replicated. > *My Setup:* > * Kafka deployment with 3 brokers and 2 topics. > * Replication factor is 3, for all topics. > * min.isr is 2, for all topics. > * Zookeeper deployment with 3 instances. > * In the region of 10 to 15 consumers, with 2 user topics (and, of course, > the system topics such as consumer offsets). Consumer offsets has the > standard 50 partitions. The user topics have about 3000 partitions in total. > * Offset retention time of 7 days, and topic retention time of 14 days. > * Input rate ~1000 messages/sec. > * Deployment happens to be on Google compute engine. > *Related Stack Overflow Post:* > https://stackoverflow.com/questions/52367825/apache-kafka-loses-some-consumer-offsets-when-when-i-bounce-a-broker > It was suggested I open a ticket by "Muir" who says they have also > experienced this. > *Transcription of logs, showing the problem:* > Below, you can see chronologically sorted, interleaved, logs from the 3 > brokers. prod-kafka-2 is the node which was cleanly shut down and then > restarted. I filtered the messages only to those regarding > __consumer_offsets-29 because it's just too much to paste, otherwise. 
> ||Broker host||Broker ID|| > |prod-kafka-1|0| > |prod-kafka-2|1 (this one was restarted)| > |prod-kafka-3|2| > prod-kafka-2: (just starting up) > {code} > [2018-09-17 09:21:46,246] WARN [ReplicaFetcher replicaId=1, leaderId=2, > fetcherId=0] Based on follower's leader epoch, leader replied with an unknown > offset in __consumer_offsets-29. The initial fetch offset 0 will be used for > truncation. (kafka.server.ReplicaFetcherThread) > {code} > prod-kafka-3: (sees replica1 come back) > {code} > [2018-09-17 09:22:02,027] INFO [Partition __consumer_offsets-29 broker=2] > Expanding ISR from 0,2 to 0,2,1 (kafka.cluster.Partition) > {code} > prod-kafka-2: > {code} > [2018-09-17 09:22:33,892] INFO [GroupMetadataManager brokerId=1] Scheduling > unloading of offsets and group metadata from __consumer_offsets-29 > (kafka.coordinator.group.GroupMetadataManager) > [2018-09-17 09:22:33,902] INFO [GroupMetadataManager brokerId=1] Finished > unloading __consumer_offsets-29. Removed 0 cached offsets and 0 cached > groups. (kafka.coordinator.group.GroupMe
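Tim Van Laer's observation that only the groups stored on __consumer_offsets-0 were affected follows from how Kafka assigns a group to an offsets partition: roughly abs(groupId.hashCode) % 50, using Java string-hash semantics. A minimal Python sketch of that mapping for illustration (the helper names are mine, not Kafka's, and the Integer.MIN_VALUE edge case of Kafka's abs helper is glossed over):

```python
# Sketch: which __consumer_offsets partition stores a given group's offsets.
# Kafka computes abs(groupId.hashCode) % numPartitions with Java semantics;
# Java's String.hashCode is reproduced here so the result matches the broker.

def java_string_hashcode(s: str) -> int:
    """Java String.hashCode: h = 31*h + c over UTF-16 units, signed 32-bit."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # reinterpret as a signed 32-bit integer, as Java does
    return h - 0x1_0000_0000 if h >= 0x8000_0000 else h

def offsets_partition(group_id: str, num_partitions: int = 50) -> int:
    """Partition of __consumer_offsets that coordinates this group."""
    return abs(java_string_hashcode(group_id)) % num_partitions

print(offsets_partition("my-consumer-group"))
```

Running this over all 115 group ids in such a cluster would show exactly which groups share partition 0 with the two that reset, matching the reported blast radius.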
[jira] [Commented] (KAFKA-8950) KafkaConsumer stops fetching
[ https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959782#comment-16959782 ] Andrew Olson commented on KAFKA-8950: - Was this confirmed to be first introduced in 2.3.0? If so do you know the introductory JIRA issue? We may have seen similar in previous Kafka versions (0.10.2.1, 2.2.1), so curious if the bug is truly new or might have been around for a while. > KafkaConsumer stops fetching > > > Key: KAFKA-8950 > URL: https://issues.apache.org/jira/browse/KAFKA-8950 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 2.3.0 > Environment: linux >Reporter: Will James >Priority: Major > Fix For: 2.4.0, 2.3.1 > > > We have a KafkaConsumer consuming from a single partition with > enable.auto.commit set to true. > Very occasionally, the consumer goes into a broken state. It returns no > records from the broker with every poll, and from most of the Kafka metrics > in the consumer it looks like it is fully caught up to the end of the log. > We see that we are long polling for the max poll timeout, and that there is > zero lag. In addition, we see that the heartbeat rate stays unchanged from > before the issue begins (so the consumer stays a part of the consumer group). > In addition, from looking at the __consumer_offsets topic, it is possible to > see that the consumer is committing the same offset on the auto commit > interval, however, the offset does not move, and the lag from the broker's > perspective continues to increase. > The issue is only resolved by restarting our application (which restarts the > KafkaConsumer instance). > From a heap dump of an application in this state, I can see that the Fetcher > is in a state where it believes there are nodesWithPendingFetchRequests. 
> However, I can see the state of the fetch latency sensor, specifically, the > fetch rate, and see that the samples were not updated for a long period of > time (actually, precisely the amount of time that the problem in our > application was occurring, around 50 hours - we have alerting on other > metrics but not the fetch rate, so we didn't notice the problem until a > customer complained). > In this example, the consumer was processing around 40 messages per second, > with an average size of about 10kb, although most of the other examples of > this have happened with higher volume (250 messages / second, around 23kb per > message on average). > I have spent some time investigating the issue on our end, and will continue > to do so as time allows, however I wanted to raise this as an issue because > it may be affecting other people. > Please let me know if you have any questions or need additional information. > I doubt I can provide heap dumps unfortunately, but I can provide further > information as needed. 
[jira] [Resolved] (KAFKA-8992) Don't expose Errors in `RemoveMemberFromGroupResult`
[ https://issues.apache.org/jira/browse/KAFKA-8992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Gustafson resolved KAFKA-8992. Resolution: Fixed > Don't expose Errors in `RemoveMemberFromGroupResult` > > > Key: KAFKA-8992 > URL: https://issues.apache.org/jira/browse/KAFKA-8992 > Project: Kafka > Issue Type: Bug >Reporter: Jason Gustafson >Assignee: Boyang Chen >Priority: Blocker > Fix For: 2.4.0 > > > The type `RemoveMemberFromGroupResult` exposes `Errors` from `topLevelError`. > We should just get rid of this API.
[jira] [Commented] (KAFKA-8992) Don't expose Errors in `RemoveMemberFromGroupResult`
[ https://issues.apache.org/jira/browse/KAFKA-8992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959489#comment-16959489 ] ASF GitHub Bot commented on KAFKA-8992: --- hachikuji commented on pull request #7478: KAFKA-8992: (2.4 blocker) Redefine RemoveMembersFromGroup interface on Admin Client URL: https://github.com/apache/kafka/pull/7478 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Don't expose Errors in `RemoveMemberFromGroupResult` > > > Key: KAFKA-8992 > URL: https://issues.apache.org/jira/browse/KAFKA-8992 > Project: Kafka > Issue Type: Bug >Reporter: Jason Gustafson >Assignee: Boyang Chen >Priority: Blocker > Fix For: 2.4.0 > > > The type `RemoveMemberFromGroupResult` exposes `Errors` from `topLevelError`. > We should just get rid of this API.
[jira] [Created] (KAFKA-9104) Allow AdminClient to manage users resources
Jakub created KAFKA-9104: Summary: Allow AdminClient to manage users resources Key: KAFKA-9104 URL: https://issues.apache.org/jira/browse/KAFKA-9104 Project: Kafka Issue Type: Improvement Components: admin Affects Versions: 2.3.1 Reporter: Jakub Right now, AdminClient only supports * TopicResource * GroupResource * ClusterResource * BrokerResource It's important for our automation environment to also support User resources, since right now the only way to manage different users is to run {code:java} kafka-configs --alter --add-config 'producer_byte_rate=XXX,consumer_byte_rate=XXX' --entity-type users --entity-name Xx {code} which is not feasible to automate for each user.