[jira] [Commented] (KAFKA-10313) Out of range offset errors leading to offset reset
[ https://issues.apache.org/jira/browse/KAFKA-10313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197722#comment-17197722 ]

Varsha Abhinandan commented on KAFKA-10313:
--

The issue mentioned in KAFKA-9543 seems to coincide with segment rollover, and only on versions after 2.4.0. Unfortunately, we are facing this issue on version 2.2.2, and according to the logs it is not around the time of a segment rollover.

> Out of range offset errors leading to offset reset
> --
>
> Key: KAFKA-10313
> URL: https://issues.apache.org/jira/browse/KAFKA-10313
> Project: Kafka
> Issue Type: Bug
> Components: consumer
> Affects Versions: 2.2.2
> Reporter: Varsha Abhinandan
> Priority: Major
>
> Hi,
>
> We have occasionally been noticing offset resets on the Kafka consumer caused by an offset-out-of-range error. However, I don't see any errors in the broker logs: nothing related to leader election, replica lag, or Kafka broker pod restarts (only INFO-level logging was enabled in the prod environment).
>
> It appears from the logs that the out-of-range error occurred because the fetch offset was larger than the offset range on the broker. We noticed this happening multiple times on different consumers and stream apps in the prod environment, so it doesn't seem like an application bug and looks more like a bug in the KafkaConsumer. We would like to understand the cause of such errors.
>
> Also, none of the offset reset options are desirable. Choosing "earliest" creates a sudden huge lag (we have a retention of 24 hours), and choosing "latest" leads to data loss (the records produced between the out-of-range error and the moment the offset reset happens on the consumer). So we wonder whether it would be better for the Kafka client to restrict the 'auto.offset.reset' config to the offset-not-found case. For the out-of-range error, the Kafka client could automatically reset the offset to latest if the fetch offset is higher than the log end offset, to prevent data loss, and automatically reset it to earliest if the fetch offset is less than the log start offset.
>
> Following are the logs on the consumer side:
> {noformat}
> [2020-07-17T08:46:00,322Z] [INFO ] [pipeline-thread-12 ([prd453-19-event-upsert]-bo-pipeline-12)] [o.a.k.c.consumer.internals.Fetcher] [Consumer clientId=bo-indexer-group-prd453-19-on-c19-bo-indexer-upsert-blue-5d665bcbb7-dnvkh-pid-1-kafka-message-source-id-544, groupId=bo-indexer-group-prd453-19] Fetch offset 476383711 is out of range for partition prd453-19-event-upsert-32, resetting offset
> [2020-07-17T08:46:00,330Z] [INFO ] [pipeline-thread-12 ([prd453-19-event-upsert]-bo-pipeline-12)] [o.a.k.c.consumer.internals.Fetcher] [Consumer clientId=bo-indexer-group-prd453-19-on-c19-bo-indexer-upsert-blue-5d665bcbb7-dnvkh-pid-1-kafka-message-source-id-544, groupId=bo-indexer-group-prd453-19] Resetting offset for partition prd453-19-event-upsert-32 to offset 453223789.
> {noformat}
> Broker logs for the partition:
> {noformat}
> [2020-07-17T07:40:12,082Z] [INFO ] [kafka-scheduler-4] [kafka.log.Log] [Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Found deletable segments with base offsets [452091893] due to retention time 8640ms breach
> [2020-07-17T07:40:12,082Z] [INFO ] [kafka-scheduler-4] [kafka.log.Log] [Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Scheduling log segment [baseOffset 452091893, size 1073741693] for deletion.
> [2020-07-17T07:40:12,083Z] [INFO ] [kafka-scheduler-4] [kafka.log.Log] [Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Incrementing log start offset to 453223789
> [2020-07-17T07:41:12,083Z] [INFO ] [kafka-scheduler-7] [kafka.log.Log] [Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Deleting segment 452091893
> [2020-07-17T07:41:12,114Z] [INFO ] [kafka-scheduler-7] [kafka.log.LogSegment] Deleted log /data/kafka/prd453-19-event-upsert-32/000452091893.log.deleted.
> [2020-07-17T07:41:12,114Z] [INFO ] [kafka-scheduler-7] [kafka.log.LogSegment] Deleted offset index /data/kafka/prd453-19-event-upsert-32/000452091893.index.deleted.
> [2020-07-17T07:41:12,114Z] [INFO ] [kafka-scheduler-7] [kafka.log.LogSegment] Deleted time index /data/kafka/prd453-19-event-upsert-32/000452091893.timeindex.deleted.
> [2020-07-17T07:52:31,836Z] [INFO ] [data-plane-kafka-request-handler-3] [kafka.log.ProducerStateManager] [ProducerStateManager partition=prd453-19-event-upsert-32] Writing producer snapshot at offset 475609786
> [2020-07-17T07:52:31,836Z] [INFO ] [data-plane-kafka-request-handler-3] [kafka.log.Log] [Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Rolled new log segment at offset 475609786 in 1 ms.{noformat}
>
> {noformat}
> [2020-07-17T09:05:12,075Z] [INFO ] [kafka-scheduler-2] [kafka.log.Log] [Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Found deletable segments with base offsets [453223789] due to retention time 8640ms breach
> [2020-07-17T09:05:12,075Z] [INFO ] [kafka-scheduler-2] [kafka.log.Log] [Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Scheduling log segment [baseOffset 453223789, size 1073741355] for deletion.
> [2020-07-17T09:05:12,075Z] [INFO ] [kafka-scheduler-2] [kafka.log.Log] [Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Incrementing log start offset to 454388428
> [2020-07-17T09:06:12,075Z] [INFO ] [kafka-scheduler-6] [kafka.log.Log] [Log partition=prd453-19-event-upsert-32,
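The reset policy the reporter proposes (jump to latest when the fetch offset is beyond the log end offset, to earliest when it is before the log start offset) can be sketched as a small decision function. This is an illustration of the ticket's proposal, not the KafkaConsumer's actual behavior; the class and method names are hypothetical, and the offsets are taken from the logs above.

```java
public class OffsetResetPolicy {
    /**
     * Proposed policy from the ticket: if the fetch offset is past the log
     * end offset, reset to the log end (avoids replaying the whole retention
     * window); if it is before the log start offset, reset to the log start
     * (that data has been deleted anyway); otherwise keep the fetch offset.
     */
    static long chooseResetOffset(long fetchOffset, long logStart, long logEnd) {
        if (fetchOffset > logEnd) return logEnd;
        if (fetchOffset < logStart) return logStart;
        return fetchOffset; // in range: no reset needed
    }

    public static void main(String[] args) {
        // Offsets from the logs above: fetch offset 476383711, log start
        // 453223789, and the broker rolled a new segment at 475609786.
        // The proposal would reset forward to the log end instead of
        // jumping back to 453223789 as "earliest" did.
        System.out.println(chooseResetOffset(476383711L, 453223789L, 475609786L));
        System.out.println(chooseResetOffset(450000000L, 453223789L, 475609786L));
    }
}
```

An application can approximate this today by setting auto.offset.reset=none, catching OffsetOutOfRangeException in the poll loop, looking up consumer.beginningOffsets(...) and consumer.endOffsets(...) for the affected partitions, and calling consumer.seek(partition, chooseResetOffset(...)).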
[jira] [Comment Edited] (KAFKA-10313) Out of range offset errors leading to offset reset
[ https://issues.apache.org/jira/browse/KAFKA-10313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197722#comment-17197722 ]

Varsha Abhinandan edited comment on KAFKA-10313 at 9/17/20, 2:26 PM:
-

The issue mentioned in KAFKA-9543 seems to coincide with the segment rollover and also post 2.4.0 version. Unfortunately, we are facing this issue in 2.2.2 version and according to the logs it's not around the same time as segment rollover.

was (Author: varsha.abhinandan):
The issue mentioned in KAFKA-9543 seems to coincide with the segment rollover and also post 2.4.0 version. Unfortunately, we are facing this issue in 2.2.2 version and according to the logs it's not around the time of segment rollover.
[jira] [Updated] (KAFKA-10313) Out of range offset errors leading to offset reset
[ https://issues.apache.org/jira/browse/KAFKA-10313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varsha Abhinandan updated KAFKA-10313 -- Description.
[jira] [Updated] (KAFKA-10313) Out of range offset errors leading to offset reset
[ https://issues.apache.org/jira/browse/KAFKA-10313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varsha Abhinandan updated KAFKA-10313 -- Priority: Minor (was: Critical)
[jira] [Updated] (KAFKA-10313) Out of range offset errors leading to offset reset
[ https://issues.apache.org/jira/browse/KAFKA-10313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varsha Abhinandan updated KAFKA-10313 -- Priority: Major (was: Minor)
[jira] [Updated] (KAFKA-10313) Out of range offset errors leading to offset reset
[ https://issues.apache.org/jira/browse/KAFKA-10313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varsha Abhinandan updated KAFKA-10313 -- Priority: Critical (was: Major)
[jira] [Updated] (KAFKA-10313) Out of range offset errors leading to offset reset
[ https://issues.apache.org/jira/browse/KAFKA-10313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varsha Abhinandan updated KAFKA-10313: -- Description: Hi, We have been occasionally noticing offset resets happening on the Kafka consumer because of offset out of range error. However, I don't see any errors in the broker logs. No logs related to leader-election, replica lag, Kafka broker pod restarts or anything. (just info logs were enabled in the prod environment). It appeared from the logs that the out of range error was because of the fetch offset being larger than the offset range on the broker. Noticed this happening multiple times on different consumers, stream apps in the prod environment. So, it doesn't seem like an application bug and more like a bug in the KafkaConsumer. Would like to understand the cause for such errors. Also, none of the offset reset options are desirable. Choosing "earliest" creates a sudden huge lag (we have a retention of 24hours) and choosing "latest" leads to data loss (the records produced between the out of range error and when offset reset happens on the consumer). So, wondering if it is better for the Kafka client to separate out 'auto.offset.reset' config for just offset not found. For, out of range error maybe the Kafka client can automatically reset the offset to latest if the fetch offset is higher to prevent data loss. Also, automatically reset it to earliest if the fetch offset is lesser than the start offset. 
Following are the logs on the consumer side :
{noformat}
[2020-07-17T08:46:00,322Z] [INFO ] [pipeline-thread-12 ([prd453-19-event-upsert]-bo-pipeline-12)] [o.a.k.c.consumer.internals.Fetcher] [Consumer clientId=bo-indexer-group-prd453-19-on-c19-bo-indexer-upsert-blue-5d665bcbb7-dnvkh-pid-1-kafka-message-source-id-544, groupId=bo-indexer-group-prd453-19] Fetch offset 476383711 is out of range for partition prd453-19-event-upsert-32, resetting offset
[2020-07-17T08:46:00,330Z] [INFO ] [pipeline-thread-12 ([prd453-19-event-upsert]-bo-pipeline-12)] [o.a.k.c.consumer.internals.Fetcher] [Consumer clientId=bo-indexer-group-prd453-19-on-c19-bo-indexer-upsert-blue-5d665bcbb7-dnvkh-pid-1-kafka-message-source-id-544, groupId=bo-indexer-group-prd453-19] Resetting offset for partition prd453-19-event-upsert-32 to offset 453223789.
{noformat}
Broker logs for the partition :
{noformat}
[2020-07-17T07:40:12,082Z] [INFO ] [kafka-scheduler-4] [kafka.log.Log] [Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Found deletable segments with base offsets [452091893] due to retention time 8640ms breach
[2020-07-17T07:40:12,082Z] [INFO ] [kafka-scheduler-4] [kafka.log.Log] [Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Scheduling log segment [baseOffset 452091893, size 1073741693] for deletion.
[2020-07-17T07:40:12,083Z] [INFO ] [kafka-scheduler-4] [kafka.log.Log] [Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Incrementing log start offset to 453223789
[2020-07-17T07:41:12,083Z] [INFO ] [kafka-scheduler-7] [kafka.log.Log] [Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Deleting segment 452091893
[2020-07-17T07:41:12,114Z] [INFO ] [kafka-scheduler-7] [kafka.log.LogSegment] Deleted log /data/kafka/prd453-19-event-upsert-32/000452091893.log.deleted.
[2020-07-17T07:41:12,114Z] [INFO ] [kafka-scheduler-7] [kafka.log.LogSegment] Deleted offset index /data/kafka/prd453-19-event-upsert-32/000452091893.index.deleted.
[2020-07-17T07:41:12,114Z] [INFO ] [kafka-scheduler-7] [kafka.log.LogSegment] Deleted time index /data/kafka/prd453-19-event-upsert-32/000452091893.timeindex.deleted.
[2020-07-17T07:52:31,836Z] [INFO ] [data-plane-kafka-request-handler-3] [kafka.log.ProducerStateManager] [ProducerStateManager partition=prd453-19-event-upsert-32] Writing producer snapshot at offset 475609786
[2020-07-17T07:52:31,836Z] [INFO ] [data-plane-kafka-request-handler-3] [kafka.log.Log] [Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Rolled new log segment at offset 475609786 in 1 ms.
{noformat}
{noformat}
[2020-07-17T09:05:12,075Z] [INFO ] [kafka-scheduler-2] [kafka.log.Log] [Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Found deletable segments with base offsets [453223789] due to retention time 8640ms breach
[2020-07-17T09:05:12,075Z] [INFO ] [kafka-scheduler-2] [kafka.log.Log] [Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Scheduling log segment [baseOffset 453223789, size 1073741355] for deletion.
[2020-07-17T09:05:12,075Z] [INFO ] [kafka-scheduler-2] [kafka.log.Log] [Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Incrementing log start offset to 454388428
[2020-07-17T09:06:12,075Z] [INFO ] [kafka-scheduler-6] [kafka.log.Log] [Log partition=prd453-19-event-upsert-32,
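The reset behaviour proposed in the description (jump to the log end offset when the fetch offset overshoots it, and to the log start offset when it falls behind retention) can be sketched as a small client-side policy. The class below and its sample offsets are a hypothetical illustration of that proposal, not an existing KafkaConsumer API:

```java
// Hypothetical client-side policy for out-of-range fetch offsets:
// overshooting the log end resets to latest (no replay of 24h of
// retained data); falling behind the log start, i.e. the records were
// already deleted by retention, resets to earliest.
public class OffsetResetPolicy {
    static long resolve(long fetchOffset, long logStartOffset, long logEndOffset) {
        if (fetchOffset > logEndOffset) {
            return logEndOffset;       // ahead of the log: reset to latest
        }
        if (fetchOffset < logStartOffset) {
            return logStartOffset;     // behind retention: reset to earliest
        }
        return fetchOffset;            // still in range: keep the position
    }

    public static void main(String[] args) {
        // Fetch and start offsets borrowed from the report; the log end
        // offset here is made up for the example.
        System.out.println(resolve(476383711L, 453223789L, 476000000L)); // prints 476000000
    }
}
```

With the current client, a similar effect requires setting auto.offset.reset=none and handling the out-of-range exception by seeking manually; the sketch only captures the direction-dependent choice the report asks for.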
[jira] [Comment Edited] (KAFKA-8673) Kafka stream threads stuck while sending offsets to transaction preventing join group from completing
[ https://issues.apache.org/jira/browse/KAFKA-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892014#comment-16892014 ] Varsha Abhinandan edited comment on KAFKA-8673 at 7/29/19 7:34 AM: --- Hi [~guozhang], the threads were blocked on TransactionalRequestResult.await for about 4 days. The rebalance completed only after we restarted the processes which had the stream threads stuck on TransactionalRequestResult.await. was (Author: varsha.abhinandan): Hi [~guozhang], the threads were blocked on TransactionalRequestResult.await for about 4 days. The rebalance completed only after we restarted the processes which had the stream threads stuck on TransactionalRequestResult.await. > Kafka stream threads stuck while sending offsets to transaction preventing > join group from completing > - > > Key: KAFKA-8673 > URL: https://issues.apache.org/jira/browse/KAFKA-8673 > Project: Kafka > Issue Type: Bug > Components: consumer, streams >Affects Versions: 2.2.0 >Reporter: Varsha Abhinandan >Priority: Major > Attachments: Screen Shot 2019-07-11 at 12.08.09 PM.png > > > We observed a deadlock kind of a situation in our Kafka streams application > when we accidentally shut down all the brokers. The Kafka cluster was brought > back in about an hour. > Observations made : > # Normal Kafka producers and consumers started working fine after the > brokers were up again. > # The Kafka streams applications were stuck in the "rebalancing" state. > # The Kafka streams apps have exactly-once semantics enabled. > # The stack trace showed most of the stream threads sending the join group > requests to the group co-ordinator > # Few stream threads couldn't initiate the join group request since the call > to > [org.apache.kafka.clients.producer.KafkaProducer#sendOffsetsToTransaction|https://jira.corp.appdynamics.com/browse/ANLYTCS_ES-2062#sendOffsetsToTransaction%20which%20was%20hung] > was stuck. 
> # Seems like the join group requests were getting parked at the coordinator > since the expected members hadn't sent their own group join requests > # And after the timeout, the stream threads that were not stuck sent a new > join group requests. > # Maybe (6) and (7) is happening infinitely > # Sample values of the GroupMetadata object on the group co-ordinator - > [^Screen Shot 2019-07-11 at 12.08.09 PM.png] > # The list of notYetJoinedMembers client id's matched with the threads > waiting for their offsets to be committed. > {code:java} > [List(MemberMetadata(memberId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38-consumer-efa41349-3da1-43b6-9710-a662f68c63b1, > > clientId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38-consumer, > clientHost=/10.136.98.48, sessionTimeoutMs=15000, > rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), > MemberMetadata(memberId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36-consumer-7cc8e41b-ad98-4006-a18a-b22abe6350f4, > > clientId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36-consumer, > clientHost=/10.136.103.148, sessionTimeoutMs=15000, > rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), > MemberMetadata(memberId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27-consumer-9ffb96c1-3379-4cbd-bee1-5d4719fe6c9d, > > clientId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27-consumer, > clientHost=/10.136.98.48, sessionTimeoutMs=15000, > rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), > MemberMetadata(memberId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21-consumer-5b8a1f1f-84dd-4a87-86c8-7542c0e50d1f, > > clientId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21-consumer, > clientHost=/10.136.103.148, sessionTimeoutMs=15000, > 
rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), > MemberMetadata(memberId=metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33-consumer-3cb67ec9-c548-4386-962d-64d9772bf719, > > clientId=metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33-consumer, > clientHost=/10.136.99.15, sessionTimeoutMs=15000, > rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ))] > vabhinandan-mac:mp-jstack varsha.abhinandan$ cat jstack.* | grep > "metric-extractor-stream-c1-" | grep "StreamThread-" | grep "waiting on > condition" > "metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36" > #128 daemon prio=5 os_prio=0
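The jstack output quoted above shows the stuck threads parked on a java.util.concurrent.CountDownLatch$Sync inside TransactionalRequestResult.await. A minimal standalone sketch of why an untimed latch wait can park a thread for days, and how a timed wait differs; the latch here just stands in for a broker response that never arrives:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class AwaitDemo {
    public static void main(String[] args) throws InterruptedException {
        // A latch whose count never reaches zero, mimicking a broker
        // response that never arrives while the cluster is down.
        CountDownLatch done = new CountDownLatch(1);

        // An untimed done.await() would park this thread indefinitely --
        // the state the stream threads were stuck in for ~4 days.

        // A timed await returns false once the deadline passes, giving
        // the caller a chance to fail, log, or retry instead of hanging.
        boolean completed = done.await(100, TimeUnit.MILLISECONDS);
        System.out.println(completed); // prints false
    }
}
```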
[jira] [Updated] (KAFKA-8673) Kafka stream threads stuck while sending offsets to transaction preventing join group from completing
[ https://issues.apache.org/jira/browse/KAFKA-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varsha Abhinandan updated KAFKA-8673: - Description: We observed a deadlock-like situation in our Kafka Streams application when we accidentally shut down all the brokers. The Kafka cluster was brought back up in about an hour. Observations made :
# Normal Kafka producers and consumers started working fine after the brokers were up again.
# The Kafka Streams applications were stuck in the "rebalancing" state.
# The Kafka Streams apps have exactly-once semantics enabled.
# The stack traces showed most of the stream threads sending join-group requests to the group coordinator.
# A few stream threads couldn't initiate the join-group request because the call to [org.apache.kafka.clients.producer.KafkaProducer#sendOffsetsToTransaction|https://jira.corp.appdynamics.com/browse/ANLYTCS_ES-2062#sendOffsetsToTransaction%20which%20was%20hung] was stuck.
# The join-group requests seem to have been parked at the coordinator because the expected members hadn't sent their own join-group requests.
# After the timeout, the stream threads that were not stuck sent new join-group requests.
# Steps (6) and (7) may be repeating indefinitely.
# Sample values of the GroupMetadata object on the group coordinator - [^Screen Shot 2019-07-11 at 12.08.09 PM.png]
# The list of notYetJoinedMembers client ids matched the threads waiting for their offsets to be committed.
{code:java}
[List(MemberMetadata(memberId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38-consumer-efa41349-3da1-43b6-9710-a662f68c63b1, clientId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38-consumer, clientHost=/10.136.98.48, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ),
MemberMetadata(memberId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36-consumer-7cc8e41b-ad98-4006-a18a-b22abe6350f4, clientId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36-consumer, clientHost=/10.136.103.148, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ),
MemberMetadata(memberId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27-consumer-9ffb96c1-3379-4cbd-bee1-5d4719fe6c9d, clientId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27-consumer, clientHost=/10.136.98.48, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ),
MemberMetadata(memberId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21-consumer-5b8a1f1f-84dd-4a87-86c8-7542c0e50d1f, clientId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21-consumer, clientHost=/10.136.103.148, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ),
MemberMetadata(memberId=metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33-consumer-3cb67ec9-c548-4386-962d-64d9772bf719, clientId=metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33-consumer, clientHost=/10.136.99.15, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ))]

vabhinandan-mac:mp-jstack varsha.abhinandan$ cat jstack.* | grep "metric-extractor-stream-c1-" | grep "StreamThread-" | grep "waiting on condition"
"metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36" #128 daemon prio=5 os_prio=0 tid=0x7fc53c047800 nid=0xac waiting on condition [0x7fc4e68e7000]
"metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21" #93 daemon prio=5 os_prio=0 tid=0x7fc53c2b5800 nid=0x9d waiting on condition [0x7fc4e77f6000]
"metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33" #125 daemon prio=5 os_prio=0 tid=0x7fe18017c800 nid=0xbc waiting on condition [0x7fe12e7e8000]
"metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38" #154 daemon prio=5 os_prio=0 tid=0x7f27c4225800 nid=0xc4 waiting on condition [0x7f2772bec000]
"metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27" #143 daemon prio=5 os_prio=0 tid=0x7f27c4365800 nid=0xb9 waiting on condition [0x7f27736f7000]
{code}
11. Sample Stream Thread stuck -
{noformat}
"metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36" #128 daemon prio=5 os_prio=0 tid=0x7fc53c047800 nid=0xac waiting on condition [0x7fc4e68e7000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000723587580> (a java.util.concurrent.CountDownLatch$Sync)
at
[jira] [Updated] (KAFKA-8673) Kafka stream threads stuck while sending offsets to transaction preventing join group from completing
[ https://issues.apache.org/jira/browse/KAFKA-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varsha Abhinandan updated KAFKA-8673: - Description: We observed a deadlock kind of a situation in our Kafka streams application when we accidentally shut down all the brokers. The Kafka cluster was brought back in about an hour. Observations made : # Normal Kafka producers and consumers started working fine after the brokers were up again. # The Kafka streams applications were stuck in the "rebalancing" state. # The Kafka streams apps have exactly-once semantics enabled. # The stack trace showed most of the stream threads sending the join group requests to the group co-ordinator # Few stream threads couldn't initiate the join group request since the call to [org.apache.kafka.clients.producer.KafkaProducer#sendOffsetsToTransaction|https://jira.corp.appdynamics.com/browse/ANLYTCS_ES-2062#sendOffsetsToTransaction%20which%20was%20hung] was stuck. # Seems like the join group requests were getting parked at the coordinator since the expected members hadn't sent their own group join requests # And after the timeout, the stream threads that were not stuck sent a new join group requests. # Maybe (6) and (7) is happening infinitely # Sample values of the GroupMetadata object on the group co-ordinator - [^Screen Shot 2019-07-11 at 12.08.09 PM.png] # The list of notYetJoinedMembers client id's matched with the threads waiting for their offsets to be committed. 
{code:java}
[List(MemberMetadata(memberId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38-consumer-efa41349-3da1-43b6-9710-a662f68c63b1, clientId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38-consumer, clientHost=/10.136.98.48, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ),
MemberMetadata(memberId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36-consumer-7cc8e41b-ad98-4006-a18a-b22abe6350f4, clientId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36-consumer, clientHost=/10.136.103.148, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ),
MemberMetadata(memberId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27-consumer-9ffb96c1-3379-4cbd-bee1-5d4719fe6c9d, clientId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27-consumer, clientHost=/10.136.98.48, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ),
MemberMetadata(memberId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21-consumer-5b8a1f1f-84dd-4a87-86c8-7542c0e50d1f, clientId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21-consumer, clientHost=/10.136.103.148, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ),
MemberMetadata(memberId=metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33-consumer-3cb67ec9-c548-4386-962d-64d9772bf719, clientId=metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33-consumer, clientHost=/10.136.99.15, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ))]

vabhinandan-mac:mp-jstack varsha.abhinandan$ cat jstack.* | grep "metric-extractor-stream-c1-" | grep "StreamThread-" | grep "waiting on condition"
"metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36" #128 daemon prio=5 os_prio=0 tid=0x7fc53c047800 nid=0xac waiting on condition [0x7fc4e68e7000]
"metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21" #93 daemon prio=5 os_prio=0 tid=0x7fc53c2b5800 nid=0x9d waiting on condition [0x7fc4e77f6000]
"metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33" #125 daemon prio=5 os_prio=0 tid=0x7fe18017c800 nid=0xbc waiting on condition [0x7fe12e7e8000]
"metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38" #154 daemon prio=5 os_prio=0 tid=0x7f27c4225800 nid=0xc4 waiting on condition [0x7f2772bec000]
"metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27" #143 daemon prio=5 os_prio=0 tid=0x7f27c4365800 nid=0xb9 waiting on condition [0x7f27736f7000]
{code}
{noformat}
"metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36" #128 daemon prio=5 os_prio=0 tid=0x7fc53c047800 nid=0xac waiting on condition [0x7fc4e68e7000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x000723587580> (a java.util.concurrent.CountDownLatch$Sync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at
{noformat}
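A side note on the rebalanceTimeoutMs=2147483647 value in the MemberMetadata dump above: it is exactly Integer.MAX_VALUE. In these client versions Kafka Streams appears to override the consumer's max.poll.interval.ms to Integer.MAX_VALUE, and the coordinator uses that as the rebalance timeout, so a JoinGroup round will effectively never expire the stuck members. A minimal sketch of the arithmetic (only the constant is taken from the dump; the rest is plain Java):

```java
public class RebalanceTimeout {
    public static void main(String[] args) {
        // Value observed for rebalanceTimeoutMs in the GroupMetadata dump.
        int rebalanceTimeoutMs = 2147483647;

        // It is exactly Integer.MAX_VALUE, i.e. the largest representable timeout.
        System.out.println(rebalanceTimeoutMs == Integer.MAX_VALUE); // true

        // Expressed in whole days: the coordinator would wait roughly 24.8 days
        // before expiring a member that never rejoins, which is consistent with
        // the rebalance never completing while some threads are stuck.
        long days = rebalanceTimeoutMs / (1000L * 60 * 60 * 24);
        System.out.println(days); // 24
    }
}
```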
[jira] [Updated] (KAFKA-8673) Kafka stream threads stuck while sending offsets to transaction preventing join group from completing
[ https://issues.apache.org/jira/browse/KAFKA-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varsha Abhinandan updated KAFKA-8673:
-
Attachment: Screen Shot 2019-07-11 at 12.08.09 PM.png

> Kafka stream threads stuck while sending offsets to transaction preventing join group from completing
> -
>
> Key: KAFKA-8673
> URL: https://issues.apache.org/jira/browse/KAFKA-8673
> Project: Kafka
> Issue Type: Bug
> Components: consumer, streams
> Affects Versions: 2.2.0
> Reporter: Varsha Abhinandan
> Priority: Major
> Attachments: Screen Shot 2019-07-11 at 12.08.09 PM.png
>
> We observed a deadlock-like situation in our Kafka Streams application when we accidentally shut down all the brokers. The Kafka cluster was brought back up in about an hour. Observations made:
> # Normal Kafka producers and consumers started working fine after the brokers were up again.
> # The Kafka Streams applications were stuck in the "rebalancing" state.
> # The Kafka Streams apps have exactly-once semantics enabled.
> # The stack trace showed most of the stream threads sending JoinGroup requests to the group coordinator.
> # A few stream threads couldn't initiate the JoinGroup request since the call to org.apache.kafka.clients.producer.KafkaProducer#sendOffsetsToTransaction was stuck.
> # The JoinGroup requests appear to be parked at the coordinator because the expected members hadn't sent their own JoinGroup requests.
> # After the timeout, the stream threads that were not stuck sent new JoinGroup requests.
> # Steps (6) and (7) may be repeating indefinitely.
> # Sample values of the GroupMetadata object on the group coordinator: !Screen Shot 2019-07-11 at 12.08.09 PM.png!
> # The list of notYetJoinedMembers client ids matched the threads waiting for their offsets to be committed.
> {code:java}
> [List(MemberMetadata(memberId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38-consumer-efa41349-3da1-43b6-9710-a662f68c63b1, clientId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38-consumer, clientHost=/10.136.98.48, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ),
> MemberMetadata(memberId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36-consumer-7cc8e41b-ad98-4006-a18a-b22abe6350f4, clientId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36-consumer, clientHost=/10.136.103.148, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ),
> MemberMetadata(memberId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27-consumer-9ffb96c1-3379-4cbd-bee1-5d4719fe6c9d, clientId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27-consumer, clientHost=/10.136.98.48, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ),
> MemberMetadata(memberId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21-consumer-5b8a1f1f-84dd-4a87-86c8-7542c0e50d1f, clientId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21-consumer, clientHost=/10.136.103.148, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ),
> MemberMetadata(memberId=metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33-consumer-3cb67ec9-c548-4386-962d-64d9772bf719, clientId=metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33-consumer, clientHost=/10.136.99.15, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ))]
> vabhinandan-mac:mp-jstack varsha.abhinandan$ cat jstack.* | grep "metric-extractor-stream-c1-" | grep "StreamThread-" | grep "waiting on condition"
> "metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36" #128 daemon prio=5 os_prio=0 tid=0x7fc53c047800 nid=0xac waiting on condition [0x7fc4e68e7000]
> "metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21" #93 daemon prio=5 os_prio=0 tid=0x7fc53c2b5800 nid=0x9d waiting on condition [0x7fc4e77f6000]
> "metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33" #125 daemon prio=5 os_prio=0 tid=0x7fe18017c800 nid=0xbc waiting on condition [0x7fe12e7e8000]
> "metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38" #154 daemon prio=5 os_prio=0 tid=0x7f27c4225800 nid=0xc4 waiting on condition [0x7f2772bec000]
> "metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27" #143 daemon prio=5 os_prio=0 tid=0x7f27c4365800 nid=0xb9 waiting on condition [0x7f27736f7000]
> {code}
[jira] [Created] (KAFKA-8673) Kafka stream threads stuck while sending offsets to transaction preventing join group from completing
Varsha Abhinandan created KAFKA-8673: Summary: Kafka stream threads stuck while sending offsets to transaction preventing join group from completing Key: KAFKA-8673 URL: https://issues.apache.org/jira/browse/KAFKA-8673 Project: Kafka Issue Type: Bug Components: consumer, streams Affects Versions: 2.2.0 Reporter: Varsha Abhinandan -- This message was sent by Atlassian JIRA (v7.6.14#76016)