[jira] [Commented] (KAFKA-10313) Out of range offset errors leading to offset reset

2020-09-17 Thread Varsha Abhinandan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-10313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197722#comment-17197722
 ] 

Varsha Abhinandan commented on KAFKA-10313:
---

The issue mentioned in KAFKA-9543 seems to coincide with segment rollover and 
also to be reported only on versions 2.4.0 and later. Unfortunately, we are 
facing this issue on version 2.2.2, and according to the logs it is not around 
the time of a segment rollover. 


[jira] [Comment Edited] (KAFKA-10313) Out of range offset errors leading to offset reset

2020-09-17 Thread Varsha Abhinandan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-10313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197722#comment-17197722
 ] 

Varsha Abhinandan edited comment on KAFKA-10313 at 9/17/20, 2:26 PM:
-

The issue mentioned in KAFKA-9543 seems to coincide with segment rollover and 
also to be reported only on versions 2.4.0 and later. Unfortunately, we are 
facing this issue on version 2.2.2, and according to the logs it does not occur 
around the same time as a segment rollover. 


was (Author: varsha.abhinandan):
The issue mentioned in KAFKA-9543 seems to coincide with the segment rollover 
and also post 2.4.0 version. Unfortunately, we are facing this issue in 2.2.2 
version and according to the logs it's not around the time of segment rollover. 


[jira] [Updated] (KAFKA-10313) Out of range offset errors leading to offset reset

2020-07-27 Thread Varsha Abhinandan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varsha Abhinandan updated KAFKA-10313:
--
Description: 
Hi,
  
 We have been occasionally noticing offset resets on the Kafka consumer because 
of an offset-out-of-range error. However, I don't see any errors in the broker 
logs: no logs related to leader election, replica lag, Kafka broker pod 
restarts, or anything else (only INFO-level logging was enabled in the prod 
environment).
  
 It appeared from the logs that the out-of-range error occurred because the 
fetch offset was larger than the offset range on the broker. We noticed this 
happening multiple times on different consumers and stream apps in the prod 
environment, so it does not look like an application bug; it looks more like a 
bug in the KafkaConsumer. We would like to understand the cause of such errors.
  
 Also, none of the offset reset options are desirable. Choosing "earliest" 
creates a sudden huge lag (we have a retention of 24 hours), and choosing 
"latest" leads to data loss (the records produced between the out-of-range 
error and the offset reset on the consumer are lost). So we are wondering 
whether it would be better for the Kafka client to apply the 
'auto.offset.reset' config only to the case where no offset is found. For an 
out-of-range error, the Kafka client could instead automatically reset the 
offset to latest when the fetch offset is higher than the log end offset (to 
prevent data loss), and to earliest when the fetch offset is lower than the 
log start offset. 
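As an illustration of the proposal above, here is a minimal sketch (not part of 
the original report) of how an application can already approximate this 
behaviour by setting auto.offset.reset=none and handling the out-of-range case 
itself. The bootstrap servers, topic, and group id below are placeholders, and 
it assumes the group already has committed offsets (with "none", a missing 
offset would instead raise NoOffsetForPartitionException):
{code:java}
import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetOutOfRangeException;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualOffsetResetExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");           // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Disable the built-in reset policy so the application decides what to do.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "none");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("example-topic"));   // placeholder
            while (true) {
                try {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    records.forEach(record -> {
                        // process the record
                    });
                } catch (OffsetOutOfRangeException e) {
                    // Rejected fetch offsets, keyed by partition.
                    Map<TopicPartition, Long> rejected = e.offsetOutOfRangePartitions();
                    Map<TopicPartition, Long> logStart = consumer.beginningOffsets(rejected.keySet());
                    Map<TopicPartition, Long> logEnd = consumer.endOffsets(rejected.keySet());
                    for (Map.Entry<TopicPartition, Long> entry : rejected.entrySet()) {
                        TopicPartition tp = entry.getKey();
                        long fetchOffset = entry.getValue();
                        // Fetch offset beyond the log end: resume from the latest offset.
                        // Fetch offset before the log start: resume from the earliest offset.
                        long target = fetchOffset > logEnd.get(tp) ? logEnd.get(tp) : logStart.get(tp);
                        consumer.seek(tp, target);
                    }
                }
            }
        }
    }
}
{code}
The proposal in this ticket is essentially for the client to perform the 
equivalent of the seek logic above automatically, instead of applying a single 
auto.offset.reset policy to both the offset-not-found and out-of-range cases.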
  
 Following are the logs on the consumer side :
{noformat}
[2020-07-17T08:46:00,322Z] [INFO ] [pipeline-thread-12 
([prd453-19-event-upsert]-bo-pipeline-12)] [o.a.k.c.consumer.internals.Fetcher] 
[Consumer 
clientId=bo-indexer-group-prd453-19-on-c19-bo-indexer-upsert-blue-5d665bcbb7-dnvkh-pid-1-kafka-message-source-id-544,
 groupId=bo-indexer-group-prd453-19] Fetch offset 476383711 is out of range for 
partition prd453-19-event-upsert-32, resetting offset

[2020-07-17T08:46:00,330Z] [INFO ] [pipeline-thread-12 
([prd453-19-event-upsert]-bo-pipeline-12)] [o.a.k.c.consumer.internals.Fetcher] 
[Consumer 
clientId=bo-indexer-group-prd453-19-on-c19-bo-indexer-upsert-blue-5d665bcbb7-dnvkh-pid-1-kafka-message-source-id-544,
 groupId=bo-indexer-group-prd453-19] Resetting offset for partition 
prd453-19-event-upsert-32 to offset 453223789.
  {noformat}
Broker logs for the partition :
{noformat}
[2020-07-17T07:40:12,082Z]  [INFO ]  [kafka-scheduler-4]  [kafka.log.Log]  [Log 
partition=prd453-19-event-upsert-32, dir=/data/kafka] Found deletable segments 
with base offsets [452091893] due to retention time 8640ms breach
 [2020-07-17T07:40:12,082Z]  [INFO ]  [kafka-scheduler-4]  [kafka.log.Log]  
[Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Scheduling log 
segment [baseOffset 452091893, size 1073741693] for deletion.
 [2020-07-17T07:40:12,083Z]  [INFO ]  [kafka-scheduler-4]  [kafka.log.Log]  
[Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Incrementing log 
start offset to 453223789
 [2020-07-17T07:41:12,083Z]  [INFO ]  [kafka-scheduler-7]  [kafka.log.Log]  
[Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Deleting segment 
452091893
 [2020-07-17T07:41:12,114Z]  [INFO ]  [kafka-scheduler-7]  
[kafka.log.LogSegment]  Deleted log 
/data/kafka/prd453-19-event-upsert-32/000452091893.log.deleted.
 [2020-07-17T07:41:12,114Z]  [INFO ]  [kafka-scheduler-7]  
[kafka.log.LogSegment]  Deleted offset index 
/data/kafka/prd453-19-event-upsert-32/000452091893.index.deleted.
 [2020-07-17T07:41:12,114Z]  [INFO ]  [kafka-scheduler-7]  
[kafka.log.LogSegment]  Deleted time index 
/data/kafka/prd453-19-event-upsert-32/000452091893.timeindex.deleted.
 [2020-07-17T07:52:31,836Z]  [INFO ]  [data-plane-kafka-request-handler-3]  
[kafka.log.ProducerStateManager]  [ProducerStateManager 
partition=prd453-19-event-upsert-32] Writing producer snapshot at offset 
475609786
 [2020-07-17T07:52:31,836Z]  [INFO ]  [data-plane-kafka-request-handler-3]  
[kafka.log.Log]  [Log partition=prd453-19-event-upsert-32, dir=/data/kafka] 
Rolled new log segment at offset 475609786 in 1 ms.{noformat}
 
{noformat}
[2020-07-17T09:05:12,075Z]  [INFO ]  [kafka-scheduler-2]  [kafka.log.Log]  [Log 
partition=prd453-19-event-upsert-32, dir=/data/kafka] Found deletable segments 
with base offsets [453223789] due to retention time 8640ms breach
 [2020-07-17T09:05:12,075Z]  [INFO ]  [kafka-scheduler-2]  [kafka.log.Log]  
[Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Scheduling log 
segment [baseOffset 453223789, size 1073741355] for deletion.
 [2020-07-17T09:05:12,075Z]  [INFO ]  [kafka-scheduler-2]  [kafka.log.Log]  
[Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Incrementing log 
start offset to 454388428
 [2020-07-17T09:06:12,075Z]  [INFO ]  [kafka-scheduler-6]  [kafka.log.Log]  
[Log partition=prd453-19-event-upsert-32, 

[jira] [Updated] (KAFKA-10313) Out of range offset errors leading to offset reset

2020-07-27 Thread Varsha Abhinandan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varsha Abhinandan updated KAFKA-10313:
--
Priority: Minor  (was: Critical)


[jira] [Updated] (KAFKA-10313) Out of range offset errors leading to offset reset

2020-07-27 Thread Varsha Abhinandan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varsha Abhinandan updated KAFKA-10313:
--
Priority: Major  (was: Minor)


[jira] [Updated] (KAFKA-10313) Out of range offset errors leading to offset reset

2020-07-27 Thread Varsha Abhinandan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varsha Abhinandan updated KAFKA-10313:
--
Priority: Critical  (was: Major)




[jira] [Created] (KAFKA-10313) Out of range offset errors leading to offset reset

2020-07-27 Thread Varsha Abhinandan (Jira)
Varsha Abhinandan created KAFKA-10313:
-

 Summary: Out of range offset errors leading to offset reset
 Key: KAFKA-10313
 URL: https://issues.apache.org/jira/browse/KAFKA-10313
 Project: Kafka
  Issue Type: Bug
  Components: consumer
Affects Versions: 2.2.2
Reporter: Varsha Abhinandan


Hi,
 
We have been occasionally noticing offset resets on the Kafka consumer because 
of an offset-out-of-range error. However, I don't see any errors in the broker 
logs: no logs related to leader election, replica lag, Kafka broker pod 
restarts, or anything else (only INFO-level logging was enabled in the prod 
environment).
 
It appeared from the logs that the out-of-range error occurred because the 
fetch offset was larger than the offset range on the broker. We noticed this 
happening multiple times on different consumers and stream apps in the prod 
environment, so it does not look like an application bug; it looks more like a 
bug in the KafkaConsumer. We would like to understand the cause of such errors.
 
Also, none of the offset reset options are desirable. Choosing "earliest" 
creates a sudden huge lag (we have a retention of 24 hours), and choosing 
"latest" leads to data loss (the records produced between the out-of-range 
error and the offset reset on the consumer are lost). So we are wondering 
whether it would be better for the Kafka client to apply the 
'auto.offset.reset' config only to the case where no offset is found. For an 
out-of-range error, the Kafka client could instead automatically reset the 
offset to latest when the fetch offset is higher than the log end offset (to 
prevent data loss), and to earliest when the fetch offset is lower than the 
log start offset. 
 
 
Following are the logs on the consumer side :
 
[2020-07-17T08:46:00,322Z] [INFO ] [pipeline-thread-12 
([prd453-19-event-upsert]-bo-pipeline-12)] [o.a.k.c.consumer.internals.Fetcher] 
[Consumer 
clientId=bo-indexer-group-prd453-19-on-c19-bo-indexer-upsert-blue-5d665bcbb7-dnvkh-pid-1-kafka-message-source-id-544,
 groupId=bo-indexer-group-prd453-19] Fetch offset 476383711 is out of range for 
partition prd453-19-event-upsert-32, resetting offset[2020-07-17T08:46:00,330Z] 
[INFO ] [pipeline-thread-12 ([prd453-19-event-upsert]-bo-pipeline-12)] 
[o.a.k.c.consumer.internals.Fetcher] [Consumer 
clientId=bo-indexer-group-prd453-19-on-c19-bo-indexer-upsert-blue-5d665bcbb7-dnvkh-pid-1-kafka-message-source-id-544,
 groupId=bo-indexer-group-prd453-19] Resetting offset for partition 
prd453-19-event-upsert-32 to offset 453223789.
 
Broker logs for the partition :
_[2020-07-17T07:40:12,082Z]  [INFO ]  [kafka-scheduler-4]  [kafka.log.Log]  
[Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Found deletable 
segments with base offsets [452091893] due to retention time 8640ms breach_
_[2020-07-17T07:40:12,082Z]  [INFO ]  [kafka-scheduler-4]  [kafka.log.Log]  
[Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Scheduling log 
segment [baseOffset 452091893, size 1073741693] for deletion._
_[2020-07-17T07:40:12,083Z]  [INFO ]  [kafka-scheduler-4]  [kafka.log.Log]  
[Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Incrementing log 
start offset to 453223789_
_[2020-07-17T07:41:12,083Z]  [INFO ]  [kafka-scheduler-7]  [kafka.log.Log]  
[Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Deleting segment 
452091893_
_[2020-07-17T07:41:12,114Z]  [INFO ]  [kafka-scheduler-7]  
[kafka.log.LogSegment]  Deleted log 
/data/kafka/prd453-19-event-upsert-32/000452091893.log.deleted._
_[2020-07-17T07:41:12,114Z]  [INFO ]  [kafka-scheduler-7]  
[kafka.log.LogSegment]  Deleted offset index 
/data/kafka/prd453-19-event-upsert-32/000452091893.index.deleted._
_[2020-07-17T07:41:12,114Z]  [INFO ]  [kafka-scheduler-7]  
[kafka.log.LogSegment]  Deleted time index 
/data/kafka/prd453-19-event-upsert-32/000452091893.timeindex.deleted._
_[2020-07-17T07:52:31,836Z]  [INFO ]  [data-plane-kafka-request-handler-3]  
[kafka.log.ProducerStateManager]  [ProducerStateManager 
partition=prd453-19-event-upsert-32] Writing producer snapshot at offset 
475609786_
_[2020-07-17T07:52:31,836Z]  [INFO ]  [data-plane-kafka-request-handler-3]  
[kafka.log.Log]  [Log partition=prd453-19-event-upsert-32, dir=/data/kafka] 
Rolled new log segment at offset 475609786 in 1 ms._

_[2020-07-17T09:05:12,075Z]  [INFO ]  [kafka-scheduler-2]  [kafka.log.Log]  
[Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Found deletable 
segments with base offsets [453223789] due to retention time 8640ms breach_
_[2020-07-17T09:05:12,075Z]  [INFO ]  [kafka-scheduler-2]  [kafka.log.Log]  
[Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Scheduling log 
segment [baseOffset 453223789, size 1073741355] for deletion._
_[2020-07-17T09:05:12,075Z]  [INFO ]  [kafka-scheduler-2]  [kafka.log.Log]  
[Log partition=prd453-19-event-upsert-32, dir=/data/kafka] Incrementing log 
start 

[jira] [Comment Edited] (KAFKA-8673) Kafka stream threads stuck while sending offsets to transaction preventing join group from completing

2019-07-29 Thread Varsha Abhinandan (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892014#comment-16892014
 ] 

Varsha Abhinandan edited comment on KAFKA-8673 at 7/29/19 7:34 AM:
---

Hi [~guozhang], the threads were blocked on TransactionalRequestResult.await 
for about 4 days. The rebalance completed only after we restarted the processes 
which had the stream threads stuck on TransactionalRequestResult.await.


was (Author: varsha.abhinandan):
Hi [~guozhang], the threads were blocked on TransactionalRequestResult.await 
for about 4 days. The rebalance completed only after we restarted the processes 
which had the stream threads stuck on TransactionalRequestResult.await. 
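For context on where those threads were parked, below is a minimal sketch (not 
part of the original report) of the exactly-once consume-transform-produce loop 
in which KafkaProducer#sendOffsetsToTransaction is called; that call waits on a 
TransactionalRequestResult internally, which is the await referred to above. 
Topic names, ids, and bootstrap servers are placeholders, and the 
sendOffsetsToTransaction(Map, String) overload shown matches the 2.2.x client 
discussed in this thread:
{code:java}
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class EosLoopSketch {
    public static void main(String[] args) {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");          // placeholder
        producerProps.put("transactional.id", "example-transactional-id"); // placeholder
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");          // placeholder
        consumerProps.put("group.id", "example-group");                    // placeholder
        consumerProps.put("enable.auto.commit", "false");
        consumerProps.put("isolation.level", "read_committed");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            producer.initTransactions();
            consumer.subscribe(Collections.singletonList("input-topic"));  // placeholder

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) {
                    continue;
                }
                producer.beginTransaction();
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> record : records) {
                    producer.send(new ProducerRecord<>("output-topic", record.key(), record.value()));
                    offsets.put(new TopicPartition(record.topic(), record.partition()),
                                new OffsetAndMetadata(record.offset() + 1));
                }
                // Adds the consumed offsets to the ongoing transaction and waits for the
                // broker's acknowledgement; this is the call the blocked threads were in.
                producer.sendOffsetsToTransaction(offsets, "example-group"); // placeholder group id
                producer.commitTransaction();
            }
        }
    }
}
{code}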

 

 

[jira] [Commented] (KAFKA-8673) Kafka stream threads stuck while sending offsets to transaction preventing join group from completing

2019-07-24 Thread Varsha Abhinandan (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892014#comment-16892014
 ] 

Varsha Abhinandan commented on KAFKA-8673:
--

Hi [~guozhang], the threads were blocked on TransactionalRequestResult.await 
for about 4 days. The rebalance completed only after we restarted the processes 
which had the stream threads stuck on TransactionalRequestResult.await. 

 

 


[jira] [Updated] (KAFKA-8673) Kafka stream threads stuck while sending offsets to transaction preventing join group from completing

2019-07-16 Thread Varsha Abhinandan (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varsha Abhinandan updated KAFKA-8673:
-
Description: 
We observed a deadlock-like situation in our Kafka Streams application when we 
accidentally shut down all the brokers. The Kafka cluster was brought back up 
in about an hour. 

Observations made:
 # Normal Kafka producers and consumers started working fine after the brokers 
were up again. 
 # The Kafka Streams applications were stuck in the "rebalancing" state.
 # The Kafka Streams apps have exactly-once semantics enabled.
 # The stack traces showed most of the stream threads sending join group 
requests to the group coordinator.
 # A few stream threads couldn't initiate the join group request since the call 
to 
[org.apache.kafka.clients.producer.KafkaProducer#sendOffsetsToTransaction|https://jira.corp.appdynamics.com/browse/ANLYTCS_ES-2062#sendOffsetsToTransaction%20which%20was%20hung]
 was stuck.
 # It seems the join group requests were parked at the coordinator since the 
expected members hadn't sent their own join group requests.
 # After the timeout, the stream threads that were not stuck sent new join 
group requests.
 # Steps (6) and (7) may be repeating indefinitely.
 # Sample values of the GroupMetadata object on the group coordinator - 
[^Screen Shot 2019-07-11 at 12.08.09 PM.png]
 # The client ids in the notYetJoinedMembers list matched the threads waiting 
for their offsets to be committed. 
{code:java}
[List(MemberMetadata(memberId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38-consumer-efa41349-3da1-43b6-9710-a662f68c63b1, clientId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38-consumer, clientHost=/10.136.98.48, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), 
MemberMetadata(memberId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36-consumer-7cc8e41b-ad98-4006-a18a-b22abe6350f4, clientId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36-consumer, clientHost=/10.136.103.148, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), 
MemberMetadata(memberId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27-consumer-9ffb96c1-3379-4cbd-bee1-5d4719fe6c9d, clientId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27-consumer, clientHost=/10.136.98.48, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), 
MemberMetadata(memberId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21-consumer-5b8a1f1f-84dd-4a87-86c8-7542c0e50d1f, clientId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21-consumer, clientHost=/10.136.103.148, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), 
MemberMetadata(memberId=metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33-consumer-3cb67ec9-c548-4386-962d-64d9772bf719, clientId=metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33-consumer, clientHost=/10.136.99.15, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ))]

vabhinandan-mac:mp-jstack varsha.abhinandan$ cat jstack.* | grep "metric-extractor-stream-c1-" | grep "StreamThread-" | grep "waiting on condition"
"metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36" #128 daemon prio=5 os_prio=0 tid=0x7fc53c047800 nid=0xac waiting on condition [0x7fc4e68e7000]
"metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21" #93 daemon prio=5 os_prio=0 tid=0x7fc53c2b5800 nid=0x9d waiting on condition [0x7fc4e77f6000]
"metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33" #125 daemon prio=5 os_prio=0 tid=0x7fe18017c800 nid=0xbc waiting on condition [0x7fe12e7e8000]
"metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38" #154 daemon prio=5 os_prio=0 tid=0x7f27c4225800 nid=0xc4 waiting on condition [0x7f2772bec000]
"metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27" #143 daemon prio=5 os_prio=0 tid=0x7f27c4365800 nid=0xb9 waiting on condition [0x7f27736f7000]
{code}
11. Sample stack trace of a stuck stream thread - 
{noformat}
"metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36" #128 daemon prio=5 os_prio=0 tid=0x7fc53c047800 nid=0xac waiting on condition [0x7fc4e68e7000]
 java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for <0x000723587580> (a java.util.concurrent.CountDownLatch$Sync)
 at 
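
To make observation (5) concrete, the sketch below is a minimal consume-transform-produce loop written directly against the producer/consumer clients. It is not the Kafka Streams internal code path, and the broker address, topic names, group id and transactional id are made-up placeholders. It only shows where KafkaProducer#sendOffsetsToTransaction sits in an exactly-once loop: the call has to round-trip its requests through the coordinators and, as far as we can tell in this client version, waits on an internal latch with no effective client-side timeout, which would explain the threads parked on a CountDownLatch above.
{code:java}
// Illustrative sketch only (hypothetical topics, group id and configs); not the Streams internals.
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class EosLoopSketch {
    public static void main(String[] args) {
        Properties pp = new Properties();
        pp.put("bootstrap.servers", "broker:9092");           // placeholder
        pp.put("transactional.id", "metric-extractor-txn-0"); // placeholder
        pp.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        pp.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        Properties cp = new Properties();
        cp.put("bootstrap.servers", "broker:9092");           // placeholder
        cp.put("group.id", "metric-extractor-stream-c1");     // placeholder
        cp.put("enable.auto.commit", "false");
        cp.put("isolation.level", "read_committed");
        cp.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        cp.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(pp);
             KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(cp)) {
            producer.initTransactions();
            consumer.subscribe(Collections.singletonList("input-topic")); // placeholder

            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(100));
                if (records.isEmpty()) {
                    continue;
                }
                producer.beginTransaction();
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    producer.send(new ProducerRecord<>("output-topic", record.key(), record.value()));
                    offsets.put(new TopicPartition(record.topic(), record.partition()),
                                new OffsetAndMetadata(record.offset() + 1));
                }
                // This is the call the stuck StreamThreads were parked in: it registers the
                // consumed offsets with the transaction and waits for the coordinator's
                // response, so it cannot return while the brokers are unreachable.
                producer.sendOffsetsToTransaction(offsets, "metric-extractor-stream-c1");
                producer.commitTransaction();
            }
        }
    }
}
{code}
Because a thread stuck in that call never returns to send its JoinGroup request, the coordinator keeps it in notYetJoinedMembers and the rest of the group stays in the rebalance.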

[jira] [Updated] (KAFKA-8673) Kafka stream threads stuck while sending offsets to transaction preventing join group from completing

2019-07-16 Thread Varsha Abhinandan (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varsha Abhinandan updated KAFKA-8673:
-
Attachment: Screen Shot 2019-07-11 at 12.08.09 PM.png

> Kafka stream threads stuck while sending offsets to transaction preventing 
> join group from completing
> -
>
> Key: KAFKA-8673
> URL: https://issues.apache.org/jira/browse/KAFKA-8673
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer, streams
>Affects Versions: 2.2.0
>Reporter: Varsha Abhinandan
>Priority: Major
> Attachments: Screen Shot 2019-07-11 at 12.08.09 PM.png
>
>
> We observed a deadlock-like situation in our Kafka Streams application when we accidentally shut down all the brokers. The Kafka cluster was brought back in about an hour. 
> Observations made:
>  # Normal Kafka producers and consumers started working fine after the brokers were up again. 
>  # The Kafka Streams applications were stuck in the "rebalancing" state.
>  # The Kafka Streams apps have exactly-once semantics enabled.
>  # The stack traces showed most of the stream threads sending join group requests to the group coordinator.
>  # A few stream threads could not initiate the join group request because the call to org.apache.kafka.clients.producer.KafkaProducer#sendOffsetsToTransaction was stuck.
>  # It appears the join group requests were parked at the coordinator because the expected members had not sent their own join group requests.
>  # After the timeout, the stream threads that were not stuck sent new join group requests.
>  # Possibly (6) and (7) repeat indefinitely.
>  # Sample values of the GroupMetadata object on the group coordinator - !Screen Shot 2019-07-11 at 12.08.09 PM.png!
>  # The list of notYetJoinedMembers client ids matched the threads waiting for their offsets to be committed:
> {code:java}
> [List(MemberMetadata(memberId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38-consumer-efa41349-3da1-43b6-9710-a662f68c63b1, clientId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38-consumer, clientHost=/10.136.98.48, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), 
> MemberMetadata(memberId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36-consumer-7cc8e41b-ad98-4006-a18a-b22abe6350f4, clientId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36-consumer, clientHost=/10.136.103.148, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), 
> MemberMetadata(memberId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27-consumer-9ffb96c1-3379-4cbd-bee1-5d4719fe6c9d, clientId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27-consumer, clientHost=/10.136.98.48, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), 
> MemberMetadata(memberId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21-consumer-5b8a1f1f-84dd-4a87-86c8-7542c0e50d1f, clientId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21-consumer, clientHost=/10.136.103.148, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), 
> MemberMetadata(memberId=metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33-consumer-3cb67ec9-c548-4386-962d-64d9772bf719, clientId=metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33-consumer, clientHost=/10.136.99.15, sessionTimeoutMs=15000, rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ))]
> vabhinandan-mac:mp-jstack varsha.abhinandan$ cat jstack.* | grep "metric-extractor-stream-c1-" | grep "StreamThread-" | grep "waiting on condition"
> "metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36" #128 daemon prio=5 os_prio=0 tid=0x7fc53c047800 nid=0xac waiting on condition [0x7fc4e68e7000]
> "metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21" #93 daemon prio=5 os_prio=0 tid=0x7fc53c2b5800 nid=0x9d waiting on condition [0x7fc4e77f6000]
> "metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33" #125 daemon prio=5 os_prio=0 tid=0x7fe18017c800 nid=0xbc waiting on condition [0x7fe12e7e8000]
> "metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38" #154 
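
A note on the MemberMetadata fields quoted above: for the Java consumer, sessionTimeoutMs is taken from session.timeout.ms and rebalanceTimeoutMs from max.poll.interval.ms, which this Streams version overrides by default to Integer.MAX_VALUE - hence rebalanceTimeoutMs=2147483647. The configuration sketch below is hypothetical (the application id matches the client-id prefix above, but the bootstrap servers and the explicit 15-second session timeout are assumptions) and is only meant to show where those two numbers come from.
{code:java}
// Hypothetical configuration sketch; the bootstrap servers and the explicit
// session.timeout.ms value are assumptions, not values taken from this report.
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class MetricExtractorConfigSketch {

    public static Properties streamsConfig() {
        Properties props = new Properties();
        // Matches the client-id prefix seen in the GroupMetadata dump above.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "metric-extractor-stream-c1");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        // Observation (3): exactly-once semantics, i.e. a transactional producer per task.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        // Shows up as MemberMetadata.sessionTimeoutMs=15000 (assumed to be set explicitly by the app).
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 15000);
        // max.poll.interval.ms is deliberately not set here: the Streams default in this version
        // is Integer.MAX_VALUE, which is what appears as rebalanceTimeoutMs=2147483647.
        return props;
    }
}
{code}
Presumably the consumer's background heartbeat thread keeps the 15-second session alive even while the stream thread is parked inside sendOffsetsToTransaction, so the coordinator does not evict the stuck members via the session timeout; it waits for them to rejoin, bounded only by the effectively unbounded rebalance timeout.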

[jira] [Updated] (KAFKA-8673) Kafka stream threads stuck while sending offsets to transaction preventing join group from completing

2019-07-16 Thread Varsha Abhinandan (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varsha Abhinandan updated KAFKA-8673:
-
Description: 
We observed a deadlock kind of a situation in our Kafka streams application 
when we accidentally shut down all the brokers. The Kafka cluster was brought 
back in about an hour. 

Observations made :
 # Normal Kafka producers and consumers started working fine after the brokers 
were up again. 
 # The Kafka streams applications were stuck in the "rebalancing" state.
 # The Kafka streams apps have exactly-once semantics enabled.
 # The stack trace showed most of the stream threads sending the join group 
requests to the group co-ordinator
 # Few stream threads couldn't initiate the join group request since the call 
to 
[org.apache.kafka.clients.producer.KafkaProducer#sendOffsetsToTransaction|https://jira.corp.appdynamics.com/browse/ANLYTCS_ES-2062#sendOffsetsToTransaction%20which%20was%20hung]
 was stuck.
 # Seems like the join group requests were getting parked at the coordinator 
since the expected members hadn't sent their own group join requests
 # And after the timeout, the stream threads that were not stuck sent a new 
join group requests.  
 # Maybe (6) and (7) is happening infinitely
 # Sample values of the GroupMetadata object on the group co-ordinator  !Screen 
Shot 2019-07-11 at 12.08.09 PM.png!
 # The list of notYetJoinedMembers client id's matched with the threads waiting 
for their offsets to be committed. 
{code:java}
[List(MemberMetadata(memberId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38-consumer-efa41349-3da1-43b6-9710-a662f68c63b1,
 
clientId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38-consumer,
 clientHost=/10.136.98.48, sessionTimeoutMs=15000, 
rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), 
MemberMetadata(memberId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36-consumer-7cc8e41b-ad98-4006-a18a-b22abe6350f4,
 
clientId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36-consumer,
 clientHost=/10.136.103.148, sessionTimeoutMs=15000, 
rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), 
MemberMetadata(memberId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27-consumer-9ffb96c1-3379-4cbd-bee1-5d4719fe6c9d,
 
clientId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27-consumer,
 clientHost=/10.136.98.48, sessionTimeoutMs=15000, 
rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), 
MemberMetadata(memberId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21-consumer-5b8a1f1f-84dd-4a87-86c8-7542c0e50d1f,
 
clientId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21-consumer,
 clientHost=/10.136.103.148, sessionTimeoutMs=15000, 
rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), 
MemberMetadata(memberId=metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33-consumer-3cb67ec9-c548-4386-962d-64d9772bf719,
 
clientId=metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33-consumer,
 clientHost=/10.136.99.15, sessionTimeoutMs=15000, 
rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ))]

vabhinandan-mac:mp-jstack varsha.abhinandan$ cat jstack.* | grep 
"metric-extractor-stream-c1-" | grep "StreamThread-" | grep "waiting on 
condition"
"metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36"
 #128 daemon prio=5 os_prio=0 tid=0x7fc53c047800 nid=0xac waiting on 
condition [0x7fc4e68e7000]
"metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21"
 #93 daemon prio=5 os_prio=0 tid=0x7fc53c2b5800 nid=0x9d waiting on 
condition [0x7fc4e77f6000]
"metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33"
 #125 daemon prio=5 os_prio=0 tid=0x7fe18017c800 nid=0xbc waiting on 
condition [0x7fe12e7e8000]
"metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38"
 #154 daemon prio=5 os_prio=0 tid=0x7f27c4225800 nid=0xc4 waiting on 
condition [0x7f2772bec000]
"metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27"
 #143 daemon prio=5 os_prio=0 tid=0x7f27c4365800 nid=0xb9 waiting on 
condition [0x7f27736f7000]
{code}
 

  was:
We observed a deadlock kind of a situation in our Kafka streams application 
when we accidentally shut down all the brokers. The Kafka cluster was brought 
back in about an hour. 

Observations made :
 # Normal Kafka producers and consumers started working fine after the brokers 
were up again. 
 # The Kafka streams applications were stuck in the "rebalancing" state.
 # The Kafka streams apps have exactly-once semantics enabled.
 # The stack 

[jira] [Updated] (KAFKA-8673) Kafka stream threads stuck while sending offsets to transaction preventing join group from completing

2019-07-16 Thread Varsha Abhinandan (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varsha Abhinandan updated KAFKA-8673:
-
Description: 
We observed a deadlock kind of a situation in our Kafka streams application 
when we accidentally shut down all the brokers. The Kafka cluster was brought 
back in about an hour. 

Observations made :
 # Normal Kafka producers and consumers started working fine after the brokers 
were up again. 
 # The Kafka streams applications were stuck in the "rebalancing" state.
 # The Kafka streams apps have exactly-once semantics enabled.
 # The stack trace showed most of the stream threads sending the join group 
requests to the group co-ordinator
 # Few stream threads couldn't initiate the join group request since the call 
to 
[org.apache.kafka.clients.producer.KafkaProducer#sendOffsetsToTransaction|https://jira.corp.appdynamics.com/browse/ANLYTCS_ES-2062#sendOffsetsToTransaction%20which%20was%20hung]
 was stuck.
 # Seems like the join group requests were getting parked at the coordinator 
since the expected members hadn't sent their own group join requests
 # And after the timeout, the stream threads that were not stuck sent a new 
join group requests.  
 # Maybe (6) and (7) is happening infinitely
 # Sample values of the GroupMetadata object on the group co-ordinator  !Screen 
Shot 2019-07-11 at 12.08.09 PM.png|width=795,height=132!
 # The list of notYetJoinedMembers client id's matched with the threads waiting 
for their offsets to be committed. 
{code:java}
[List(MemberMetadata(memberId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38-consumer-efa41349-3da1-43b6-9710-a662f68c63b1,
 
clientId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38-consumer,
 clientHost=/10.136.98.48, sessionTimeoutMs=15000, 
rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), 
MemberMetadata(memberId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36-consumer-7cc8e41b-ad98-4006-a18a-b22abe6350f4,
 
clientId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36-consumer,
 clientHost=/10.136.103.148, sessionTimeoutMs=15000, 
rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), 
MemberMetadata(memberId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27-consumer-9ffb96c1-3379-4cbd-bee1-5d4719fe6c9d,
 
clientId=metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27-consumer,
 clientHost=/10.136.98.48, sessionTimeoutMs=15000, 
rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), 
MemberMetadata(memberId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21-consumer-5b8a1f1f-84dd-4a87-86c8-7542c0e50d1f,
 
clientId=metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21-consumer,
 clientHost=/10.136.103.148, sessionTimeoutMs=15000, 
rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ), 
MemberMetadata(memberId=metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33-consumer-3cb67ec9-c548-4386-962d-64d9772bf719,
 
clientId=metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33-consumer,
 clientHost=/10.136.99.15, sessionTimeoutMs=15000, 
rebalanceTimeoutMs=2147483647, supportedProtocols=List(stream), ))]

vabhinandan-mac:mp-jstack varsha.abhinandan$ cat jstack.* | grep 
"metric-extractor-stream-c1-" | grep "StreamThread-" | grep "waiting on 
condition"
"metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-36"
 #128 daemon prio=5 os_prio=0 tid=0x7fc53c047800 nid=0xac waiting on 
condition [0x7fc4e68e7000]
"metric-extractor-stream-c1-4875282b-1f26-47cd-affd-23ba5f26787a-StreamThread-21"
 #93 daemon prio=5 os_prio=0 tid=0x7fc53c2b5800 nid=0x9d waiting on 
condition [0x7fc4e77f6000]
"metric-extractor-stream-c1-994cee9b-b79b-483b-97cd-f89e8cbb015a-StreamThread-33"
 #125 daemon prio=5 os_prio=0 tid=0x7fe18017c800 nid=0xbc waiting on 
condition [0x7fe12e7e8000]
"metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-38"
 #154 daemon prio=5 os_prio=0 tid=0x7f27c4225800 nid=0xc4 waiting on 
condition [0x7f2772bec000]
"metric-extractor-stream-c1-d9ac8890-cd80-4b75-a85a-2ff39ea27961-StreamThread-27"
 #143 daemon prio=5 os_prio=0 tid=0x7f27c4365800 nid=0xb9 waiting on 
condition [0x7f27736f7000]
{code}
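
For context, here is a minimal, hedged sketch of the exactly-once commit path 
the stack traces point at. This is not the Kafka Streams internals; it uses the 
plain producer API that Streams relies on, and the broker address, topic names, 
group id and transactional id are made up for illustration. The point is that, 
with the brokers (and hence the transaction coordinator) unreachable, the 
sendOffsetsToTransaction call can block, and while the owning thread is blocked 
it cannot rejoin the consumer group.
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringSerializer;

public class EosCommitSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker address and ids, for illustration only.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "metric-extractor-txn-0");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("output-topic", "key", "value"));

            Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
            offsets.put(new TopicPartition("input-topic", 0), new OffsetAndMetadata(42L));

            // If the brokers / transaction coordinator are unreachable, this call can
            // block for a long time; while the owning thread is blocked here it cannot
            // answer the rebalance, so the coordinator keeps waiting for its JoinGroup
            // (observations 5-7 above).
            producer.sendOffsetsToTransaction(offsets, "metric-extractor-group");

            producer.commitTransaction();
        }
    }
}
{code}
Note also that rebalanceTimeoutMs=2147483647 in the GroupMetadata dump above is 
Integer.MAX_VALUE, which appears to correspond to the very large 
max.poll.interval.ms that Kafka Streams configured by default in this version, 
so the coordinator effectively waits indefinitely for the stuck members to 
rejoin.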
 

  was:
We observed a deadlock-like situation in our Kafka Streams application when we 
accidentally shut down all the brokers. The Kafka cluster was brought back up 
in about an hour. 

Observations made:
 # Normal Kafka producers and consumers started working fine after the brokers 
were back up. 
 # The Kafka Streams applications remained stuck in the "rebalancing" state.
 # The Kafka Streams apps have exactly-once semantics 

[jira] [Updated] (KAFKA-8673) Kafka stream threads stuck while sending offsets to transaction preventing join group from completing

2019-07-16 Thread Varsha Abhinandan (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varsha Abhinandan updated KAFKA-8673:
-
Description: 
We observed a deadlock-like situation in our Kafka Streams application when we 
accidentally shut down all the brokers. The Kafka cluster was brought back up 
in about an hour. 

Observations made:
 # Normal Kafka producers and consumers started working fine after the brokers 
were back up. 
 # The Kafka Streams applications remained stuck in the "rebalancing" state.
 # The Kafka Streams apps have exactly-once semantics enabled.
 # The stack traces showed most of the stream threads sending join group 
requests to the group coordinator.
 # A few stream threads couldn't initiate the join group request because the call 
to 
[org.apache.kafka.clients.producer.KafkaProducer#sendOffsetsToTransaction|https://jira.corp.appdynamics.com/browse/ANLYTCS_ES-2062#sendOffsetsToTransaction%20which%20was%20hung]
 was stuck.
 # The join group requests appear to have been parked at the coordinator 
because the expected members hadn't sent their own join group requests.
 # After the timeout, the stream threads that were not stuck sent new join 
group requests.
 # Steps (6) and (7) appear to repeat indefinitely.
 # Sample values of the GroupMetadata object on the group coordinator:  !Screen 
Shot 2019-07-11 at 12.08.09 PM.png|width=319,height=53!

> Kafka stream threads stuck while sending offsets to transaction preventing 
> join group from completing
> -
>
> Key: KAFKA-8673
> URL: https://issues.apache.org/jira/browse/KAFKA-8673
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer, streams
>Affects Versions: 2.2.0
>Reporter: Varsha Abhinandan
>Priority: Major
>
> We observed a deadlock-like situation in our Kafka Streams application when 
> we accidentally shut down all the brokers. The Kafka cluster was brought back 
> up in about an hour. 
> Observations made:
>  # Normal Kafka producers and consumers started working fine after the 
> brokers were back up. 
>  # The Kafka Streams applications remained stuck in the "rebalancing" state.
>  # The Kafka Streams apps have exactly-once semantics enabled.
>  # The stack traces showed most of the stream threads sending join group 
> requests to the group coordinator.
>  # A few stream threads couldn't initiate the join group request because the call 
> to 
> [org.apache.kafka.clients.producer.KafkaProducer#sendOffsetsToTransaction|https://jira.corp.appdynamics.com/browse/ANLYTCS_ES-2062#sendOffsetsToTransaction%20which%20was%20hung]
>  was stuck.
>  # The join group requests appear to have been parked at the coordinator 
> because the expected members hadn't sent their own join group requests.
>  # After the timeout, the stream threads that were not stuck sent new join 
> group requests.
>  # Steps (6) and (7) appear to repeat indefinitely.
>  # Sample values of the GroupMetadata object on the group coordinator:  
> !Screen Shot 2019-07-11 at 12.08.09 PM.png|width=319,height=53!



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (KAFKA-8673) Kafka stream threads stuck while sending offsets to transaction preventing join group from completing

2019-07-16 Thread Varsha Abhinandan (JIRA)
Varsha Abhinandan created KAFKA-8673:


 Summary: Kafka stream threads stuck while sending offsets to 
transaction preventing join group from completing
 Key: KAFKA-8673
 URL: https://issues.apache.org/jira/browse/KAFKA-8673
 Project: Kafka
  Issue Type: Bug
  Components: consumer, streams
Affects Versions: 2.2.0
Reporter: Varsha Abhinandan






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)