[ https://issues.apache.org/jira/browse/FLINK-32230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antonio Vespoli updated FLINK-32230:
------------------------------------
    Description: 
Connector calls to AWS Kinesis Data Streams can hang indefinitely without 
making any progress.

We suspect the root cause is related to the SDK's handling of exceptions, 
similar to what was observed in FLINK-31675.
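
For illustration only, here is a minimal, self-contained sketch of the suspected failure mode (the names are hypothetical and do not reflect the actual SDK internals): if the SDK's error path swallows an exception instead of completing the response future exceptionally, any caller waiting on that future hangs forever.

{code:java}
import java.util.concurrent.CompletableFuture;

public class UncompletedFutureHang {

    // Stand-in for an async SDK call whose error path loses the exception.
    static CompletableFuture<String> putRecordsAsync(boolean failInternally) {
        CompletableFuture<String> response = new CompletableFuture<>();
        new Thread(() -> {
            try {
                if (failInternally) {
                    throw new RuntimeException("internal transport error");
                }
                response.complete("OK");
            } catch (RuntimeException e) {
                // The bug being illustrated: the exception is dropped here
                // instead of calling response.completeExceptionally(e),
                // so the future never completes.
            }
        }).start();
        return response;
    }

    public static void main(String[] args) {
        System.out.println(putRecordsAsync(false).join()); // prints "OK"
        putRecordsAsync(true).join(); // blocks forever, mirroring the hang
    }
}
{code}

Running this, the first call returns immediately while the second never returns. That matches the behaviour we observe: no error surfaces, the call simply never completes.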

We identified this deadlock in applications running on AWS Kinesis Data 
Analytics that use the AWS Kinesis Data Streams AsyncSink (with AWS SDK version 
2.20.32, as per FLINK-31675). The deadlock scenario is the same as described in 
FLINK-31675; however, the Netty content-length exception no longer appears with 
the updated SDK version.

This issue occurs only for applications and streams in the AWS regions 
_ap-northeast-3_ and _us-gov-east-1_. We did not observe it in any other AWS 
region.

The issue occurs sporadically and unpredictably; given its nature, we do not 
have steps to reproduce it.

Example of a flame graph observed when the issue occurs:
{code:java}
root
java.lang.Thread.run:829
org.apache.flink.runtime.taskmanager.Task.run:568
org.apache.flink.runtime.taskmanager.Task.doRun:746
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke:932
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring:953
org.apache.flink.runtime.taskmanager.Task$$Lambda$1253/0x0000000800ecbc40.run:-1
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke:753
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop:804
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop:203
org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$907/0x0000000800bf7840.runDefaultAction:-1
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput:519
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput:65
org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext:110
org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext:159
org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.handleEvent:181
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.processBarrier:231
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.markCheckpointAlignedAndTransformState:262
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$$Lambda$1586/0x00000008012c5c40.apply:-1
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.lambda$processBarrier$2:234
org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.barrierReceived:66
org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.triggerGlobalCheckpoint:74
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$ControllerImpl.triggerGlobalCheckpoint:493
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.access$100:64
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.triggerCheckpoint:287
org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.notifyCheckpoint:147
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier:1198
org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint:1241
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing:50
org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$1453/0x000000080128e840.run:-1
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$12:1253
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState:300
org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.prepareSnapshotPreBarrier:89
org.apache.flink.streaming.runtime.operators.sink.SinkWriterOperator.prepareSnapshotPreBarrier:165
org.apache.flink.connector.base.sink.writer.AsyncSinkWriter.flush:494
org.apache.flink.connector.base.sink.writer.AsyncSinkWriter.yieldIfThereExistsInFlightRequests:503
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.yield:84
org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.take:149
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await:2211
java.util.concurrent.locks.LockSupport.parkNanos:234
jdk.internal.misc.Unsafe.park:-2 {code}
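
The stack shows the task thread parked inside AsyncSinkWriter.flush on the checkpoint-barrier path (SinkWriterOperator.prepareSnapshotPreBarrier): the writer yields to the task mailbox while requests are in flight, and completed requests are delivered back as mailbox letters. If the SDK never completes a request's future, no letter ever arrives and TaskMailboxImpl.take parks indefinitely. A self-contained toy model of that loop (hypothetical names, paraphrasing rather than copying the Flink source):

{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class MailboxYieldDeadlock {
    private final BlockingQueue<Runnable> mailbox = new LinkedBlockingQueue<>();
    private int inFlightRequests = 1; // one request whose callback never fires

    void flush() throws InterruptedException {
        while (inFlightRequests > 0) {
            // Equivalent of MailboxExecutorImpl.yield -> TaskMailboxImpl.take:
            // blocks until a letter arrives. The letter that would decrement
            // inFlightRequests is never enqueued, because the SDK never
            // completes the request, so this parks forever:
            // TaskMailboxImpl.take -> ConditionObject.await -> Unsafe.park.
            mailbox.take().run();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hangs indefinitely, matching the parked stack in the flame graph.
        new MailboxYieldDeadlock().flush();
    }
}
{code}

In the real writer the decrement happens in the request completion handler, which is why a single request whose future never completes wedges the entire checkpoint path.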
 



> Deadlock in AWS Kinesis Data Streams AsyncSink connector
> --------------------------------------------------------
>
>                 Key: FLINK-32230
>                 URL: https://issues.apache.org/jira/browse/FLINK-32230
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / AWS
>    Affects Versions: 1.15.4, 1.16.2, 1.17.1
>            Reporter: Antonio Vespoli
>            Priority: Major
>             Fix For: aws-connector-3.1.0, aws-connector-4.2.0
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
