[
https://issues.apache.org/jira/browse/FLINK-39660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated FLINK-39660:
-----------------------------------
Labels: pull-request-available (was: )
> Kinesis Connector - Cascading failure in EFO subscription lifecycle
> -------------------------------------------------------------------
>
> Key: FLINK-39660
> URL: https://issues.apache.org/jira/browse/FLINK-39660
> Project: Flink
> Issue Type: Bug
> Components: Connectors / Kinesis
> Affects Versions: aws-connector-5.0.0, aws-connector-6.0.0
> Reporter: Chenyuan Lee
> Priority: Major
> Labels: pull-request-available
>
> h2. Overview
> Under certain conditions, the Kinesis EFO subscription lifecycle in
> {{FanOutKinesisShardSubscription}} enters a cascading failure: an initial
> disruption in one subscription propagates into repeated subscription failures
> across many shards, causing the connector to stall.
> h2. Observed Symptoms
> * {{ClosedChannelException}} bursts affecting many shards simultaneously.
> * {{AcquireTimeoutException: Acquire operation took longer than 60000
> milliseconds}} errors across many shards.
> * The same shard is retried from the same {{startingMarker}} for hundreds of
> iterations without forward progress.
> * {{Http2PingHandler}} logs {{PING timer scheduled after N ms}} warnings
> clustered across many channels under load, indicating Netty event loop
> blocking during record processing.
> * HTTP/2 connection exhaustion: the SDK pool fills with pending
> {{subscribeToShard}} calls that never complete, each holding a slot until
> its own 60-second acquire timeout fires.
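The "same shard retried from the same startingMarker" symptom is easy to miss in raw logs, so a small counter can make it visible. Below is a minimal sketch of one way to flag it. It is not code from the connector: the class name StalledShardDetector, the threshold parameter, and the treatment of the starting marker as an opaque string are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical diagnostic helper; not part of the Flink Kinesis connector. */
class StalledShardDetector {

    /** Last starting marker seen for a shard and how often it was retried in a row. */
    private static final class Attempt {
        String marker;
        int count;
    }

    private final int threshold;
    private final Map<String, Attempt> attempts = new HashMap<>();

    /** @param threshold consecutive attempts from one marker that count as a stall (assumed value) */
    StalledShardDetector(int threshold) {
        this.threshold = threshold;
    }

    /**
     * Records one subscription attempt for a shard.
     *
     * @return true once the shard has been attempted from the same marker
     *         at least {@code threshold} times in a row
     */
    boolean recordAttempt(String shardId, String startingMarker) {
        Attempt a = attempts.computeIfAbsent(shardId, k -> new Attempt());
        if (startingMarker.equals(a.marker)) {
            a.count++;
        } else {
            // The marker moved, so the shard made progress: restart the count.
            a.marker = startingMarker;
            a.count = 1;
        }
        return a.count >= threshold;
    }

    public static void main(String[] args) {
        StalledShardDetector detector = new StalledShardDetector(3);
        for (int i = 0; i < 3; i++) {
            boolean stalled = detector.recordAttempt("shardId-000000000144", "marker-A");
            System.out.println("attempt " + (i + 1) + " stalled=" + stalled);
        }
    }
}
```

A caller would invoke recordAttempt each time a subscription is (re)started and log or alert when it returns true. The count resets as soon as the marker changes, so healthy shards never trip it.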
> h2. Logs
> Example stack trace from an EFO subscription failure under cascading failure
> conditions:
> {noformat}
> Error onError subscribing to shard shardId-000000000144 with starting position ...
> java.io.IOException: An error occurred on the connection: java.nio.channels.ClosedChannelException, [channel: 76d2a9d0]. All streams will be closed
> at software.amazon.awssdk.http.nio.netty.internal.http2.MultiplexedChannelRecord.decorateConnectionException(MultiplexedChannelRecord.java:213)
> at software.amazon.awssdk.http.nio.netty.internal.http2.MultiplexedChannelRecord.lambda$closeChildChannels$10(MultiplexedChannelRecord.java:205)
> at software.amazon.awssdk.http.nio.netty.internal.http2.MultiplexedChannelRecord.lambda$closeAndExecuteOnChildChannels$11(MultiplexedChannelRecord.java:229)
> at software.amazon.awssdk.http.nio.netty.internal.utils.NettyUtils.doInEventLoop(NettyUtils.java:259)
> {noformat}
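When triaging traces like the one above, it can help to tell connection-level closures apart from other subscription errors. The following is a rough sketch of such a check, not the connector's actual error handling. The class ConnectionErrorClassifier and its method are hypothetical. Matching on the message text is an assumption based only on the log line quoted here, where the exception name appears inside the IOException message; whether the SDK also attaches it as a cause is not confirmed by this report.

```java
import java.io.IOException;
import java.nio.channels.ClosedChannelException;

/** Hypothetical triage helper; not part of the Flink Kinesis connector or the AWS SDK. */
class ConnectionErrorClassifier {

    /** Upper bound on cause-chain depth, to guard against cyclic cause chains. */
    private static final int MAX_DEPTH = 32;

    /**
     * Returns true if the error, or anything in its cause chain, is a
     * ClosedChannelException or mentions one in its message.
     */
    static boolean isClosedChannel(Throwable error) {
        Throwable t = error;
        for (int depth = 0; t != null && depth < MAX_DEPTH; depth++) {
            if (t instanceof ClosedChannelException) {
                return true;
            }
            // Assumption: the wrapper may carry only the exception name in its message.
            String message = t.getMessage();
            if (message != null && message.contains("ClosedChannelException")) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable wrapped = new IOException("connection error", new ClosedChannelException());
        System.out.println(isClosedChannel(wrapped));
    }
}
```

Counting how many shards report such an error within a short window would be one way to distinguish a single dropped connection from the simultaneous bursts described in the symptoms.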
> h2. Impact
> Multiple shards stop making progress; consumption lag grows and the condition
> can persist without manual intervention.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)