Chenyuan Lee created FLINK-39660:
------------------------------------

             Summary: Kinesis Connector - Cascading failure in EFO subscription lifecycle
                 Key: FLINK-39660
                 URL: https://issues.apache.org/jira/browse/FLINK-39660
             Project: Flink
          Issue Type: Bug
          Components: Connectors / Kinesis
    Affects Versions: aws-connector-6.0.0, aws-connector-5.0.0
            Reporter: Chenyuan Lee


### Overview

Under certain conditions, the Kinesis EFO subscription lifecycle in 
`FanOutKinesisShardSubscription` enters a cascading failure: an initial 
disruption in one subscription propagates into repeated subscription failures 
across many shards, causing the connector to stall.

### Observed Symptoms

- `ClosedChannelException` bursts affecting many shards simultaneously.
- `AcquireTimeoutException: Acquire operation took longer than 60000 
milliseconds` errors across many shards.
- The same shard is retried from the same `startingMarker` for hundreds of 
iterations without forward progress.
- `Http2PingHandler` logs `PING timer scheduled after N ms` warnings clustered 
across many channels under load, indicating Netty event loop blocking during 
record processing.
- HTTP/2 connection exhaustion: SDK pool fills with pending `subscribeToShard` 
calls that never complete, each consuming a slot until individual 60-second 
acquire timeouts trigger.
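
The pool-exhaustion symptom can be illustrated with a toy model. The sketch below is not the AWS SDK's connection pool and is not taken from the connector. It uses a plain `java.util.concurrent.Semaphore` as a stand-in for a bounded pool, and the class name, method name, and parameters are all hypothetical. It only shows the general mechanism: once calls that never complete hold every slot, every later caller waits for the full acquire timeout and then fails.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Toy model only: a Semaphore stands in for a bounded connection pool.
// All names here are hypothetical and do not correspond to SDK classes.
public class PoolExhaustionSketch {

    /**
     * Returns how many of `callers` fail to get a slot within `timeoutMs`
     * when `stuckCalls` earlier calls hold a slot and never release it.
     */
    static int countAcquireTimeouts(int poolSize, int stuckCalls,
                                    int callers, long timeoutMs) {
        Semaphore slots = new Semaphore(poolSize);

        // Stuck calls take a slot and never give it back.
        for (int i = 0; i < stuckCalls; i++) {
            slots.tryAcquire();
        }

        int timeouts = 0;
        for (int i = 0; i < callers; i++) {
            try {
                if (slots.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS)) {
                    // A healthy call completes and returns its slot.
                    slots.release();
                } else {
                    // Analogous to an acquire timeout in a real pool.
                    timeouts++;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                timeouts++;
            }
        }
        return timeouts;
    }

    public static void main(String[] args) {
        // Pool fully held by stuck calls: every caller times out.
        System.out.println(countAcquireTimeouts(4, 4, 3, 50));
        // Pool only half held: callers still get through.
        System.out.println(countAcquireTimeouts(4, 2, 3, 50));
    }
}
```

In this model the failure is all-or-nothing per pool: callers succeed while any slot is free and all time out once none is, which is consistent with the timeouts appearing across many shards at once rather than on a single shard.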

### Logs

Example stack trace from an EFO subscription failure under cascading failure 
conditions:

```
Error onError subscribing to shard shardId-000000000144 with starting position 
...

java.io.IOException: An error occurred on the connection: java.nio.channels.ClosedChannelException, [channel: 76d2a9d0]. All streams will be closed
    at software.amazon.awssdk.http.nio.netty.internal.http2.MultiplexedChannelRecord.decorateConnectionException(MultiplexedChannelRecord.java:213)
    at software.amazon.awssdk.http.nio.netty.internal.http2.MultiplexedChannelRecord.lambda$closeChildChannels$10(MultiplexedChannelRecord.java:205)
    at software.amazon.awssdk.http.nio.netty.internal.http2.MultiplexedChannelRecord.lambda$closeAndExecuteOnChildChannels$11(MultiplexedChannelRecord.java:229)
    at software.amazon.awssdk.http.nio.netty.internal.utils.NettyUtils.doInEventLoop(NettyUtils.java:259)
```

### Impact

Multiple shards stop making progress; consumption lag grows, and the condition 
can persist until someone intervenes manually.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
