[jira] [Updated] (FLINK-39660) Kinesis Connector - Cascading failure in EFO subscription lifecycle

Chenyuan Lee (Jira) Mon, 11 May 2026 18:20:41 -0700


     [ https://issues.apache.org/jira/browse/FLINK-39660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Chenyuan Lee updated FLINK-39660:
---------------------------------
    Description: 

```markdown
## Overview

Under certain conditions, the Kinesis EFO subscription lifecycle in 
`FanOutKinesisShardSubscription` enters a cascading failure: an initial 
disruption in one subscription propagates into repeated subscription failures 
across many shards, causing the connector to stall.

## Observed Symptoms

- `ClosedChannelException` bursts affecting many shards simultaneously.
- `AcquireTimeoutException: Acquire operation took longer than 60000 
milliseconds` errors across many shards.
- The same shard is retried from the same `startingMarker` for hundreds of 
iterations without forward progress.
- `Http2PingHandler` logs `PING timer scheduled after N ms` warnings clustered 
across many channels under load, indicating Netty event loop blocking during 
record processing.
- HTTP/2 connection exhaustion: the SDK connection pool fills with pending 
`subscribeToShard` calls that never complete, each holding a slot until its 
own 60-second acquire timeout fires.

## Logs

Example stack trace from an EFO subscription failure under cascading failure 
conditions:

```
Error onError subscribing to shard shardId-000000000144 with starting position ...

java.io.IOException: An error occurred on the connection: java.nio.channels.ClosedChannelException, [channel: 76d2a9d0]. All streams will be closed
    at software.amazon.awssdk.http.nio.netty.internal.http2.MultiplexedChannelRecord.decorateConnectionException(MultiplexedChannelRecord.java:213)
    at software.amazon.awssdk.http.nio.netty.internal.http2.MultiplexedChannelRecord.lambda$closeChildChannels$10(MultiplexedChannelRecord.java:205)
    at software.amazon.awssdk.http.nio.netty.internal.http2.MultiplexedChannelRecord.lambda$closeAndExecuteOnChildChannels$11(MultiplexedChannelRecord.java:229)
    at software.amazon.awssdk.http.nio.netty.internal.utils.NettyUtils.doInEventLoop(NettyUtils.java:259)
```

## Impact

Multiple shards stop making progress; consumption lag grows and the condition 
can persist without manual intervention.
```
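
When triaging logs for the symptoms listed above, it can help to count how often each exception shows up per shard. The following is a minimal sketch only: `EfoSymptomTally` and its `tally` method are hypothetical names, not part of the Flink connector or the AWS SDK. It assumes that a shard ID (`shardId-` followed by digits, as in the log excerpt) is logged shortly before the exception it belongs to, and that entries from different shards are not interleaved line by line.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical triage helper; not part of the Flink connector or the AWS SDK. */
final class EfoSymptomTally {
    // Shard IDs appear in the log excerpt as "shardId-" followed by digits.
    private static final Pattern SHARD = Pattern.compile("shardId-\\d+");

    // Exception names taken from the symptom list in the report.
    private static final List<String> SYMPTOMS =
            List.of("ClosedChannelException", "AcquireTimeoutException");

    /**
     * Counts symptom occurrences per shard. A symptom line is attributed to the
     * most recent shard ID seen on or before that line (assumption: log entries
     * from different shards are not interleaved line by line).
     */
    static Map<String, Map<String, Integer>> tally(List<String> logLines) {
        Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
        String currentShard = "unknown";
        for (String line : logLines) {
            Matcher m = SHARD.matcher(line);
            if (m.find()) {
                currentShard = m.group();
            }
            for (String symptom : SYMPTOMS) {
                if (line.contains(symptom)) {
                    counts.computeIfAbsent(currentShard, k -> new LinkedHashMap<>())
                          .merge(symptom, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "Error onError subscribing to shard shardId-000000000144 with starting position ...",
                "java.io.IOException: An error occurred on the connection: "
                        + "java.nio.channels.ClosedChannelException, [channel: 76d2a9d0]. "
                        + "All streams will be closed");
        System.out.println(tally(sample));
    }
}
```

A burst that affects many shards at once would show up as many keys in the returned map, each with a non-zero `ClosedChannelException` or `AcquireTimeoutException` count. In a real multi-threaded log the attribution heuristic may be wrong, so the result should be treated as a rough indicator.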

### JIRA wiki markup (for Apache JIRA description field)

```
h2. Overview

Under certain conditions, the Kinesis EFO subscription lifecycle in 
\{{FanOutKinesisShardSubscription}} enters a cascading failure: an initial 
disruption in one subscription propagates into repeated subscription failures 
across many shards, causing the connector to stall.

h2. Observed Symptoms

* \{{ClosedChannelException}} bursts affecting many shards simultaneously.
* \{{AcquireTimeoutException: Acquire operation took longer than 60000 
milliseconds}} errors across many shards.
* The same shard is retried from the same \{{startingMarker}} for hundreds of 
iterations without forward progress.
* \{{Http2PingHandler}} logs \{{PING timer scheduled after N ms}} warnings 
clustered across many channels under load, indicating Netty event loop blocking 
during record processing.
* HTTP/2 connection exhaustion: the SDK connection pool fills with pending 
\{{subscribeToShard}} calls that never complete, each holding a slot until its 
own 60-second acquire timeout fires.

h2. Logs

Example stack trace from an EFO subscription failure under cascading failure 
conditions:

{noformat}
Error onError subscribing to shard shardId-000000000144 with starting position ...

java.io.IOException: An error occurred on the connection: java.nio.channels.ClosedChannelException, [channel: 76d2a9d0]. All streams will be closed
    at software.amazon.awssdk.http.nio.netty.internal.http2.MultiplexedChannelRecord.decorateConnectionException(MultiplexedChannelRecord.java:213)
    at software.amazon.awssdk.http.nio.netty.internal.http2.MultiplexedChannelRecord.lambda$closeChildChannels$10(MultiplexedChannelRecord.java:205)
    at software.amazon.awssdk.http.nio.netty.internal.http2.MultiplexedChannelRecord.lambda$closeAndExecuteOnChildChannels$11(MultiplexedChannelRecord.java:229)
    at software.amazon.awssdk.http.nio.netty.internal.utils.NettyUtils.doInEventLoop(NettyUtils.java:259)
{noformat}

h2. Impact

Multiple shards stop making progress; consumption lag grows and the condition 
can persist without manual intervention.
```
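
The "retried from the same starting marker without forward progress" symptom can be made concrete with a small counter. This is a sketch under stated assumptions, not the connector's actual logic: `StalledShardDetector`, `recordAttempt`, and the threshold are hypothetical, and the marker is modelled as an opaque string because the report elides the real starting position.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stall detector; names and threshold are illustrative only. */
final class StalledShardDetector {
    private final int threshold;
    // Last starting marker seen for each shard.
    private final Map<String, String> lastMarker = new HashMap<>();
    // Number of consecutive attempts from that same marker.
    private final Map<String, Integer> repeats = new HashMap<>();

    StalledShardDetector(int threshold) {
        this.threshold = threshold;
    }

    /**
     * Records one subscription attempt. Returns true once the shard has been
     * attempted at least `threshold` consecutive times from the same marker,
     * i.e. it is retrying without forward progress.
     */
    boolean recordAttempt(String shardId, String startingMarker) {
        String previous = lastMarker.put(shardId, startingMarker);
        int count = startingMarker.equals(previous)
                ? repeats.getOrDefault(shardId, 1) + 1
                : 1; // first attempt, or the marker advanced: reset the streak
        repeats.put(shardId, count);
        return count >= threshold;
    }

    public static void main(String[] args) {
        StalledShardDetector detector = new StalledShardDetector(3);
        String shard = "shardId-000000000144";
        // "marker-A" is a placeholder value, not a real Kinesis position.
        for (int i = 0; i < 3; i++) {
            System.out.println(detector.recordAttempt(shard, "marker-A"));
        }
    }
}
```

Any change in the marker resets the streak, so a shard that advances slowly is not flagged. Only a shard that repeats the same marker is, which matches the "hundreds of iterations" pattern described in the report.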

 

  was:
Here you go. I'll give you both markdown (for GitHub PR / Slack) and JIRA wiki 
markup (for the Apache JIRA description).

### Markdown

```markdown
## Overview

Under certain conditions, the Kinesis EFO subscription lifecycle in 
`FanOutKinesisShardSubscription` enters a cascading failure: an initial 
disruption in one subscription propagates into repeated subscription failures 
across many shards, causing the connector to stall.

## Observed Symptoms

- `ClosedChannelException` bursts affecting many shards simultaneously.
- `AcquireTimeoutException: Acquire operation took longer than 60000 
milliseconds` errors across many shards.
- Same shard retried from the same `startingMarker` for hundreds of iterations 
without forward progress.
- `Http2PingHandler` logs `PING timer scheduled after N ms` warnings clustered 
across many channels under load, indicating Netty event loop blocking during 
record processing.
- HTTP/2 connection exhaustion: SDK pool fills with pending `subscribeToShard` 
calls that never complete, each consuming a slot until individual 60-second 
acquire timeouts trigger.

## Logs

Example stack trace from an EFO subscription failure under cascading failure 
conditions:

```
Error onError subscribing to shard shardId-000000000144 with starting position ...

java.io.IOException: An error occurred on the connection: java.nio.channels.ClosedChannelException, [channel: 76d2a9d0]. All streams will be closed
    at software.amazon.awssdk.http.nio.netty.internal.http2.MultiplexedChannelRecord.decorateConnectionException(MultiplexedChannelRecord.java:213)
    at software.amazon.awssdk.http.nio.netty.internal.http2.MultiplexedChannelRecord.lambda$closeChildChannels$10(MultiplexedChannelRecord.java:205)
    at software.amazon.awssdk.http.nio.netty.internal.http2.MultiplexedChannelRecord.lambda$closeAndExecuteOnChildChannels$11(MultiplexedChannelRecord.java:229)
    at software.amazon.awssdk.http.nio.netty.internal.utils.NettyUtils.doInEventLoop(NettyUtils.java:259)
```

## Impact

Multiple shards stop making progress; consumption lag grows and the condition 
can persist without manual intervention.
```

### JIRA wiki markup (for Apache JIRA description field)

```
h2. Overview

Under certain conditions, the Kinesis EFO subscription lifecycle in 
\{{FanOutKinesisShardSubscription}} enters a cascading failure: an initial 
disruption in one subscription propagates into repeated subscription failures 
across many shards, causing the connector to stall.

h2. Observed Symptoms

* \{{ClosedChannelException}} bursts affecting many shards simultaneously.
* \{{AcquireTimeoutException: Acquire operation took longer than 60000 
milliseconds}} errors across many shards.
* Same shard retried from the same \{{startingMarker}} for hundreds of 
iterations without forward progress.
* \{{Http2PingHandler}} logs \{{PING timer scheduled after N ms}} warnings 
clustered across many channels under load, indicating Netty event loop blocking 
during record processing.
* HTTP/2 connection exhaustion: SDK pool fills with pending 
\{{subscribeToShard}} calls that never complete, each consuming a slot until 
individual 60-second acquire timeouts trigger.

h2. Logs

Example stack trace from an EFO subscription failure under cascading failure 
conditions:

{noformat}
Error onError subscribing to shard shardId-000000000144 with starting position ...

java.io.IOException: An error occurred on the connection: java.nio.channels.ClosedChannelException, [channel: 76d2a9d0]. All streams will be closed
    at software.amazon.awssdk.http.nio.netty.internal.http2.MultiplexedChannelRecord.decorateConnectionException(MultiplexedChannelRecord.java:213)
    at software.amazon.awssdk.http.nio.netty.internal.http2.MultiplexedChannelRecord.lambda$closeChildChannels$10(MultiplexedChannelRecord.java:205)
    at software.amazon.awssdk.http.nio.netty.internal.http2.MultiplexedChannelRecord.lambda$closeAndExecuteOnChildChannels$11(MultiplexedChannelRecord.java:229)
    at software.amazon.awssdk.http.nio.netty.internal.utils.NettyUtils.doInEventLoop(NettyUtils.java:259)
{noformat}

h2. Impact

Multiple shards stop making progress; consumption lag grows and the condition 
can persist without manual intervention.
```

 


> Kinesis Connector - Cascading failure in EFO subscription lifecycle
> -------------------------------------------------------------------
>
>                 Key: FLINK-39660
>                 URL: https://issues.apache.org/jira/browse/FLINK-39660
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Kinesis
>    Affects Versions: aws-connector-5.0.0, aws-connector-6.0.0
>            Reporter: Chenyuan Lee
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
