Keith Lee created FLINK-37949:
---------------------------------
Summary: FlinkKinesisConsumer on EFO mode runs into Netty deadlock
on client close during stop-with-savepoint
Key: FLINK-37949
URL: https://issues.apache.org/jira/browse/FLINK-37949
Project: Flink
Issue Type: Bug
Components: Connectors / Kinesis
Affects Versions: 1.20.0
Reporter: Keith Lee
We ran into the following issue
* Flink job was continuously failing over and stop-with-savepoint savepoint was
failing.
* Savepoint was failing because task failed
* Task failed because of TimeoutException thrown from netty client when Kinesis
connector attempts to close netty client during Stop-with-savepoint operation
{quote} java.lang.RuntimeException: java.util.concurrent.TimeoutException
at
org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.internal.AwaitCloseChannelPoolMap.close(AwaitCloseChannelPoolMap.java:177)
at
org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.internal.utils.NettyUtils.runAndLogError(NettyUtils.java:386)
at
org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient.close(NettyNioAsyncHttpClient.java:198)
at
org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxyAsyncV2.close(KinesisProxyAsyncV2.java:72)
at
org.apache.flink.streaming.connectors.kinesis.internals.publisher.fanout.FanOutRecordPublisherFactory.close(FanOutRecordPublisherFactory.java:101)
at
org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.closeRecordPublisherFactory(KinesisDataFetcher.java:839)
at
org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.shutdownFetcher(KinesisDataFetcher.java:813)
at
org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.cancel(FlinkKinesisConsumer.java:422)
at
org.apache.flink.streaming.api.operators.StreamSource.stop(StreamSource.java:131)
at
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.stopOperatorForStopWithSavepoint(SourceStreamTask.java:311)
...
Caused by: java.util.concurrent.TimeoutException
at
java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
at
java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
at
org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.internal.AwaitCloseChannelPoolMap.close(AwaitCloseChannelPoolMap.java:172)
...{quote}
* Netty client failed to close probably due to deadlock. See exception thrown
by checkDeadLock() within netty
{quote}
org.apache.flink.kinesis.shaded.io.netty.util.concurrent.BlockingOperationException:
DefaultChannelPromise@731d6e17(incomplete)
at
org.apache.flink.kinesis.shaded.io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:463)
at
org.apache.flink.kinesis.shaded.io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:159)
at
org.apache.flink.kinesis.shaded.io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:269)
at
org.apache.flink.kinesis.shaded.io.netty.channel.DefaultChannelPromise.awaitUninterruptibly(DefaultChannelPromise.java:137)
at
org.apache.flink.kinesis.shaded.io.netty.channel.DefaultChannelPromise.awaitUninterruptibly(DefaultChannelPromise.java:30)
at
org.apache.flink.kinesis.shaded.io.netty.channel.pool.SimpleChannelPool.close(SimpleChannelPool.java:408)
at
org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.internal.BetterSimpleChannelPool.close(BetterSimpleChannelPool.java:38)
at
org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.internal.HonorCloseOnReleaseChannelPool.close(HonorCloseOnReleaseChannelPool.java:80)
at
org.apache.flink.kinesis.shaded.software.amazon.awssdk.http.nio.netty.internal.http2.Http2MultiplexedChannelPool.lambda$doClose$11(Http2MultiplexedChannelPool.java:419)
...{quote}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)