[ https://issues.apache.org/jira/browse/FLINK-35151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840062#comment-17840062 ]
Guozhen Yang commented on FLINK-35151:
--------------------------------------

We encountered the same issue recently.

It appears to be caused by sharing the same BinaryLogClient instance (wrapped in MySqlTaskContextImpl) between the producer thread named "blc-hostname:port" and the consumer thread named like "Source Data Fetcher".

When the ChangeEventQueue is full, the BinaryLogClient in the "blc-hostname:port" thread blocks with the call stack shown below (line numbers may be off, since I added some logging):

{code:java}
"blc-hostname:port" Id=3271 TIMED_WAITING on io.debezium.connector.base.ChangeEventQueue@5e00c81c
    at java.base@11.0.22/java.lang.Object.wait(Native Method)
    - waiting on io.debezium.connector.base.ChangeEventQueue@5e00c81c
    at app//io.debezium.connector.base.ChangeEventQueue.doEnqueue(ChangeEventQueue.java:204)
    at app//io.debezium.connector.base.ChangeEventQueue.enqueue(ChangeEventQueue.java:169)
    at app//io.debezium.pipeline.EventDispatcher$StreamingChangeRecordReceiver.changeRecord(EventDispatcher.java:405)
    at app//io.debezium.pipeline.EventDispatcher$2.changeRecord(EventDispatcher.java:229)
    at app//io.debezium.relational.RelationalChangeRecordEmitter.emitUpdateRecord(RelationalChangeRecordEmitter.java:160)
    at app//io.debezium.relational.RelationalChangeRecordEmitter.emitChangeRecords(RelationalChangeRecordEmitter.java:60)
    at app//io.debezium.pipeline.EventDispatcher.dispatchDataChangeEvent(EventDispatcher.java:209)
{code}

You can follow the call stack all the way down to [BinaryLogClient.java#L631|https://github.com/osheroff/mysql-binlog-connector-java/blob/0.27.2/src/main/java/com/github/shyiko/mysql/binlog/BinaryLogClient.java#L631]. The "blc-hostname:port" thread blocks inside the listenForEventPackets call and never reaches the finally block, so connectLock is never released.

The "blc-hostname:port" producer thread is blocked waiting for queue space, and only after getting it would it release connectLock.
The "Source Data Fetcher" consumer thread, in turn, is blocked waiting for that lock to be released, and only after getting it would it free queue space. The result is a deadlock.

We simply stopped using MySqlTaskContextImpl in StatefulTaskContext, so no BinaryLogClient is reused, and we have not seen the issue since.

> Flink mysql cdc will stuck when suspend binlog split and ChangeEventQueue is full
> ----------------------------------------------------------------------------------
>
>                 Key: FLINK-35151
>                 URL: https://issues.apache.org/jira/browse/FLINK-35151
>             Project: Flink
>          Issue Type: Bug
>          Components: Flink CDC
>         Environment: I reproduced it on the master branch.
>            Reporter: Xin Gong
>            Priority: Major
>         Attachments: dumpstack.txt
>
>
> Flink MySQL CDC gets stuck when the binlog split is suspended and the ChangeEventQueue is full.
> The reason is that binlog events are produced too fast.
> After MySqlSourceReader receives the binlog split update event, MySqlSplitReader#suspendBinlogReaderIfNeed executes BinlogSplitReader#stopBinlogReadTask to set currentTaskRunning to false.
> MySqlSplitReader#pollSplitRecords then runs, and because dataIt is null and currentReader is a BinlogSplitReader, it calls closeBinlogReader. closeBinlogReader calls statefulTaskContext.getBinaryLogClient().disconnect(), which can deadlock, because BinaryLogClient#connectLock is not released while MySqlStreamingChangeEventSource is adding an element to the full queue.
>
> To reproduce, set the StatefulTaskContext#queue size to 1 and run the unit test NewlyAddedTableITCase#testRemoveAndAddNewTable.
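
For illustration, the lock/queue cycle described in this thread can be reduced to a minimal standalone Java sketch. It does not use Flink CDC, Debezium or mysql-binlog-connector-java code: the class name ConnectLockDeadlockSketch, the method disconnectIsBlocked, and the thread name "blc-sketch" are hypothetical. A plain ReentrantLock stands in for BinaryLogClient#connectLock, and an ArrayBlockingQueue of capacity 1 stands in for a full ChangeEventQueue. The timed tryLock is an assumption made only so the demo terminates instead of hanging.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class ConnectLockDeadlockSketch {

    /**
     * Returns true if the "disconnect" side could not obtain the lock while
     * the producer was stuck enqueueing into a full queue.
     */
    static boolean disconnectIsBlocked() {
        // Stand-in for BinaryLogClient#connectLock (hypothetical simplification).
        ReentrantLock connectLock = new ReentrantLock();
        // Stand-in for a ChangeEventQueue with capacity 1 that is already full.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
        queue.add("event-0");
        CountDownLatch producerHoldsLock = new CountDownLatch(1);

        // Producer: holds the lock for as long as it is dispatching events.
        Thread producer = new Thread(() -> {
            connectLock.lock();
            try {
                producerHoldsLock.countDown();
                queue.put("event-1"); // blocks: the queue is full and nobody drains it
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                connectLock.unlock(); // only reached once put() returns or is interrupted
            }
        }, "blc-sketch");
        producer.setDaemon(true);
        producer.start();

        try {
            producerHoldsLock.await();
            // Consumer: wants the lock (as a disconnect would) before it polls the queue again.
            boolean acquired = connectLock.tryLock(500, TimeUnit.MILLISECONDS);
            if (acquired) {
                connectLock.unlock();
            }
            // Break the cycle so the sketch can exit.
            producer.interrupt();
            producer.join();
            return !acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("disconnect blocked: " + disconnectIsBlocked());
    }
}
```

In the sketch, either draining the queue before asking for the lock or giving each side its own client (which is the effect of the workaround mentioned above) removes the cycle.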