[jira] [Commented] (FLINK-35151) Flink mysql cdc will stuck when suspend binlog split and ChangeEventQueue is full

2024-04-23 Thread Guozhen Yang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840062#comment-17840062
 ] 

Guozhen Yang commented on FLINK-35151:
--

We encountered the same issue recently.

It may be caused by using the same BinaryLogClient instance wrapped in 
MySqlTaskContextImpl in producer thread named "blc-hostname:port" and consumer 
thread named like "Source Data Fetcher".
When the ChangeEventQueue is full, the BinaryLogClient in "blc-hostname:port" 
thread blocks at call stack shown as below (line number may be wrong since I 
added some logs)
{code:java}
"blc-hostname:port" Id=3271 TIMED_WAITING on 
io.debezium.connector.base.ChangeEventQueue@5e00c81c
    at java.base@11.0.22/java.lang.Object.wait(Native Method)
    -  waiting on io.debezium.connector.base.ChangeEventQueue@5e00c81c
    at 
app//io.debezium.connector.base.ChangeEventQueue.doEnqueue(ChangeEventQueue.java:204)
    at 
app//io.debezium.connector.base.ChangeEventQueue.enqueue(ChangeEventQueue.java:169)
    at 
app//io.debezium.pipeline.EventDispatcher$StreamingChangeRecordReceiver.changeRecord(EventDispatcher.java:405)
    at 
app//io.debezium.pipeline.EventDispatcher$2.changeRecord(EventDispatcher.java:229)
    at 
app//io.debezium.relational.RelationalChangeRecordEmitter.emitUpdateRecord(RelationalChangeRecordEmitter.java:160)
    at 
app//io.debezium.relational.RelationalChangeRecordEmitter.emitChangeRecords(RelationalChangeRecordEmitter.java:60)
    at 
app//io.debezium.pipeline.EventDispatcher.dispatchDataChangeEvent(EventDispatcher.java:209)
 {code}
You can dig into the call stack all the way down to 
[BinaryLogClient.java#L631|https://github.com/osheroff/mysql-binlog-connector-java/blob/0.27.2/src/main/java/com/github/shyiko/mysql/binlog/BinaryLogClient.java#L631].
 The "blc-hostname:port" thread blocks at listenForEventPackets method call and 
never enters the finally block. So connectLock never got released.

The "blc-hostname:port" producer thread blocks waiting for queue space, after 
which it will release connectLock. And "Source Data Fetcher" consumer thread 
blocks waiting for lock to be released, after which it will release queue 
space. So there's dead lock.

We simply stop using MySqlTaskContextImpl in StatefulTaskContext, and there 
won't be any BinaryLogClient reusing and don't see the issue any further.

> Flink mysql cdc will  stuck when suspend binlog split and ChangeEventQueue is 
> full
> --
>
> Key: FLINK-35151
> URL: https://issues.apache.org/jira/browse/FLINK-35151
> Project: Flink
>  Issue Type: Bug
>  Components: Flink CDC
> Environment: I use master branch reproduce it.
>Reporter: Xin Gong
>Priority: Major
> Attachments: dumpstack.txt
>
>
> Flink mysql cdc will  stuck when suspend binlog split and ChangeEventQueue is 
> full.
> Reason is that producing binlog is too fast.  
> MySqlSplitReader#suspendBinlogReaderIfNeed will execute 
> BinlogSplitReader#stopBinlogReadTask to set 
> currentTaskRunning to be false after MysqSourceReader receives binlog split 
> update event.
> MySqlSplitReader#pollSplitRecords is executed and 
> dataIt is null to execute closeBinlogReader when currentReader is 
> BinlogSplitReader. closeBinlogReader will execute 
> statefulTaskContext.getBinaryLogClient().disconnect(), it could dead lock. 
> Because BinaryLogClient#connectLock is not release  when 
> MySqlStreamingChangeEventSource add element to full queue.
>  
> You can set StatefulTaskContext#queue to be 1 and run UT 
> NewlyAddedTableITCase#testRemoveAndAddNewTable.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-35151) Flink mysql cdc will stuck when suspend binlog split and ChangeEventQueue is full

2024-04-18 Thread Leonard Xu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838499#comment-17838499
 ] 

Leonard Xu commented on FLINK-35151:


Thanks [~pacinogong]for the report, [~ruanhang1993] Would you like to take a 
look this issue?

> Flink mysql cdc will  stuck when suspend binlog split and ChangeEventQueue is 
> full
> --
>
> Key: FLINK-35151
> URL: https://issues.apache.org/jira/browse/FLINK-35151
> Project: Flink
>  Issue Type: Bug
>  Components: Flink CDC
> Environment: I use master branch reproduce it.
>Reporter: Xin Gong
>Priority: Major
> Attachments: dumpstack.txt
>
>
> Flink mysql cdc will  stuck when suspend binlog split and ChangeEventQueue is 
> full.
> Reason is that producing binlog is too fast.  
> MySqlSplitReader#suspendBinlogReaderIfNeed will execute 
> BinlogSplitReader#stopBinlogReadTask to set 
> currentTaskRunning to be false after MysqSourceReader receives binlog split 
> update event.
> MySqlSplitReader#pollSplitRecords is executed and 
> dataIt is null to execute closeBinlogReader when currentReader is 
> BinlogSplitReader. closeBinlogReader will execute 
> statefulTaskContext.getBinaryLogClient().disconnect(), it could dead lock. 
> Because BinaryLogClient#connectLock is not release  when 
> MySqlStreamingChangeEventSource add element to full queue.
>  
> You can set StatefulTaskContext#queue to be 1 and run UT 
> NewlyAddedTableITCase#testRemoveAndAddNewTable.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-35151) Flink mysql cdc will stuck when suspend binlog split and ChangeEventQueue is full

2024-04-17 Thread Xin Gong (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838452#comment-17838452
 ] 

Xin Gong commented on FLINK-35151:
--

I get an idea to address this issue by set 

currentTaskRunning || queue.remainingCapacity() == 0 for 
BinlogSplitReader#pollSplitRecords.

> Flink mysql cdc will  stuck when suspend binlog split and ChangeEventQueue is 
> full
> --
>
> Key: FLINK-35151
> URL: https://issues.apache.org/jira/browse/FLINK-35151
> Project: Flink
>  Issue Type: Bug
>  Components: Flink CDC
> Environment: I use master branch reproduce it.
>Reporter: Xin Gong
>Priority: Major
> Attachments: dumpstack.txt
>
>
> Flink mysql cdc will  stuck when suspend binlog split and ChangeEventQueue is 
> full.
> Reason is that producing binlog is too fast.  
> MySqlSplitReader#suspendBinlogReaderIfNeed will execute 
> BinlogSplitReader#stopBinlogReadTask to set 
> currentTaskRunning to be false after MysqSourceReader receives binlog split 
> update event.
> MySqlSplitReader#pollSplitRecords is executed and 
> dataIt is null to execute closeBinlogReader when currentReader is 
> BinlogSplitReader. closeBinlogReader will execute 
> statefulTaskContext.getBinaryLogClient().disconnect(), it could dead lock. 
> Because BinaryLogClient#connectLock is not release  when 
> MySqlStreamingChangeEventSource add element to full queue.
>  
> You can set StatefulTaskContext#queue to be 1 and run UT 
> NewlyAddedTableITCase#testRemoveAndAddNewTable.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)