[ 
https://issues.apache.org/jira/browse/FLINK-39824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ran Tao updated FLINK-39824:
----------------------------
    Description: 
*Description*

When the MySQL CDC pipeline source synchronizes a large number of tables, 
TaskManager CPU usage can become very high and may stay close to 100%.

CPU profiling shows that most CPU time is spent in Java regex matching during 
table filter evaluation:
{code:java}
java.util.regex.Matcher.match
java.util.regex.Matcher.matches
io.debezium.function.Predicates.lambda$matchedByPattern$5
io.debezium.relational.Selectors$TableSelectionPredicateBuilder...
io.debezium.relational.RelationalTableFilters...
io.debezium.connector.mysql.MySqlStreamingChangeEventSource.informAboutUnknownTableIfRequired
io.debezium.connector.mysql.MySqlStreamingChangeEventSource.handleUpdateTableMetadata
 {code}
When many tables are synchronized, the same TableId can be checked repeatedly 
during binlog event processing. Each check currently re-evaluates Debezium's 
include/exclude table regex predicates. If the table list pattern is large or 
complex, this repeated regex evaluation may dominate CPU usage.

*Expected Behavior*

MySQL CDC source should avoid repeatedly evaluating expensive regex table 
filters for the same TableId. Once the include/exclude result for a table is 
known, subsequent checks for the same table should reuse the result.

*Actual Behavior*

The include/exclude table filter result is recomputed repeatedly through regex 
matching, causing high CPU usage in large-scale table synchronization jobs.

*Impact*

This issue affects MySQL CDC jobs that synchronize many tables. It can cause:
 - TaskManager CPU usage close to 100%
 - Lower binlog processing throughput
 - Increased CDC event latency
 - Poor scalability for large table-list configurations

*Proposed Fix*

Cache the table filter result by TableId in MySqlSourceConfig.

The cached filter should preserve the existing behavior. A table is captured 
only if it is included by the Debezium table filter AND, if exclude-table-list 
is configured, not matched by it.

This avoids repeated regex matching for the same table while keeping the 
original include/exclude semantics unchanged.
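
A minimal sketch of the caching idea follows. It is not the actual patch: 
CachedTableFilter is a hypothetical helper, plain strings stand in for 
Debezium's TableId, and the include/exclude patterns are made-up examples. It 
only illustrates memoizing the combined include/exclude decision per table so 
that the regexes run once per distinct table.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/**
 * Hypothetical helper (not a Flink CDC or Debezium class): memoizes the result
 * of an expensive table filter per table identifier.
 */
final class CachedTableFilter<T> implements Predicate<T> {
    private final Predicate<T> delegate;
    private final Map<T, Boolean> cache = new ConcurrentHashMap<>();

    CachedTableFilter(Predicate<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean test(T tableId) {
        // The delegate runs only on a cache miss; later calls for the same
        // key return the stored result without touching the regexes.
        return cache.computeIfAbsent(tableId, delegate::test);
    }

    public static void main(String[] args) {
        // Assumed example patterns; a real job would derive them from its
        // table list options.
        Pattern include = Pattern.compile("app_db\\.orders_\\d+");
        Pattern exclude = Pattern.compile("app_db\\.orders_0");
        AtomicInteger regexEvaluations = new AtomicInteger();

        // Existing semantics: included AND not matched by the exclude pattern.
        Predicate<String> uncached = id -> {
            regexEvaluations.incrementAndGet();
            return include.matcher(id).matches() && !exclude.matcher(id).matches();
        };
        CachedTableFilter<String> filter = new CachedTableFilter<>(uncached);

        // Simulate many binlog events that touch the same two tables.
        for (int i = 0; i < 1000; i++) {
            filter.test("app_db.orders_1");
            filter.test("app_db.orders_0");
        }

        System.out.println(filter.test("app_db.orders_1")); // true
        System.out.println(filter.test("app_db.orders_0")); // false
        System.out.println(regexEvaluations.get());         // 2
    }
}
```

The sketch assumes the filter configuration does not change while the cache is 
alive, and the map is unbounded, so a job that sees a continually growing set 
of table names may want a size limit.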


> [mysql-cdc] High CPU usage caused by repeated regex table filtering in large 
> table synchronization
> --------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39824
>                 URL: https://issues.apache.org/jira/browse/FLINK-39824
>             Project: Flink
>          Issue Type: Bug
>          Components: Flink CDC
>            Reporter: Ran Tao
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: 20260602-203520.jpeg, 20260602-203545.jpg
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
