[ https://issues.apache.org/jira/browse/FLINK-36798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yanquan Lv updated FLINK-36798:
-------------------------------
Fix Version/s: (was: cdc-3.5.0)
> Improve data processing speed during the phase from snapshot to incremental
> phase
> ---------------------------------------------------------------------------------
>
> Key: FLINK-36798
> URL: https://issues.apache.org/jira/browse/FLINK-36798
> Project: Flink
> Issue Type: Improvement
> Components: Flink CDC
> Affects Versions: cdc-3.1.0, cdc-3.2.0, cdc-3.1.1
> Reporter: Yanquan Lv
> Priority: Major
>
> During the transition from the snapshot phase to the incremental phase, each
> input record must be compared against all finished snapshot splits to find
> the binlog offset that decides whether the record should be emitted. This
> lookup is `O(n)` in the number of finished splits, which makes it very
> time-consuming.
> We can speed up this lookup in the following ways:
> 1. For numeric fields, we can directly calculate which chunk a record belongs
> to from its primary key and the chunk size. This lookup is `O(1)`.
> 2. For non-numeric fields, we can use binary search to find the chunk to
> which the record belongs. This lookup is `O(log n)`.
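
The two lookups proposed above could look roughly like the sketch below. It is illustrative only and is not taken from the Flink CDC code base: the class `ChunkLookup`, both method names, and all parameters are hypothetical. It assumes numeric chunks are evenly sized starting at `min`, that the first and last chunks are open-ended, and that the split points for non-numeric keys are distinct and sorted in the same order Java uses to compare them (which may differ from the database collation).

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of the Flink CDC API. */
final class ChunkLookup {

    private ChunkLookup() {}

    /**
     * O(1) lookup for numeric keys. Assumes chunk i covers
     * [min + i * chunkSize, min + (i + 1) * chunkSize), that the first and
     * last chunks are open-ended, and that key - min does not overflow.
     */
    static int numericChunkIndex(long key, long min, long chunkSize, int chunkCount) {
        if (chunkSize <= 0 || chunkCount <= 0) {
            throw new IllegalArgumentException("chunkSize and chunkCount must be positive");
        }
        if (key < min) {
            return 0; // keys below the first boundary fall into the first chunk
        }
        long index = (key - min) / chunkSize;
        // keys beyond the last boundary fall into the last chunk
        return (int) Math.min(index, chunkCount - 1L);
    }

    /**
     * O(log n) lookup for non-numeric keys. splitPoints holds the sorted,
     * distinct lower bounds of chunks 1..n; chunk 0 covers everything below
     * splitPoints[0]. The result is the number of split points <= key.
     */
    static <T extends Comparable<? super T>> int binarySearchChunkIndex(T[] splitPoints, T key) {
        int pos = Arrays.binarySearch(splitPoints, key);
        if (pos >= 0) {
            return pos + 1; // key equals a split point, so it starts the next chunk
        }
        return -(pos + 1); // insertion point = number of split points below key
    }
}
```

For example, with `min = 0`, `chunkSize = 10`, and five chunks, key `25` maps to chunk 2. With split points `{"g", "n", "t"}`, key `"h"` maps to chunk 1. A real implementation would still need to map the chunk index to the finished split's high-watermark offset before deciding whether to emit the record.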
--
This message was sent by Atlassian Jira
(v8.20.10#820010)