Zhongmin Qiao created FLINK-35874:
-------------------------------------
Summary: Check pureBinlogPhaseTables set before call
getBinlogPosition method in BinlogSplitReader
Key: FLINK-35874
URL: https://issues.apache.org/jira/browse/FLINK-35874
Project: Flink
Issue Type: Improvement
Components: Flink CDC
Reporter: Zhongmin Qiao
Attachments: image-2024-07-22-19-26-59-158.png,
image-2024-07-22-19-27-19-366.png, image-2024-07-22-19-30-08-989.png,
image-2024-07-22-19-36-20-481.png, image-2024-07-22-19-36-40-581.png,
image-2024-07-22-19-37-35-542.png, image-2024-07-22-21-12-03-316.png
The method getBinlogPosition of RecordUtil, which is called by
BinlogSplitReader.shouldEmit, is expensive. It iterates through the
sourceOffset map of the SourceRecord, calling toString() on each value during
the iteration. It then calls the putAll method of BinlogOffsetBuilder to copy
all of those entries into the offsetMap (which involves another map traversal
and hash code computation). Despite this cost, getBinlogPosition is currently
called for every DataChangeRecord that is emitted, which reduces the data
processing efficiency of Flink CDC.
!image-2024-07-22-19-26-59-158.png|width=545,height=222!
!image-2024-07-22-19-27-19-366.png|width=545,height=119!
However, we can avoid most invocations of getBinlogPosition by performing the
pureBinlogPhaseTables.contains(tableId) check, which currently lives inside the
hasEnterPureBinlogPhase method, before calling getBinlogPosition. This way, if
the SourceRecord belongs to a table that is already in the pure binlog phase,
we can return true directly without calling the expensive getBinlogPosition
method.
Diff:
!image-2024-07-22-21-12-03-316.png|width=548,height=236!
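The reordering described above can be sketched roughly as follows. This is a
simplified, self-contained illustration and not the actual Flink CDC code: the
class PureBinlogCheckSketch, the use of String table ids and long positions,
the "pos" key, the registerHighWatermark helper, and the positionComputations
counter are all assumptions made for the example. Only the names
pureBinlogPhaseTables, getBinlogPosition, and shouldEmit come from the issue
text, and the real shouldEmit contains further checks that are omitted here.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class PureBinlogCheckSketch {
    // Tables already known to be in the pure binlog phase.
    private final Set<String> pureBinlogPhaseTables = new HashSet<>();
    // Hypothetical per-table high watermark, simplified to a long.
    private final Map<String, Long> maxSplitHighWatermarkMap = new HashMap<>();
    // Counts how often the expensive conversion runs (illustration only).
    int positionComputations = 0;

    void registerHighWatermark(String tableId, long highWatermark) {
        maxSplitHighWatermarkMap.put(tableId, highWatermark);
    }

    // Stand-in for the expensive getBinlogPosition: iterates the source
    // offset map, stringifies every value, then copies it via putAll.
    long getBinlogPosition(Map<String, ?> sourceOffset) {
        positionComputations++;
        Map<String, String> converted = new HashMap<>();
        for (Map.Entry<String, ?> entry : sourceOffset.entrySet()) {
            converted.put(entry.getKey(), String.valueOf(entry.getValue()));
        }
        Map<String, String> offsetMap = new HashMap<>();
        offsetMap.putAll(converted);
        return Long.parseLong(offsetMap.getOrDefault("pos", "0"));
    }

    boolean shouldEmit(String tableId, Map<String, ?> sourceOffset) {
        // Cheap set lookup first: skip the position computation entirely
        // for tables that already entered the pure binlog phase.
        if (pureBinlogPhaseTables.contains(tableId)) {
            return true;
        }
        long position = getBinlogPosition(sourceOffset);
        Long highWatermark = maxSplitHighWatermarkMap.get(tableId);
        if (highWatermark != null && position >= highWatermark) {
            pureBinlogPhaseTables.add(tableId);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        PureBinlogCheckSketch reader = new PureBinlogCheckSketch();
        reader.registerHighWatermark("db.orders", 100L);
        System.out.println(reader.shouldEmit("db.orders", Map.of("pos", 50L)));
        System.out.println(reader.shouldEmit("db.orders", Map.of("pos", 150L)));
        System.out.println(reader.shouldEmit("db.orders", Map.of("pos", 200L)));
        System.out.println("position computations: " + reader.positionComputations);
    }
}
```

In this sketch, once a table has been added to pureBinlogPhaseTables, later
records for that table take the early return and never reach
getBinlogPosition, which is the saving the issue is after.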
--
This message was sent by Atlassian Jira
(v8.20.10#820010)