LiuZeshan created FLINK-38183:
---------------------------------
Summary: Data loss when cdc reading mysql that has out of order
GTID
Key: FLINK-38183
URL: https://issues.apache.org/jira/browse/FLINK-38183
Project: Flink
Issue Type: Bug
Components: Flink CDC
Affects Versions: 3.0.0
Environment: Flink-CDC: 3.5-SNAPSHOT
Flink: 1.20.1
Reporter: LiuZeshan
By design (see
[https://github.com/apache/flink-cdc/pull/2220|http://example.com]), CDC only
considers the maximum GTID position and starts reading from it. For example,
when asked to read from GTID offset 1-7:9-10, it automatically adjusts the
offset to 1-10. GTID 8 is therefore skipped by mistake and its data is lost.
The loss is especially severe when GTID 8 is a large transaction. We have hit
this problem many times in production.
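The adjustment described above can be illustrated with a small Python sketch. It is not the Flink CDC code; the helper names (parse_intervals, collapse_to_max, gaps) are hypothetical, and only the interval part of a single-UUID GTID set is modelled.

```python
def parse_intervals(spec):
    """Parse the interval part of a GTID set, e.g. '1-7:9-10' -> [(1, 7), (9, 10)]."""
    intervals = []
    for part in spec.split(":"):
        start, _, end = part.partition("-")
        intervals.append((int(start), int(end or start)))
    return sorted(intervals)


def collapse_to_max(intervals):
    """Keep only the highest transaction number (the behaviour the issue describes)."""
    lowest = min(start for start, _ in intervals)
    highest = max(end for _, end in intervals)
    return [(lowest, highest)]


def gaps(intervals):
    """Transaction numbers missing between consecutive intervals."""
    missing = []
    for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
        missing.extend(range(prev_end + 1, next_start))
    return missing


checkpointed = parse_intervals("1-7:9-10")
print(collapse_to_max(checkpointed))  # [(1, 10)]
print(gaps(checkpointed))             # [8]
```

Every number returned by gaps() is a transaction that the collapsed offset treats as already read although it never was.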
MySQL 5.7+ supports parallel replication based on group commit (LOGICAL_CLOCK).
Conflict-free transactions are distributed by the SQL thread (coordinator) of
the slave to multiple worker threads for concurrent execution. Although the
master generates contiguous GTIDs in commit order (such as A:1-100), the worker
threads on the slave may commit transactions out of order. When CDC reads from
a MySQL slave, it may therefore encounter the following GTID order. The same
scenario can also be constructed manually by setting GTID_NEXT.
{code:sql}
SET @@SESSION.GTID_NEXT='XXX:1';
INSERT ...;
SET @@SESSION.GTID_NEXT='XXX:2';
INSERT ...;
...
SET @@SESSION.GTID_NEXT='XXX:7';
INSERT ...;
SET @@SESSION.GTID_NEXT='XXX:9';
INSERT ...;
SET @@SESSION.GTID_NEXT='XXX:10';
INSERT ...;
SET @@SESSION.GTID_NEXT='XXX:8';
BEGIN;
INSERT ...;
...
INSERT ...; -- (the 1 millionth DML; a checkpoint completes at this position)
...
INSERT ...; -- (the 2 millionth DML)
COMMIT;
SET @@SESSION.GTID_NEXT='XXX:11';
INSERT ...; {code}
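To see how this commit order produces a GTID set with a hole, here is a minimal Python sketch. The function format_executed is a hypothetical helper, not a MySQL or Flink CDC API; it folds the transaction numbers committed so far into MySQL-style intervals after each commit.

```python
def format_executed(committed):
    """Fold committed transaction numbers into MySQL-style intervals."""
    numbers = sorted(committed)
    parts = []
    i = 0
    while i < len(numbers):
        j = i
        # Extend the run while the next number is contiguous.
        while j + 1 < len(numbers) and numbers[j + 1] == numbers[j] + 1:
            j += 1
        parts.append(str(numbers[i]) if i == j else f"{numbers[i]}-{numbers[j]}")
        i = j + 1
    return ":".join(parts)


# Commit order from the script above: 1..7, then 9, 10, then 8, then 11.
commit_order = [1, 2, 3, 4, 5, 6, 7, 9, 10, 8, 11]
committed = set()
history = []
for number in commit_order:
    committed.add(number)
    history.append(format_executed(committed))

print(history[8])   # 1-7:9-10  (state while transaction 8 is still in flight)
print(history[9])   # 1-10      (only valid once transaction 8 has committed)
```

The set 1-7:9-10 and the set 1-10 are different states; treating the first as the second is what drops transaction 8.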
The transaction with GTID 8 contains 2 million DML events. After 1 million of
them have been read, a checkpoint is triggered and completes. The recorded GTID
offset is 1-7:9-10 and the number of events to skip is 1 million, as shown below.
{code:java}
offset={transaction_id=null, ts_sec=1754145492, file=mysql-bin.000190,
pos=1443601, kind=SPECIFIC, gtids=xxx:1-7:9-10, row=3, event=1000000,
server_id=123} {code}
The job is then restarted and recovers from this checkpoint. By design, CDC
automatically adjusts the GTID offset to 1-10 and still skips 1 million events.
As a result, the 1 million unread events of GTID 8 are lost, along with the
data contained in the first 1 million events starting from GTID 11.
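The recovery arithmetic can be checked with a simplified Python model. This is a sketch of the behaviour described in this issue, not the connector's implementation. The numbers are scaled down from millions to tens: transaction 8 has 20 events, the checkpoint is taken after 10 of them, and transactions 11 to 30 are assumed (hypothetically) to hold one event each. The replay function and the binlog layout are illustrative assumptions.

```python
def replay(binlog, executed, events_to_skip):
    """Return {txn: events delivered} after a restart.

    Transactions in `executed` are skipped entirely; the first
    `events_to_skip` events of whatever remains are then dropped.
    """
    delivered = {}
    for txn, n_events in binlog:
        if txn in executed:
            continue
        dropped = min(events_to_skip, n_events)
        events_to_skip -= dropped
        delivered[txn] = n_events - dropped
    return delivered


# Binlog order on the slave: 1..7, 9, 10, then the large transaction 8, then 11..30.
binlog = [(n, 1) for n in range(1, 8)] + [(9, 1), (10, 1), (8, 20)]
binlog += [(n, 1) for n in range(11, 31)]

checkpoint_set = set(range(1, 8)) | {9, 10}   # 1-7:9-10
collapsed_set = set(range(1, 11))             # adjusted to 1-10

expected = replay(binlog, checkpoint_set, events_to_skip=10)
actual = replay(binlog, collapsed_set, events_to_skip=10)

lost = sum(count - actual.get(txn, 0) for txn, count in expected.items())
print(expected[8], actual.get(8, 0))  # 10 0
print(lost)                           # 20
```

In this model, half of the lost events are the unread remainder of transaction 8 and the other half come from later transactions that the stale skip counter consumes.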