LiuZeshan created FLINK-38183:
---------------------------------

             Summary: Data loss when cdc reading mysql that has out of order 
GTID
                 Key: FLINK-38183
                 URL: https://issues.apache.org/jira/browse/FLINK-38183
             Project: Flink
          Issue Type: Bug
          Components: Flink CDC
    Affects Versions: 3.0.0
         Environment: Flink-CDC: 3.5-SNAPSHOT

Flink:1.20.1

 
            Reporter: LiuZeshan


As designed in 
[https://github.com/apache/flink-cdc/pull/2220|http://example.com], CDC only 
considers the maximum GTID position and starts from it. For example, when 
restoring from the GTID offset 1-7:9-10, it automatically adjusts the offset to 
1-10. GTID 8 is therefore wrongly treated as already consumed and is skipped, 
which loses data. The loss is especially severe when GTID 8 is a large 
transaction. We have encountered this problem many times in the production 
environment.
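
The effect of the adjustment can be illustrated with a small sketch. This is not the Flink CDC implementation: `parse_intervals`, `collapse_to_max` and `skipped_gaps` are hypothetical helpers that only model the interval arithmetic described above, for a single server UUID.

```python
def parse_intervals(spec):
    """Parse '1-7:9-10' into [(1, 7), (9, 10)]; a bare number 'n' means (n, n)."""
    intervals = []
    for part in spec.split(":"):
        lo, _, hi = part.partition("-")
        intervals.append((int(lo), int(hi or lo)))
    return sorted(intervals)


def collapse_to_max(intervals):
    """Model the adjustment described above: keep only the range 1..max."""
    return [(1, max(hi for _, hi in intervals))]


def skipped_gaps(intervals):
    """Transaction numbers between 1 and max that the recorded set did not contain."""
    seen = set()
    for lo, hi in intervals:
        seen.update(range(lo, hi + 1))
    top = max(hi for _, hi in intervals)
    return [n for n in range(1, top + 1) if n not in seen]


recorded = parse_intervals("1-7:9-10")
print(collapse_to_max(recorded))  # [(1, 10)]
print(skipped_gaps(recorded))     # [8]
```

Every number returned by `skipped_gaps` is a transaction that the collapsed offset marks as consumed even though it was never fully read.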


MySQL 5.7+ supports parallel replication based on group commit (LOGICAL_CLOCK). 
Conflict-free transactions are distributed by the SQL thread (Coordinator) of 
the replica to multiple worker threads for concurrent execution. Although the 
primary generates continuous GTIDs in commit order (such as A:1-100), the 
worker threads of the replica may commit transactions out of order. When CDC 
reads from a MySQL replica, we may encounter the following GTID order. This 
scenario can also be constructed manually by setting the GTID explicitly.
{code:sql}
SET @@SESSION.GTID_NEXT='XXX:1';
INSERT ...;
SET @@SESSION.GTID_NEXT='XXX:2';
INSERT ...;
...
SET @@SESSION.GTID_NEXT='XXX:7';
INSERT ...;
SET @@SESSION.GTID_NEXT='XXX:9';
INSERT ...;
SET @@SESSION.GTID_NEXT='XXX:10';
INSERT ...;
SET @@SESSION.GTID_NEXT='XXX:8';
BEGIN;
INSERT ...;
... 
INSERT ...; -- the 1,000,000th DML statement; a checkpoint completes here
...
INSERT ...; -- the 2,000,000th DML statement
COMMIT;
SET @@SESSION.GTID_NEXT='XXX:11';
INSERT ...; {code}
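
The commit order in the statements above leaves a gap in the executed GTID set until transaction 8 commits. A minimal sketch of how the set evolves, assuming a single server UUID; `format_gtid_set` is a hypothetical helper written for this illustration, not a MySQL or Flink CDC API:

```python
def format_gtid_set(numbers):
    """Render committed transaction numbers as MySQL-style intervals, e.g. '1-7:9-10'."""
    parts = []
    start = prev = None
    for n in sorted(numbers):
        if start is None:
            start = prev = n
        elif n == prev + 1:
            prev = n  # extend the current interval
        else:
            parts.append((start, prev))  # close the interval and open a new one
            start = prev = n
    if start is not None:
        parts.append((start, prev))
    return ":".join(str(a) if a == b else f"{a}-{b}" for a, b in parts)


# Commit order from the statements above: 8 commits after 9 and 10.
commit_order = [1, 2, 3, 4, 5, 6, 7, 9, 10, 8, 11]
committed = set()
history = []
for n in commit_order:
    committed.add(n)
    history.append(format_gtid_set(committed))

print(history[8])  # 1-7:9-10  (state while transaction 8 is still in progress)
print(history[9])  # 1-10      (state once transaction 8 has committed)
```

A checkpoint taken while transaction 8 is still being read therefore records the gapped set, not the contiguous one.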
The transaction with GTID 8 contains 2 million DML events. After 1 million of 
them have been read, a checkpoint is triggered and completes. The recorded GTID 
offset is 1-7:9-10 and the number of events to skip is 1 million, as shown below.
{code:java}
offset={transaction_id=null, ts_sec=1754145492, file=mysql-bin.000190, 
pos=1443601, kind=SPECIFIC, gtids=xxx:1-7:9-10, row=3, event=1000000, 
server_id=123} {code}
The job is then restarted and restored from this checkpoint. By the CDC design 
described above, the offset is automatically adjusted to 1-10, so reading 
resumes after GTID 10 while the reader still skips 1 million events. As a 
result, the 1 million unread events of GTID 8 are lost, and so is the data in 
the first 1 million events starting from GTID 11.
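
The restart behaviour can be reproduced at a smaller scale with a toy model. The sketch below is an assumption-laden simulation, not Flink CDC code: `replay` is a hypothetical function that treats the binlog as a list of rows, drops every transaction the reader claims to have consumed, and then skips the recorded number of events.

```python
def replay(binlog, consumed_gtids, skip_events):
    """Return the rows delivered after a restart.

    binlog: list of (gtid_number, row) in binlog order.
    consumed_gtids: transaction numbers the reader claims to have applied.
    skip_events: number of leading events dropped after the restart.
    """
    pending = [(g, r) for g, r in binlog if g not in consumed_gtids]
    return pending[skip_events:]


# Scaled-down binlog tail: transaction 8 has 4 rows, transaction 11 has 3 rows.
binlog = [(8, f"t8-row{i}") for i in range(1, 5)]
binlog += [(11, f"t11-row{i}") for i in range(1, 4)]
skip = 2  # the checkpoint completed after 2 rows of transaction 8

gapped = set(range(1, 8)) | {9, 10}  # recorded offset 1-7:9-10
collapsed = set(range(1, 11))        # adjusted offset 1-10

expected = replay(binlog, gapped, skip)   # resumes inside transaction 8
actual = replay(binlog, collapsed, skip)  # resumes after GTID 10, then skips 2 events
lost = [row for row in expected if row not in actual]

print(actual)  # [(11, 't11-row3')]
print(lost)    # [(8, 't8-row3'), (8, 't8-row4'), (11, 't11-row1'), (11, 't11-row2')]
```

With the real numbers, the two lost groups correspond to the 1 million unread events of GTID 8 and the first 1 million events from GTID 11 onward.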



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
