zhilinli123 opened a new issue #4881: URL: https://github.com/apache/hudi/issues/4881
We use Flink CDC to capture MySQL's binlog and send it to Kafka. The historical data is first imported to HDFS offline in batches using bulk_insert mode; we then enable index bootstrap and start incremental consumption of the binlog data from Kafka.

In our tests, a single table in a Kafka topic completes the index bootstrap and consumes with no duplicate data, but when multiple tables are consumed in parallel, some duplicate data appears. This problem has occurred many times: each time, the duplicates appear after the first successful index load, at the first successful checkpoint, while the job keeps running. Apart from the Hudi metadata fields, the duplicate records are identical:

<img width="1488" alt="image" src="https://user-images.githubusercontent.com/76689593/155283951-553e7bf1-0521-4476-bf02-f1c7ae5f3eaa.png">

Duplicate Hudi data found:

<img width="1258" alt="image" src="https://user-images.githubusercontent.com/76689593/155284057-e416e7b9-f46f-4a8d-b604-bc6b6af823a1.png">

Hudi write options:

```sql
with (
  'connector' = 'hudi',
  'path' = 'hdfs:///prod/xxx/member',
  'index.bootstrap.enabled' = 'true',
  'compaction.tasks' = '2',
  'read.start-commit' = 'earliest',
  'changelog.enabled' = 'true',
  'write.task.max.size' = '4096',
  'write.bucket_assign.tasks' = '1',
  'compaction.delta_seconds' = '120',
  'compaction.delta_commits' = '2',
  'compaction.trigger.strategy' = 'num_or_time',
  'compaction.max_memory' = '2048',
  'read.streaming.enabled' = 'true',
  'read.tasks' = '6',
  'write.merge.max_memory' = '1024',
  'read.streaming.check-interval' = '10',
  'table.type' = 'MERGE_ON_READ',
  'write.tasks' = '4'
);
```

hudi version: master
flink version: 1.13.2
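For reference, a minimal sketch of the kind of Flink SQL job we run against the Kafka binlog topic, using the write options above. The table name, columns, and primary key (`member_sink`, `id`, `name`, `ts`) and the source table `kafka_binlog_source` are illustrative placeholders, not our production schema:

```sql
-- Sketch of the Hudi sink table; columns and key are placeholders,
-- the WITH options are the ones listed above (abbreviated here).
CREATE TABLE member_sink (
  id BIGINT,
  name STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///prod/xxx/member',
  'index.bootstrap.enabled' = 'true',
  'table.type' = 'MERGE_ON_READ',
  'changelog.enabled' = 'true',
  'write.bucket_assign.tasks' = '1',
  'write.tasks' = '4'
);

-- The job then streams the CDC rows from Kafka into the Hudi table:
INSERT INTO member_sink
SELECT id, name, ts FROM kafka_binlog_source;
```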