Yeah, you are right. For MOR tables, when the CDC log is enabled (these logs are generated during compaction), there are two choices for the reader:
1. Read the changes from the change log, which has a high TTL delay because these logs are only generated during compaction.
2. Infer the changes on the reader side, which has a much shorter TTL delay but costs more resources on the reader.

Options 1 and 2 are mutually exclusive; you can only enable one of them. The reason we add some tricky logic in HoodieCDCExtractor is that when we choose 1, the timeline needs to be filtered to include only compaction commits, while when 2 is enabled, the timeline needs to be filtered to exclude compaction commits.

Best,
Danny

Jack Vanlightly <vanligh...@apache.org> wrote on Monday, August 19, 2024 at 18:25:

> Hi all,
>
> I'm trying to understand the CDC read process that Flink uses in Hudi.
> According to the Flink option, READ_CDC_FROM_CHANGELOG (
> https://github.com/apache/hudi/blob/db5c2d97dc94122ebd63e6200858eabc4b119178/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java#L365),
> when true, Flink should only read from CDC files; when false, it must infer
> the deltas from the base and log files. But reading the code, the
> HoodieCDCExtractor creates file splits for both CDC files and for the
> inference cases (
> https://github.com/apache/hudi/blob/db5c2d97dc94122ebd63e6200858eabc4b119178/hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java#L234)
> when READ_CDC_FROM_CHANGELOG = true.
>
> This is confusing as it seems you would do one or the other but not both.
> Why go through all the work of inferring deltas from base and log files
> when perhaps a couple of commits ahead there is a compact instant that has
> it all precomputed?
>
> Thanks
> Jack
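[Editor's note] The mutually exclusive timeline filtering described in the reply can be sketched roughly as follows. This is an illustrative sketch only, not the actual HoodieCDCExtractor implementation; the function name, the `(instant_time, action)` tuple shape, and the action strings are assumptions made for the example.

```python
# Illustrative sketch (NOT Hudi's real code) of the filtering rule:
# the two CDC read modes keep opposite subsets of the timeline.

COMPACTION = "compaction"  # assumed action label for compaction instants

def filter_timeline(instants, read_cdc_from_changelog):
    """Filter a list of (instant_time, action) tuples.

    Mode 1 (read from change log): CDC logs are only written during
    compaction, so keep only compaction instants.
    Mode 2 (infer changes on the reader side): compaction rewrites
    existing data without changing it, so exclude compaction instants.
    """
    if read_cdc_from_changelog:
        return [i for i in instants if i[1] == COMPACTION]
    return [i for i in instants if i[1] != COMPACTION]
```

For example, with a timeline of two delta commits and one compaction, mode 1 keeps only the compaction instant while mode 2 keeps only the delta commits, which is why enabling both at once would make no sense.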