Yeah, you are right. For a MOR table, when the CDC log is enabled, the
CDC logs are generated during compaction, so there are two choices for
the reader:

1. read the changes from the change log, which has a high delay
because these logs are only generated during compaction;
2. or infer the changes by itself, which has a much shorter delay but
more resource cost on the reader side.

Options 1 and 2 are mutually exclusive; you can only enable one of them.
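A minimal sketch of how a Flink job would pick between the two modes.
The config keys ("cdc.enabled", "read.cdc.from.changelog") are the ones
I believe FlinkOptions defines; the table path is made up for
illustration:

    import org.apache.flink.configuration.Configuration;

    public class CdcReadModeSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setString("path", "file:///tmp/hudi_table"); // hypothetical path
        conf.setBoolean("cdc.enabled", true);             // turn on CDC read
        // Choice 1: consume the CDC files written at compaction time.
        // Higher delay (changes surface only after compaction), cheaper reads.
        conf.setBoolean("read.cdc.from.changelog", true);
        // Choice 2 (mutually exclusive): set the flag to false instead and
        // the reader infers deltas from base + log files -- lower delay,
        // higher reader-side cost.
        // conf.setBoolean("read.cdc.from.changelog", false);
      }
    }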

The reason we add some tricky logic in HoodieCDCExtractor is that when
we choose 1, the timeline needs to be filtered to only include
compaction commits, while when 2 is enabled, the timeline needs to be
filtered to exclude compaction commits.
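Roughly like the following sketch. This is a paraphrase, not the actual
HoodieCDCExtractor code; Instant and isCompaction here are stand-ins
for the real Hudi types and checks:

    import java.util.List;
    import java.util.stream.Collectors;

    public class TimelineFilterSketch {
      // Stand-in for Hudi's HoodieInstant.
      record Instant(String timestamp, String action) {}

      // Stand-in predicate; the real compaction check is more involved.
      static boolean isCompaction(Instant i) {
        return "compaction".equals(i.action());
      }

      static List<Instant> filterTimeline(List<Instant> timeline,
                                          boolean readFromChangelog) {
        return timeline.stream()
            // choice 1: keep only compaction commits, where the CDC files live;
            // choice 2: drop compaction commits (compaction rewrites data
            // without logically changing it) and infer from the rest.
            .filter(i -> readFromChangelog == isCompaction(i))
            .collect(Collectors.toList());
      }
    }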

Best,
Danny

Jack Vanlightly <vanligh...@apache.org> wrote on Mon, Aug 19, 2024 at 18:25:
>
> Hi all,
>
> I'm trying to understand the CDC read process that Flink uses in Hudi.
> According to the Flink option READ_CDC_FROM_CHANGELOG (
> https://github.com/apache/hudi/blob/db5c2d97dc94122ebd63e6200858eabc4b119178/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java#L365),
> when true, Flink should only read from CDC files; when false, it must infer
> the deltas from the base and log files. But reading the code, the
> HoodieCDCExtractor creates file splits for both CDC files and for the
> inference cases (
> https://github.com/apache/hudi/blob/db5c2d97dc94122ebd63e6200858eabc4b119178/hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java#L234)
> when READ_CDC_FROM_CHANGELOG = true.
>
> This is confusing as it seems you would do one or the other but not both.
> Why go through all the work of inferring deltas from base and log files
> when perhaps a couple of commits ahead there is a compaction instant that
> has it all precomputed?
>
> Thanks
> Jack
