Hi all,

I'm trying to understand the CDC read process that Flink uses in Hudi.
According to the Flink option READ_CDC_FROM_CHANGELOG (
https://github.com/apache/hudi/blob/db5c2d97dc94122ebd63e6200858eabc4b119178/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java#L365),
when true, Flink should read only from CDC files; when false, it must infer
the deltas from the base and log files. But reading the code,
HoodieCDCExtractor creates file splits both for CDC files and for the
inference cases (
https://github.com/apache/hudi/blob/db5c2d97dc94122ebd63e6200858eabc4b119178/hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java#L234)
even when READ_CDC_FROM_CHANGELOG is true.

This is confusing, as it seems you would do one or the other but not both.
Why go through all the work of inferring deltas from the base and log files
when, perhaps a couple of commits ahead, there is a compaction instant that
has it all precomputed?
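To make my mental model concrete, here is a rough sketch (in Java, but with names I made up, not Hudi's actual API) of the either/or branching I expected per file group, given my reading of the option's documentation:

```java
// Hypothetical sketch of the per-file-group decision I expected.
// CdcSplitPlanner and chooseSource are my own illustrative names,
// not part of the Hudi codebase.
enum CdcSource { CDC_LOG_FILES, INFER_FROM_BASE_AND_LOG }

class CdcSplitPlanner {
    static CdcSource chooseSource(boolean readCdcFromChangelog,
                                  boolean cdcLogFilePresent) {
        // Expected behavior: when the flag is true and a precomputed CDC
        // log file exists for the file group, read it exclusively; only
        // fall back to inference when no changelog is available.
        if (readCdcFromChangelog && cdcLogFilePresent) {
            return CdcSource.CDC_LOG_FILES;
        }
        return CdcSource.INFER_FROM_BASE_AND_LOG;
    }
}
```

Whereas the extractor linked above appears to plan splits of both kinds under the same flag, which is what I'd like to understand.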

Thanks
Jack
