vinothchandar commented on issue #8222: URL: https://github.com/apache/hudi/issues/8222#issuecomment-1496165517
@parisni To clarify the semantics a bit: the incremental query provides all the records that changed between a start and end commit time. If there are multiple writes (CoW) or multiple compactions (MoR) between queries, you would only see the latest record (per the precombine logic) up to the compacted point, then the log records after that. This is similar to Kafka's compacted topic [design](https://kafka.apache.org/documentation/#compaction), which bounds the "catch up" time for downstream jobs. If you want every change record, i.e. multiple rows per key in the incremental query output, one for each change, that's what the CDC feature solves (right now it's supported for CoW).

As for this problem: the issue is that reads are served out of the logs based on the commit time range, which is fine as long as we are just returning the latest committed records. In this case, though, there is a precombine field to respect, and that's not handled yet. The solution would be to perform a base + log merge first (which considers the precombine field), then filter for the commit range. This increases the cost of the query, but gives you the same semantics.

How much of a blocker is this for your project? Knowing that will help us prioritize.
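To make the merge-then-filter idea concrete, here is a small in-memory sketch of the proposed semantics. This is purely illustrative, not Hudi's actual merge code or API: the record shape and field names (`key`, `precombine`, `commit_time`, `value`) are hypothetical stand-ins for the record key, the precombine field, and the commit instant.

```python
def merge_then_filter(base, log, begin_commit, end_commit):
    """Merge base + log records per key by the precombine field first,
    then apply the incremental commit-range filter to the merged result.

    base/log: lists of dicts with 'key', 'precombine', 'commit_time', 'value'.
    """
    merged = {}
    for rec in list(base) + list(log):
        cur = merged.get(rec["key"])
        # Precombine logic: keep the record with the larger precombine value.
        if cur is None or rec["precombine"] > cur["precombine"]:
            merged[rec["key"]] = rec
    # Only after merging do we filter by the (begin, end] commit range.
    return [r for r in merged.values()
            if begin_commit < r["commit_time"] <= end_commit]


# A later log write that is stale per the precombine field:
base = [{"key": "k1", "precombine": 5, "commit_time": "001", "value": "v1"}]
log = [{"key": "k1", "precombine": 3, "commit_time": "002", "value": "v2"}]

# Filtering the logs alone by commit range would surface the stale "v2";
# merging first lets the precombine field reject it.
print(merge_then_filter(base, log, "000", "002"))
```

The example shows why serving reads purely from the logs by commit range can violate precombine semantics: the log record at commit `002` falls inside the range, but the precombine field says the base record should win.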