vinothchandar commented on issue #8222:
URL: https://github.com/apache/hudi/issues/8222#issuecomment-1496165517

   @parisni To clarify the semantics a bit: the incremental query provides all the 
records that changed between a start and end commit time. If there are 
multiple writes (CoW) or multiple compactions (MoR) between queries, you would 
only see the latest record (per the precombine logic) up to the compacted point, 
then log records after that. This is similar to the Kafka compacted topic 
[design](https://kafka.apache.org/documentation/#compaction), which bounds the 
"catch up" time for downstream jobs. If one wants every change record, i.e., 
multiple rows in the incremental query output per key for each change, that's what 
the CDC feature solves (right now it's supported for CoW).
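   To make the compaction analogy concrete, here is a minimal sketch (not Hudi code; record shapes and field names like `precombine` are illustrative) of why a reader that starts before the compacted point sees only one row per key:

   ```python
   def compact(records):
       """Collapse to one record per key, keeping the largest precombine value.

       This mirrors the effect described above: intermediate changes to a key
       that happened before the compaction point are no longer individually
       visible -- only the latest surviving row is.
       """
       latest = {}
       for rec in records:
           key = rec["key"]
           if key not in latest or rec["precombine"] > latest[key]["precombine"]:
               latest[key] = rec
       return list(latest.values())
   ```

   So two successive writes to the same key before compaction yield a single row in the incremental output, which is exactly why per-change history needs the CDC feature instead.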
   
   As for this problem, the issue is that reads are served out of the logs based 
on the commit time range, which is fine as long as we are just returning the 
latest committed records. In this case, there is a precombine field to respect, 
and that's not handled yet. The solution would be to perform a base + log merge 
first (which will consider the precombine field), then filter for the commit 
range (this increases the cost of the query, but gives the same semantics). 
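   A toy sketch of that proposed order of operations (again illustrative Python, not Hudi internals; the `base`/`logs` record shapes are assumptions) shows why merging before filtering matters:

   ```python
   def merge_then_filter(base, logs, begin, end):
       """Merge base + log rows per key by precombine value, THEN filter the
       commit range. Costlier than reading logs alone, but a stale log row
       that loses the precombine merge can no longer leak into the output."""
       merged = {}
       for rec in base + logs:
           key = rec["key"]
           if key not in merged or rec["precombine"] > merged[key]["precombine"]:
               merged[key] = rec
       return [r for r in merged.values() if begin < r["commit"] <= end]
   ```

   For example, if the base file already holds a record with a higher precombine value than a later log entry for the same key, a log-only read in the commit range would return the stale log row, while merge-then-filter correctly suppresses it.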
   
   How much of a blocker is this for your project? That will help us prioritize 
it. 
   
   
    

