ssdong edited a comment on issue #2818: URL: https://github.com/apache/hudi/issues/2818#issuecomment-822660600
Hey @garyli1019, thank you for the meticulous explanation. Yep, I was trying to confirm the “expected” behavior of the incremental query. It makes sense to pull from the _existing_ active timeline, given that a bulky active timeline would introduce a file-listing issue, which is why we archive in the first place. Controlling the number of instants on the active timeline through `keep.max` is definitely one way to go.

Adding an extra configuration (defaulting to `false`) so we could tune it to pull from the archived timeline doesn't sound bad to me, though we should carefully document it and educate our users about it. Ideally, the option to read from the archived timeline applies to the case where there aren't that many archived commit files; otherwise we face the same bulky-timeline problem again. An alternative in that case, depending on the number of generated file groups and slices, is to ask the user to read _all_ existing files and run a select over them. However, if the underlying storage system allowed log appending, we wouldn't end up with many archive files, as they would be merged into previously archived files. Unfortunately, mainstream storage like S3 does not allow appending. (CMIIW) 😅

Conclusion: if we implement the configuration and document it well, we should be good. Let me know. Thanks!

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
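The routing decision discussed in the comment can be sketched roughly as follows. This is a hypothetical illustration, not Hudi's actual API: an incremental pull whose begin instant predates the oldest retained active instant can no longer be served from the active timeline alone, because the intervening commits have already been archived.

```python
def pull_source(begin_instant: str, active_instants: list[str]) -> str:
    """Illustrative helper (not a real Hudi function): decide which timeline
    an incremental query starting after `begin_instant` could be served from.

    Instant times are assumed to be Hudi-style yyyyMMddHHmmss strings, which
    sort lexicographically in chronological order.
    """
    if not active_instants:
        # Nothing retained on the active timeline at all.
        return "archived"
    oldest_active = min(active_instants)
    # If the requested start predates the oldest retained active instant,
    # some commits in the requested range have already been archived.
    if begin_instant < oldest_active:
        return "archived"
    return "active"


# Example: active timeline retains only the three most recent commits
# (i.e. a small keep.max setting).
active = ["20210419101500", "20210419102000", "20210419102500"]
print(pull_source("20210419102000", active))  # -> active
print(pull_source("20210418090000", active))  # -> archived
```

This is where the proposed configuration would kick in: with the flag off (the suggested default), the "archived" case would fail or return incomplete results; with it on, the reader would additionally list and scan the archived commit files.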