ssdong edited a comment on issue #2818: URL: https://github.com/apache/hudi/issues/2818#issuecomment-822660600
Hey @garyli1019, thank you for the meticulous explanation. Yep, I was trying to confirm the “expected” behavior of the incremental query. It makes sense to pull from the _existing_ active timeline, given that a bulky active timeline would introduce a file-listing issue, which is why we archive in the first place. Controlling the number of instants on the active timeline through `keep.max` is definitely one way to go.

Adding an extra configuration (defaulting to `false`) so we could tune it to pull from the archived timeline doesn't sound bad to me, though we should carefully document it and educate our users about it. Ideally, the option to read from the archived timeline applies to the case where there aren't that many archived commit files; otherwise we face the same bulky-timeline problem again. An alternative in that case, depending on the number of generated file groups and slices, is to ask the user to read _all_ existing files and run a select over them. However, if the underlying storage system allowed log appending, we wouldn't end up with many archive files, as they would be merged into previously archived files. Unfortunately, mainstream storage like S3 does not allow appending. (CMIIW) 😅

Conclusion: if we implement the configuration and document it well, we should be good. Let me know. Thanks!

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
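The routing decision discussed in the comment can be sketched roughly as follows. This is a hypothetical illustration, not Hudi's actual API: an incremental pull whose begin instant predates the oldest retained active instant can no longer be served from the active timeline alone, because the intervening commits have already been archived.

```python
def pull_source(begin_instant: str, active_instants: list[str]) -> str:
    """Illustrative helper (not a real Hudi function): decide which timeline
    an incremental query starting after `begin_instant` could be served from.

    Instant times are assumed to be Hudi-style yyyyMMddHHmmss strings, which
    sort lexicographically in chronological order.
    """
    if not active_instants:
        # Nothing retained on the active timeline at all.
        return "archived"
    oldest_active = min(active_instants)
    # If the requested start predates the oldest retained active instant,
    # some commits in the requested range have already been archived.
    if begin_instant < oldest_active:
        return "archived"
    return "active"


# Example: active timeline retains only the three most recent commits
# (i.e. a small keep.max setting).
active = ["20210419101500", "20210419102000", "20210419102500"]
print(pull_source("20210419102000", active))  # -> active
print(pull_source("20210418090000", active))  # -> archived
```

This is where the proposed configuration would kick in: with the flag off (the suggested default), the "archived" case would fail or return incomplete results; with it on, the reader would additionally list and scan the archived commit files.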