voonhous commented on PR #9922: URL: https://github.com/apache/hudi/pull/9922#issuecomment-1782187627
> Thanks for the fix. From a high level, I think we should avoid relying on Spark mechanisms to add any rollback/cleaning improvement here; it's hacky to maintain and it is not tenable for all engines.

Agreed. However, if we want to address this in an engine-agnostic way, we would need a mechanism for ignoring corrupted files created by zombie tasks, which is not trivial to implement at this stage. In the most vanilla deployment of Hudi (no MDT), a "VALID" base file is basically the file with the largest timestamp in a file group that is not part of any replacecommit. To change this at a high level, we would need to modify the heuristics that determine what counts as a "VALID" base file.
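To make the heuristic concrete, here is a minimal sketch of the selection rule described in this comment. It is not Hudi's actual implementation: `BaseFile`, `latest_valid_base_files`, the file paths, and the timestamps are all hypothetical names and values chosen for illustration, and the model deliberately ignores everything except "largest timestamp per file group, skipping replaced file groups".

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class BaseFile:
    """Hypothetical, simplified view of a base file (not Hudi's real class)."""
    file_group_id: str
    instant_time: str  # fixed-width timestamp string, so string order matches time order
    path: str


def latest_valid_base_files(
    files: Iterable[BaseFile], replaced_file_groups: Set[str]
) -> Dict[str, BaseFile]:
    """Pick, per file group, the base file with the largest timestamp.

    File groups listed in `replaced_file_groups` (i.e. assumed to be covered
    by a replacecommit) are skipped entirely.
    """
    latest: Dict[str, BaseFile] = {}
    for f in files:
        if f.file_group_id in replaced_file_groups:
            continue
        current = latest.get(f.file_group_id)
        if current is None or f.instant_time > current.instant_time:
            latest[f.file_group_id] = f
    return latest


# Illustrative data: the second fg-1 file stands in for a late write
# from a zombie task; fg-2 is assumed to be replaced.
files = [
    BaseFile("fg-1", "20231026100000", "fg-1_001.parquet"),
    BaseFile("fg-1", "20231027093000", "fg-1_002.parquet"),
    BaseFile("fg-2", "20231025080000", "fg-2_001.parquet"),
]
result = latest_valid_base_files(files, replaced_file_groups={"fg-2"})
```

In this sketch the later `fg-1` file wins purely because of its timestamp, so a corrupted file written late by a zombie task would be treated as "VALID". That is why ignoring such files would mean changing the heuristic itself.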