nsivabalan commented on code in PR #11514: URL: https://github.com/apache/hudi/pull/11514#discussion_r1663014814
##########
rfc/rfc-78/rfc-78.md:
##########
@@ -191,15 +198,15 @@ We need to add back these older methods to HoodieDefaultTimeline, so that we do
 - e. We need to port code changes which accounts for uncommitted log files. In 0.16.0, from FSV standpoint, all log files(including partially failed) are valid. We let the log record reader ignore the partially failed log files. But in 1.x, log files could be rolledback (deleted) by a concurrent rollback. So, the FSV should ensure it ignores the uncommitted log files.
 - f. Looks like we only have to make changes/appends to few methods in HoodieDefaultTimeline. But one option to potentially consider (if we see us making lot of changes to 0.16.0 HoodieDefaultTimeline in order to support reading 1.x tables), we could introduce Hoodie016xDefaultTimeline and Hoodie1xDefaultTimeline and use delegate pattern to delegate to either of the timelines. Using hoodie table version we could instantiate (internally to HoodieDefaultTimeline) to either of Hoodie016xDefaultTimeline or Hoodie1xDefaultTimeline. But for now, we don’t feel we might need to take this route. Just calling it out as an option depending on the changes we had to make.
+- g. Since log file ordering logic will differ from 0.16.x and 1.x, and we have a table upgrade commit time, we could leverage that to use diff log file ordering logic based on whether a file slice's base instant time is less or greater than table upgrade commit time.
 ### FileSystemView changes
 Once all timeline changes are incorporated, we need to account for FSV changes. Major change as called out earlier is the Completion time based log files from 1.x writer and the log file naming referring to delta commit time instead of base commit time. So, w/o any changes to FSV/HoodieFileGroup/HoodieFileSlice code snippets, our file slice deduction logic might be wrong.
 Each log file could be tagged as its own file slice since each has a different base commit time (thats how 0.16.x HoodieLogFile would deduce it). So, we might have to port over CompletionTimeQueryView class and associated logic to 0.16.0. So, for file slice deduction logic in 0.16.0 will be pretty much similar to 1.x reader. But the log file ordering for log reading purpose, we do not need to maintain parity with 1.x reader as of yet. (unless we make NBCC default with MDT).
 Assuming 1.x reader and 1.x FSV should be able to read data written in older hudi versions, we also have a potential option here for avoid making nit-picky changes similar to the option called out earlier. We could instantiate two different FSV depending on the table version. If table version is 7 (0.16.0), we could instantiate FSV_V0 may be and if table version is 8 (1.0.0), we could instantiate FSV_V1. So that we don’t break/regress any of 0.16.0 read functionality in the interest of supporting 1.x table reads. We should strive to cover all scenarios and not let any bugs creep in, but trying to see if we can keep the changes isolated so that battle tested code (FSV) is not touched or changed for the purpose of supporting 1.x table reads. If we run into any bugs with 1.x reads, we could ask users to not upgrade any of the writers to 1.x and stick with 0.16.0 unless we have say 1.0.1 or something. But it would be really bad if we break 0.16.0 table read in some edge case. Just calling out as one of the safe option to upgrade.
-
 #### Pending exploration:
-How partially failed log files are ignored in 1.x. I see all log files are accounted for while building FSV.
+1. We removed special suffixes to MDT operations in 1x. we need to test the flow and flush out details if anything to be added to 0.16.x reader.

Review Comment:
   We should understand whether the new commit time generation logic is foolproof. What if a concurrent ingestion into the data table coincidentally generates the same commit time?
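
To make the concern concrete, here is a minimal Python sketch. It is not Hudi code. It assumes millisecond-granularity instant times in a `yyyyMMddHHmmssSSS` layout and a `"001"` suffix for the older MDT scheme; both are simplifications for illustration, and `instant_time` and `find_collisions` are hypothetical helpers. The only point it makes is that an unsuffixed MDT instant drawn in the same millisecond as a data table commit is indistinguishable from it, whereas a suffixed one is not.

```python
from datetime import datetime


def instant_time(ts: datetime) -> str:
    """Format a timestamp as a millisecond-granularity instant (assumed layout)."""
    return ts.strftime("%Y%m%d%H%M%S") + f"{ts.microsecond // 1000:03d}"


def find_collisions(mdt_instants, data_instants):
    """Return the instant times that appear on both timelines, sorted."""
    return sorted(set(mdt_instants) & set(data_instants))


# Two writers that happen to pick a commit time in the same millisecond.
now = datetime(2024, 7, 2, 10, 15, 30, 123000)

data_commit = instant_time(now)     # concurrent ingestion in the data table
mdt_suffixed = data_commit + "001"  # hypothetical suffixed MDT operation instant
mdt_unsuffixed = instant_time(now)  # MDT operation generating its own instant

print(find_collisions([mdt_suffixed], [data_commit]))    # no overlap
print(find_collisions([mdt_unsuffixed], [data_commit]))  # same instant on both
```

If the 1.x generation logic already rules this out (for example through a lock or a uniqueness check), it would help to state that explicitly in the RFC.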