Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

via GitHub Tue, 02 Jul 2024 11:52:48 -0700


nsivabalan commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1663014814



##########
rfc/rfc-78/rfc-78.md:
##########
@@ -191,15 +198,15 @@ We need to add back these older methods to 
HoodieDefaultTimeline, so that we do
 - e. We need to port code changes which accounts for uncommitted log files. In 
0.16.0, from FSV standpoint, all log files(including partially failed) are 
valid. We let the log record reader ignore the partially failed log files. But
   in 1.x, log files could be rolledback (deleted) by a concurrent rollback. 
So, the FSV should ensure it ignores the uncommitted log files.
 - f. Looks like we only have to make changes/appends to few methods in 
HoodieDefaultTimeline. But one option to potentially consider (if we see us 
making lot of changes to 0.16.0 HoodieDefaultTimeline in order to support 
reading 1.x tables), we could introduce Hoodie016xDefaultTimeline and 
Hoodie1xDefaultTimeline and use delegate pattern to delegate to either of the 
timelines. Using hoodie table version we could instantiate (internally to 
HoodieDefaultTimeline) to either of Hoodie016xDefaultTimeline or 
Hoodie1xDefaultTimeline. But for now, we don’t feel we might need to take this 
route. Just calling it out as an option depending on the changes we had to make.
+- g. Since log file ordering logic will differ from 0.16.x and 1.x, and we 
have a table upgrade commit time, we could leverage that to use diff log file 
ordering logic based on whether a file slice's base instant time is less or 
greater than table upgrade commit time. 
 
 ### FileSystemView changes
 Once all timeline changes are incorporated, we need to account for FSV 
changes. Major change as called out earlier is the Completion time based log 
files from 1.x writer and the log file naming referring to delta commit time 
instead of base commit time. So, w/o any changes to 
FSV/HoodieFileGroup/HoodieFileSlice code snippets, our file slice deduction 
logic might be wrong. Each log file could be tagged as its own file slice since 
each has a different base commit time (thats how 0.16.x HoodieLogFile would 
deduce it). So, we might have to port over CompletionTimeQueryView class and 
associated logic to 0.16.0. So, for file slice deduction logic in 0.16.0 will 
be pretty much similar to 1.x reader. But the log file ordering for log reading 
purpose, we do not need to maintain parity with 1.x reader as of yet. (unless 
we make NBCC default with MDT).
 Assuming 1.x reader and 1.x FSV should be able to read data written in older 
hudi versions, we also have a potential option here for avoid making nit-picky 
changes similar to the option called out earlier.
 We could instantiate two different FSV depending on the table version. If 
table version is 7 (0.16.0), we could instantiate FSV_V0 may be and if table 
version is 8 (1.0.0), we could instantiate FSV_V1. So that we don’t 
break/regress any of 0.16.0 read functionality in the interest of supporting 
1.x table reads. We should strive to cover all scenarios and not let any bugs 
creep in, but trying to see if we can keep the changes isolated so that battle 
tested code (FSV) is not touched or changed for the purpose of supporting 1.x 
table reads. If we run into any bugs with 1.x reads, we could ask users to not 
upgrade any of the writers to 1.x and stick with 0.16.0 unless we have say 
1.0.1 or something. But it would be really bad if we break 0.16.0 table read in 
some edge case.  Just calling out as one of the safe option to upgrade.
 
-
 #### Pending exploration:
-How partially failed log files are ignored in 1.x. I see all log files are 
accounted for while building FSV.
+1. We removed special suffixes to MDT operations in 1x. we need to test the 
flow and flush out details if anything to be added to 0.16.x reader. 

Review Comment:
   Do we understand whether the new commit time generation logic is foolproof? 
What if a concurrent ingestion into the data table coincidentally generates the 
same commit time?
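
   To make the concern concrete, here is a minimal sketch of how two independent 
operations could end up with the same instant time once the special MDT suffixes 
are gone. It assumes instant times are millisecond-granularity strings in the 
`yyyyMMddHHmmssSSS` format; the class and method names are hypothetical and are 
not Hudi APIs.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class CommitTimeCollisionSketch {
    // Assumption: instant times are millisecond-granularity strings.
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS").withZone(ZoneOffset.UTC);

    // Hypothetical helper: derive an instant time purely from the wall clock.
    static String toInstantTime(long epochMillis) {
        return FORMAT.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        // Two independent operations that happen to read the same millisecond.
        long sameMillis = 1719946368123L;
        String dataTableIngestion = toInstantTime(sameMillis);
        String metadataTableOperation = toInstantTime(sameMillis);

        // Without a distinguishing suffix (or some other coordination),
        // the two instant times are indistinguishable.
        System.out.println("data table instant: " + dataTableIngestion);
        System.out.println("MDT operation instant: " + metadataTableOperation);
        System.out.println("collision: " + dataTableIngestion.equals(metadataTableOperation));
    }
}
```

   If the real generation logic serializes instant creation (for example under a 
lock) or otherwise guarantees uniqueness, this scenario would not apply; the RFC 
should state which guarantee the 0.16.x reader can rely on.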



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
