nsivabalan commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1676770629


##########
rfc/rfc-78/rfc-78.md:
##########
@@ -0,0 +1,339 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-78: Bridge release for 1.x
+
+## Proposers
+
+- @nsivabalan
+- @vbalaji
+
+## Approvers
+ - @yihua
+ - @codope
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-7882
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+[Hudi 1.x](https://github.com/apache/hudi/blob/ae1ee05ab8c2bd732e57bee11c8748926b05ec4b/rfc/rfc-69/rfc-69.md) is a powerful
+re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming
+years. It introduces a number of differentiating features for Apache Hudi. Feel free to check out the
+[release page](https://hudi.apache.org/releases/release-1.0.0-beta1) for more info. We had beta1 and beta2 releases, which were meant for
+interested developers/users to give some of the advanced features a spin. But as we work towards 1.0 GA, we are proposing
+a bridge release (0.16.0) for a smoother migration for existing Hudi users.
+
+## Objectives
+The goal is a smooth migration experience for users going from 0.x to 1.0. We plan to publish a 0.16.0 bridge release and ask everyone to first migrate to 0.16.0 before they upgrade to 1.x.
+
+A typical organization might have a medallion architecture deployed to run 1000s of Hudi pipelines, i.e. bronze, silver and gold layers.
+For this layout of pipelines, here is how a typical migration might look (w/o a bridge release):
+
+a. Existing pipelines are on 0.15.x (bronze, silver, gold).
+b. Migrate gold pipelines to 1.x.
+- We must strictly migrate only gold to 1.x first, because a 0.15.0 reader may not be able to read 1.x Hudi tables. If we migrate any silver pipelines to 1.x before migrating the entire gold layer, we might end up in a situation
+where a 0.15.0 reader (gold) reads a 1.x table (silver). This might lead to failures. So, we have to follow a certain order in which we migrate pipelines.
+c. Once all of gold is migrated to 1.x, we can move all of silver to 1.x.
+d. Once all gold and silver pipelines are migrated to 1.x, we can finally move all of bronze to 1.x.
+
+In the end, we would have migrated all existing Hudi pipelines from 0.15.0 to 1.x.
+But as you can see, the migration requires coordination. And in a very large organization, we may not have good control over downstream consumers,
+so coordinating and orchestrating the entire migration workflow might be challenging.
+
+Hence, to ease the migration workflow to 1.x, we are introducing 0.16.0 as a bridge release.
+
+Here are the objectives with this bridge release:
+
+- A 1.x reader should be able to read 0.14.x to 0.16.x tables w/o any loss in functionality and no data inconsistencies.
+- 0.16.x should have read capability for 1.x tables w/ some limitations. For features ported over from 0.x, no loss in functionality should be guaranteed.
+But for new features introduced in 1.x, we may not be able to support all of them. We will call out which new features may not work with the 0.16.x reader.
+- In that case, we explicitly request users not to turn on these features until all readers are completely migrated to 1.x, so as to not break any readers.
+
+Connecting back to our example above, let's see how the migration might look for an existing user.
+
+a. Existing pipelines are on 0.15.x (bronze, silver, gold).
+b. Migrate pipelines to 0.16.0, in any order; we do not have any constraints around which pipeline should be migrated first.
+c. Ensure all pipelines are on 0.16.0 (both readers and writers).
+d. Start migrating pipelines in a rolling fashion to 1.x. At this juncture, we could have a few pipelines on 1.x and a few pipelines on 0.16.0, but since 0.16.x
+can read 1.x tables, we should be ok here. Just do not enable new features like non-blocking concurrency control yet.
+e. Migrate everything remaining on 0.16.0 to 1.x.
+f. Once all readers and writers are on 1.x, we are good to enable any new features (like NBCC) with 1.x tables.
+
+As you can see, the company/org-wide coordination to migrate gold before silver or bronze is relaxed with the bridge release. The only requirement to keep a tab on
+is to migrate all pipelines completely to 0.16.x before starting to migrate to 1.x.
+
+So, here are the objectives of this RFC with the bridge release:
+- A 1.x reader should be able to read 0.14.x to 0.16.x tables w/o any loss in functionality and no data inconsistencies.
+- 0.16.x should have read capability for 1.x tables w/ some limitations. For features ported over from 0.x, no loss in functionality should be guaranteed.
+  But for new features being introduced in 1.x, we may not be able to support all of them. We will call out which new features may not work with the 0.16.x reader.
+- Document steps for a rolling upgrade from 0.16.x to 1.x with minimal downtime.
+- Document the downgrade from 1.x to 0.16.x, with callouts on any functionality loss.
+
+### Considerations when choosing the migration strategy
+- While the migration is happening, we want to allow readers to continue reading data. This means we cannot employ a stop-the-world strategy while migrating.
+All the actions that we perform as part of the table upgrade should not have any side effect of breaking snapshot isolation for readers.
+- Also, users should have migrated to 0.16.x before upgrading to 1.x. We do not want to add read support for very old versions of Hudi in 1.x (e.g. 0.7.0).
+- So, in an effort to bring everyone to the latest Hudi versions, the 1.x reader will have full read capabilities for 0.16.x, but for older Hudi versions, the 1.x reader may not have full reader support.
+The recommended guideline is to upgrade all readers and writers to 0.16.x, and then slowly start upgrading to 1.x (readers followed by writers).
+
+Before we dive in further, let's understand the format changes:
+
+## Format changes
+### Table properties
+- Payload class ➝ payload type.
+- hoodie.record.merge.mode is introduced in 1.x. 
+- New metadata partitions could be added (optionally enabled)
+
+### MDT changes
+- New MDT partitions are available in 1.x. MDT schema upgraded.
+- RLI schema is upgraded to hold row positions.
+
+### Timeline:
+- [storage changes] Completed write commits have completion times in the file name (timeline commit files).
+- [storage changes] Completed and inflight write commits are in avro format; they were json in 0.x.
+- We are switching the action type for pending clustering from "replacecommit" to "cluster".
+- [storage changes] Archived timeline ➝ LSM timeline. There is no archived timeline in 1.x.
+- [In-memory changes] HoodieInstant changes due to the presence of a completion time for completed HoodieInstants.
+
+### Filegroup/FileSlice changes:
+- Log file names contain the delta commit time instead of the base instant time.
+- Log appends are disabled in 1.x. In other words, each log block is written to a new log file.
+- The file slice determination logic for log files changed. In 0.x, log file names carry the base instant time, so slicing is straightforward.
+In 1.x, we find the completion time for a log file and assign the log file to the file slice whose base instant time (parsed from base files), for the given HoodieFileGroup,
+is the highest value less than the completion time of the log file of interest.
+- Log file ordering within a file slice changed. In 0.x, we use (base instant time ➝ log file version ➝ write token) to order different log files. In 1.x, we will use the completion time to order them.
+- A rollback in 0.x appends a new rollback block (a new log file), while in 1.x, a rollback will remove the partially failed log files.
+
+### Log format changes:
+- We have added a new header type, IS_PARTIAL in 1.x.
+
+## Changes to be ported over to 0.16.x to support reading 1.x tables
+
+### What will be supported
+- For features introduced in 0.x, and tables written in 1.x, the 0.16.0 reader should be able to provide consistent reads w/o any breakage.
+
+### What will not be supported
+- A 0.16.x writer cannot write to a table that has been upgraded to or created using 1.x without first downgrading it to 0.16.x. This might be obvious, but calling it out nevertheless.
+- For new features introduced in 1.x, we may or may not have full support with the 0.16.x reader.
+
+| 1.x feature written by 1.x writer | 1.x reader | 0.16.x reader |
+|-----------------------------------|------------|---------------|
+| Deletion vector | Supported | Falls back to key based merges, giving up on the perf optimization |
+| Partial merges/updates | Supported | Fails with a clear error message stating that partial merging is not supported |
+| Functional indexes | Supported | Not supported. Perf optimization may not kick in |
+| Secondary indexes | Supported | Not supported. Perf optimization may not kick in |
+| NBCC or completion time based log file ordering in a file slice | Supported | Not supported. Will use log file version and write token based ordering |
+
+
+### Timeline
+- Commit instants w/ completion times should be readable. The HoodieInstant parsing logic that parses the completion time should be ported over.
+- Commit metadata in avro (instead of json) should be readable.
+   - More details on this under the Implementation section.
+- Pending clustering commits using the "cluster" action should be readable by the 0.16.0 reader.
+- HoodieDefaultTimeline should be able to support both the 0.x timeline and the 1.x timeline.
+   - More details on this under the Implementation section.
+- Should we port the LSM timeline reader to 0.16.x as well? Our goal here is to support snapshot, time travel and incremental queries for 1.x tables. Strictly speaking, we can only serve all 3 of these queries over uncleaned
+instants, i.e. over instant ranges where the cleaner has not been executed. So, if we guarantee that the 1.x active timeline will definitely contain all the uncleaned instants, we could get away without even porting
+the LSM timeline reader logic to 0.16.0.
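The dual-format HoodieInstant parsing can be sketched as follows. This is a minimal illustration only: the file-name layouts below are simplified stand-ins for the real timeline formats, not the literal Hudi naming.

```python
import re

# Simplified stand-ins for timeline file names (not the exact Hudi layouts):
#   0.x completed commit: "<request_time>.<action>"
#   1.x completed commit: "<request_time>_<completion_time>.<action>"
ONE_X = re.compile(r"^(\d+)_(\d+)\.(\w+)$")
ZERO_X = re.compile(r"^(\d+)\.(\w+)$")

def parse_instant(file_name):
    """Return (request_time, completion_time, action); completion_time is None for 0.x files."""
    m = ONE_X.match(file_name)
    if m:
        return m.group(1), m.group(2), m.group(3)
    m = ZERO_X.match(file_name)
    if m:
        return m.group(1), None, m.group(2)
    raise ValueError("unrecognized instant file: " + file_name)
```

A single parser accepting both layouts lets HoodieDefaultTimeline hold a mix of 0.x and 1.x instants without branching at every call site.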
+
+#### Completion time based read in FileGroup/FileSlice grouping and log block reads
+- As of this write-up, we are not supporting completion time based log file ordering in 0.16.x. We are punting on it, as called out earlier.
+- What's the impact without this support?
+    - For users in single writer mode or using OCC in 1.x, the 0.16.0 reader not supporting completion time based reads (log file ordering) should still be ok. Both the 1.x reader and the 0.16.0 reader should have the same behavior.
+    - Only if someone has NBCC writes, with log files completed in a different order compared to the log file versions, might the 0.16.0 reader run into data consistency issues. But since 0.16.0 is a bridge release, we recommend users migrate all readers fully to 1.x before starting to enable any new features on 1.x tables.
+    - Example scenario: say we have lf1_10_25 and lf2_15_20 (format "logfile[index]_[starttime]_[completiontime]") in a file slice. The 1.x reader will order and read them as lf2 followed by lf1. W/o this support, the 0.16.0 reader might read lf1 followed by lf2. To reiterate, this only impacts users who have enabled NBCC and have multiple writers whose log files complete in a different order. If they were using OCC, one of the writers would be expected to have failed (on the writer side) in 1.x, since the data overlaps between the two writers.
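The example above can be replayed with two tiny comparators. This is an illustrative sketch; the tuple layout is invented here, not a Hudi data structure.

```python
# Each log file: (name, log_version, write_token, start_time, completion_time)
log_files = [
    ("lf1", 1, "0-0-1", 10, 25),
    ("lf2", 2, "0-0-1", 15, 20),
]

def order_0x(files):
    # 0.16.x reader: order by log file version, then write token.
    return [name for name, *_ in sorted(files, key=lambda f: (f[1], f[2]))]

def order_1x(files):
    # 1.x reader: order by completion time.
    return [name for name, *_ in sorted(files, key=lambda f: f[4])]
```

For this file slice, `order_0x` yields lf1 before lf2, while `order_1x` yields lf2 before lf1, which is exactly the divergence that only NBCC multi-writer tables can observe.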
+
+### FileSystemView:
+- Support ignoring partially failed log files in the FSV. In 0.16.0, from the FSV standpoint, all log files (including partially failed ones) are valid; we let the log record reader ignore the partially failed log files. But
+  in 1.x, log files could be rolled back (deleted) by a concurrent rollback. So, the FSV should ensure it ignores uncommitted log files.
+- We don't need the completion time logic ported over, either for file slice determination or for log file ordering.
+    - So, here is how file slice determination will happen in the 0.16.0 reader:\
+      a. Read base files and assign them to their respective file groups. \
+      b. Read log files. Parse the instant time from the log file name (it could refer to the base instant time for a file written in 0.16.x, or to the delta commit time in the case of a 1.x writer). Find the largest (<=) base instant time in the corresponding file group
+      and assign the log file to it. \
+      c. Log files within a file slice are ordered based on log version and write token.\
+      The same logic will be used whether we are reading a 0.16.x table or a 1.x table. The only difference in how a 1.x reader behaves in comparison to a 0.16.x reader while reading a 1.x table is when NBCC is involved w/ multi-writers. But as of this writing,
+  we are not supporting that in 0.16.x.
+    - Let's see an illustrative example.\
+      i. Table written in 0.16.x. Say the files are bf1_t10, lf1_t10, lf2_t10, bf2_t100, lf3_t100, lf4_t100.\
+      If we run through the above algo, lf1 and lf2 will be assigned to the file slice with base instant time t10,\
+  and lf3 and lf4 will be assigned to the file slice with base instant time t100.\
+      ii. Table written in 1.x. Say the files are bf1_t10, lf1_t15_{t50}, lf2_t60_{t90}, bf2_t100, lf3_t110_{t140}, lf4_t150_{t200}. The times within braces are completion times, which are not really part of the log file name, but are shown just for illustration.\
+      If we run through the above algo, lf1 and lf2 will be assigned to the file slice with base instant time t10 (since the max base instant time less than the instant times of these log files (start time/delta commit time) is t10),\
+      and lf3 and lf4 will be assigned to the file slice with base instant time t100 by similar logic.\
+- FSV building/reading should account for commit instants in both 0.x and 1.x formats. Since completion time based log file ordering is not supported in 0.16.0 (that's our current assumption), we may not need to match exactly what the FSV in the 1.x reader supports. But the items below should be supported.
+    - File slicing should be intact. For a single writer and OCC based writers in 1.x, the mapping of log files to base instant times or file slices should be the same across the 1.x reader and the 0.16.0 reader.
+    - Log file ordering will follow the 0.16.0 logic, i.e. a log version followed by write token based comparator. Even if there are log files which completed in a different order using a 1.x writer, since we don't plan to support that feature in the 0.16.0 reader, we maintain parity with the 0.16.0 log file ordering.
+    - We might have to revisit this if we plan to make NBCC the default with the MDT in 1.x.
+- For what will break without completion time based read support, check [here](#completion-time-based-read-in-filegroupfileslice-grouping-and-log-block-reads).
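The slicing steps above can be sketched as a lookup of the largest base instant time not exceeding each log file's instant time. This is a simplified model; the real code works over HoodieFileGroup objects and also handles file groups without base files.

```python
import bisect

def assign_log_files(base_instants, log_files):
    """Map each (log_file, instant_time) to the largest base instant time <= instant_time.
    base_instants must be sorted ascending; file groups w/o base files are out of scope here."""
    slices = {t: [] for t in base_instants}
    for name, t in log_files:
        i = bisect.bisect_right(base_instants, t) - 1
        if i >= 0:
            slices[base_instants[i]].append(name)
    return slices
```

Running the 1.x example from above (log instant times 15, 60, 110, 150 against base instants 10 and 100) assigns lf1/lf2 to the t10 slice and lf3/lf4 to the t100 slice, matching both the 1.x and 0.16.0 readers for single-writer and OCC tables.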
+
+Bringing this all together, let's see how the different readers behave in some scenarios.
+
+| Table state in 1.x | 1.x reader | 0.16.x reader |
+|--------------------|------------|---------------|
+| Base file + 2 log files written sequentially -> snapshot read | Reads the 2 log files in order and merges w/ base file records | Reads the 2 log files in order and merges w/ base file records |
+| Base file + 2 log files written concurrently w/ OCC, so one of the writers aborted (rollback yet to kick in) -> snapshot read | Reads just 1 log file and 1 base file, as uncommitted log files will be filtered out | Reads just 1 log file and 1 base file, as uncommitted log files will be filtered out |
+| Base file + 2 log files written concurrently w/ NBCC, where by completion time log file 2 completed earlier than log file 1 -> snapshot read | Merges all log records; the order is log file 2 followed by log file 1 | Merges all log records; the order is log file 1 followed by log file 2. We do not support this ordering in the 0.16.x reader anyway |
+| FG1 having its 3rd file slice in pending compaction and a few log files added to the latest file slice | The latest file slice should be merged w/ log files from the previous file slice; log files are ordered based on completion times | The latest file slice should be merged w/ log files from the previous file slice; log files are ordered based on log version and write token |
+| Time travel query | Works as expected | Assuming the commit time is in the active timeline, we are good. Else, LSM timeline read support needs to be ported over |
+| Incremental query | Works as expected | Assuming the commit time is in the active timeline, we are good. Else, LSM timeline read support needs to be ported over |
+
+
+### Table properties
+- Payload type inference from the payload class needs to be ported.
+- hoodie.record.merge.mode needs to be ported over.
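The payload-class to merge-mode inference could look roughly like this. The mapping below is our assumption about the 1.x inference rules, for illustration only; it is not the ported code.

```python
def infer_merge_mode(payload_class):
    """Infer hoodie.record.merge.mode from a payload class name (assumed mapping)."""
    if payload_class.endswith("OverwriteWithLatestAvroPayload"):
        # Latest-write-wins payload maps to commit time ordering.
        return "COMMIT_TIME_ORDERING"
    if payload_class.endswith("DefaultHoodieRecordPayload"):
        # Ordering-field-aware payload maps to event time ordering.
        return "EVENT_TIME_ORDERING"
    # Any other payload class is treated as a custom merge implementation.
    return "CUSTOM"
```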
+
+### MDT changes:
+- MDT schema changes need to be ported.
+
+### Log reader/format changes:
+- The new log header type needs to be ported.
+- For unsupported features, meaningful errors should be thrown. E.g., if partial update has been enabled and those log files are read using the 0.16.0 reader, we should fail and throw a meaningful error.
+- For deletion vectors, we should add the fallback option to 0.16.0 so that we give up on the perf optimization, but still support reads w/o any failures.
+
+### Read optimized query:
+- No explicit changes are required (apart from the timeline and FSV changes) to support read optimized queries.
+
+### Incremental reads and time travel query:
+- Incremental reads in 1.x are expected to have some changes, and the design is not fully out. So, until then, we have to wait to design 0.16.x read support for incremental queries on 1.x tables.
+- Time travel query: I don't see any changes in 1.x wrt time travel queries. As per master, we are still using the commit time (i.e. not the completion time) to serve time travel queries. Until we make any changes to that, we do not need to add any additional support to the 0.16.x reader. But if we plan to make changes in 1.x, we might have to revisit this.
+
+### CDC reads:
+- There have not been many changes in 1.x wrt CDC. The only minor change was in HoodieAppendHandle, but that affects only the writer side logic. So, we are good wrt CDC, i.e. we do not need to port
+any additional logic to the 0.16.x reader just for the purpose of CDC, beyond the changes covered in this RFC.
+
+## 0.16.0 ➝ 1.0 upgrade steps
+This will be an automatic upgrade for users when they start using the 1.x Hudi library. Listing the changes we might need to make during the upgrade:
+- Rewrite the archived timeline to the LSM based timeline.
+- Do not touch the active timeline, since there could be concurrent readers reading the table. So, the 1.x reader should be capable of handling a timeline w/ a mix of 0.x commit files and 1.x commit files.
+- But we need to trigger rollbacks of any failed writes using the 0.16.x rollback logic. In 1.x, rollback deletes log files based on the delta commit times in the log file naming, so that rollback logic may not work for
+log files written in 0.16.0, where we have log appends. Hence, we need to trigger rollbacks of failed writes using the 0.16.x rollback flow and not the 1.x flow.
+- No changes to log reader.
+- Check for a custom payload class and hoodie.record.merge.mode in the table properties and switch to the respective 1.x properties.
+- Debatable: trigger compaction for the latest file slices. We do not want a single file slice having a mix of log files from 0.x and log files from 1.x. So, we would trigger a full compaction
+of the table to ensure all latest file slices have just the base files. We need to dig deeper on this (and have a detailed discussion w/ the initial authors who designed this). But if we ensure all partially failed writes are rolled back completely,
+we should be good to upgrade w/o needing to trigger a full compaction.
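The ordering of the steps above matters: the 0.16.x-style rollbacks must run before any 1.x logic touches the table. A sketch of the flow, with the step names standing in for the real (hypothetical) upgrade handlers:

```python
def upgrade_to_1x(table, steps_log):
    """Record the upgrade steps in order; steps_log stands in for real handlers."""
    # 1. Roll back failed writes w/ the 0.16.x rollback flow: 1.x rollback deletes
    #    log files by delta commit time, which does not match 0.16.x log appends.
    steps_log.append("rollback_failed_writes_0x")
    # 2. Rewrite the archived timeline into the 1.x LSM timeline.
    steps_log.append("rewrite_archived_timeline_to_lsm")
    # 3. Leave the active timeline untouched; concurrent readers may be on it.
    # 4. Translate payload class / merge mode table properties.
    steps_log.append("upgrade_table_properties")
    return steps_log
```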
+
+Let's walk through an example.
+Say we have a file group whose latest file slice is fs3, and it has 2 log files written in 0.x:
+  fs3(t3):
+     lf1_(3_10) lf2_(3_10)
+Here the format is lf[log version]_([commit time]_[completion time]).
+The base file instant time is 3, so all log files (in memory) have the same begin and completion time. (Remember, completion time is only applicable to commits and not to data files.) So, from the 1.x reader's standpoint,
+lf1 and lf2 have delta commit time t3 (which matches the base file's instant time) and completion time t10.
+
+Let's trigger an upgrade and add 2 new files:
+
+fs3(t3):
+lf1_(3_10) lf2_(3_10) [upgrade] lf3_(15_20) lf4_(40_100)
+
+With this layout, the 1.x reader should be able to read this file slice correctly w/o any issues. So, we should not require a full compaction during the upgrade. But we need to consult w/ the authors to ensure we are not missing anything.
+
+## 1.0 ➝ 0.16.0 downgrade
+Any new features that were introduced in 1.x may not work after the downgrade.
+Here are the ones that might see a behavior change:
+- If deletion vectors are enabled, after the downgrade we might fall back to key based merges, giving up on the performance optimization.
+- If partial updates are enabled, after the downgrade we might fall back to key based merges, giving up on the performance optimization.
+- Any functional indexes built in 1.x might be deleted during the downgrade.
+- Any secondary indexes built in 1.x might be deleted during the downgrade.
+- Completion time based log file ordering within a file slice may not be honored.
+
+### Users will have to use hudi-cli to do the downgrade. Here are the steps we might need to perform during the downgrade
+- Similar to our upgrade, let's trigger a full compaction of the latest file slices so that all latest file slices only have base files. We do not want any file slices w/ a mix of 1.x log files and 0.x log files.
+We can afford to take some downtime during a downgrade, but the upgrade path should be kept as minimal as possible. So, during the upgrade, we are not enforcing this.
+- Rewrite the LSM based timeline to the archived timeline. We have to deduce the writer properties and introduce boundaries based on that (i.e. until which commit to archive).
+- We have two options wrt handling active timeline instants (e.g. pending clustering instants):
+   A: No changes to active instants. In order to support 1.x tables w/ the 0.16.0 reader, these changes might already have been ported to the 0.16.0 reader. But we might have to ensure that a downgrade from 0.16.0 to any older version of Hudi takes care of rewriting the active timeline instants.
+   B: While downgrading from 1.x to 0.16.0, rewrite the active timeline instants too, so that after the downgrade, we can't differentiate whether the table was natively created in 0.16.0 or was upgraded to 1.x and later downgraded. The con in option A is not an issue here, since we take care of the active timeline instants during the 1.x to 0.16.0 downgrade itself. So, a downgrade from 0.16.0 to any lower Hudi version does not need any special handling regardless of whether the table was native to 0.16.0 or was downgraded from 1.x to 0.16.0.
+   Our proposal is to go with option B, to ensure the 0.16.0 table will be intact and not have any residue from 1.x writes. While the rewrite of active instants is happening, we could see some read failures. But
+   we do not want a 0.16.0 (post downgrade) table to deal w/ 1.x timeline intricacies.
+- If there are new MDT partitions (functional index, secondary index), nuke them. Update the table properties.
+- Check the latest snapshot files for log files w/ new features enabled, e.g. deletion vectors or partial updates. If any such log files are found in the latest snapshot, we have to trigger compaction for those file groups. This calls for a custom compaction strategy which compacts only certain file groups.
+   - In order to reduce the downtime of the downgrade, we are inclined to do this only for snapshot reads. If there are older file slices and users issue time travel or incremental reads, and they have any of the new features enabled (deletion vector, partial update), readers could break after the downgrade.
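The custom compaction strategy mentioned above boils down to a filter over the latest file slices. A sketch, with the data layout invented for illustration:

```python
NEW_1X_LOG_FEATURES = {"deletion_vector", "partial_update"}

def file_groups_needing_compaction(latest_slices):
    """latest_slices: {file_group_id: [(log_file_name, {features}), ...]} for the latest slice.
    A file group must be compacted before downgrade if its latest file slice
    carries log files written w/ 1.x-only features."""
    return sorted(
        fg for fg, log_files in latest_slices.items()
        if any(features & NEW_1X_LOG_FEATURES for _, features in log_files)
    )
```

Only the file groups returned by this filter need compaction, which keeps the downgrade downtime proportional to the number of file groups actually using the new features.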
+
+## Implementation details
+
+To ensure read capability of 1.x tables in 0.16.0, we document the changes required at the class level. The other two sections (the 0.16.0 to 1.x upgrade and the 1.x to 0.16.0 downgrade) should be fairly straightforward, so we are not calling out the obvious.
+
+### Timeline changes:
+Let's reiterate what we need to support w/ the 0.16.0 reader.
+
+### Timeline reads of 1.x tables need to be supported
+- Commit instants w/ completion times should be readable. The HoodieInstant parsing logic that parses the completion time should be ported over.
+- We can ignore the completion time log file ordering semantics, since we don't plan to support them yet in 0.16.0. But reading should not break.
+- Pending clustering commits using the "cluster" action should be readable by the 0.16.0 reader.

Review Comment:
   NTR:
   As an optimization, we are planning to consider a "table upgrade commit time" to help us here. Just after upgrading to 1.x, we have to consider both pending replace commits and pending clustering commits (for clustering operations). But at some point eventually, we want to sunset checking the replace commit timeline and only consider the clustering timeline (in 1 month, 2 months, or the next release, 1.0.1). Until then, this "table upgrade commit time" config will help us do deterministic processing. For e.g., once the active timeline's first entry is > the "table upgrade commit time", it is safe to consider only the pending clustering timeline and ignore the pending replace timeline.
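   The check described here could be sketched as follows; the names are hypothetical, standing in for whatever the config and timeline APIs end up being:

```python
def use_clustering_timeline_only(first_active_instant, table_upgrade_commit_time):
    """Hypothetical check: once the earliest instant still in the active timeline is
    newer than the upgrade commit time, no 0.x-style pending replacecommits for
    clustering can remain, so only the "cluster" action timeline needs checking."""
    return first_active_instant > table_upgrade_commit_time
```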
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
