[ 
https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8076:
---------------------------------
    Description: 
*(!) Work in Progress.* 
h3. Basic Idea: 

Introduce support for Hudi writer code to produce the storage format for the last 
2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to continue 
reading the table even when writers are upgraded, as long as the writer produces 
storage compatible with the latest table version the reader can read. The 
readers can then be rolling-upgraded easily, at a different cadence, without any 
tight coordination. Additionally, the reader should have the ability to 
"dynamically" deduce the table version from table properties, so that when 
the writer is switched to the latest table version, subsequent reads just 
adapt and read it as the latest table version. 
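
The dynamic deduction above could be sketched as follows. The property key `hoodie.table.version` is a real Hudi table property, but the class name, fallback, and parsing here are illustrative assumptions, not existing Hudi code:

```java
import java.util.Properties;

// Illustrative sketch only: resolve the table version a reader should use
// from table properties (e.g. hoodie.properties), so the reader adapts when
// the writer switches versions. Class name and fallback are assumptions.
public class TableVersionResolver {

    static final String TABLE_VERSION_KEY = "hoodie.table.version";
    // Assumed fallback for tables that predate the property.
    static final int FALLBACK_VERSION = 6;

    public static int resolve(Properties tableProps) {
        String v = tableProps.getProperty(TABLE_VERSION_KEY);
        return (v == null || v.trim().isEmpty())
            ? FALLBACK_VERSION
            : Integer.parseInt(v.trim());
    }
}
```

A reader would call this per table (or per refresh), rather than caching the version for the process lifetime, so that a writer-side switch is picked up without reader restarts.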

Operators still need to ensure all readers have the latest binary that supports 
a given table version before switching the writer to that version. Special 
consideration goes to table services, which, as reader/writer processes, should 
be able to manage the tables as well. Queries should gracefully fail during 
table version switches and eventually start succeeding once the writer completes 
the switch. Writers/table services should fail if working with an unsupported 
table version, without which one cannot start switching writers to a new version 
(this may still need a minor release on the last 2-3 table versions?)
h3. High level approach: 

We need to introduce table version aware reading/writing inside the core layers 
of Hudi, as well as query engines like Spark/Flink. 

To this effect, we need a HoodieStorageFormat abstraction that can cover the 
following layers. 
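
One possible shape for that abstraction: one instance per supported table version bundles the per-layer format decisions. Everything below (method names, the factory, the version mappings) is an assumption for illustration; the RFC only names the abstraction:

```java
// Illustrative sketch of a HoodieStorageFormat abstraction: one instance per
// supported table version bundles the format decisions for each layer.
// Method names and the version mappings are assumptions, not Hudi APIs.
public interface HoodieStorageFormat {

    int tableVersion();            // 6, 7 or 8
    int timelineLayoutVersion();   // which timeline layout to read/write
    int logFormatVersion();        // which log block encoding to emit

    // Writers/table services fail fast on unsupported versions, per the RFC.
    static HoodieStorageFormat forTableVersion(int tv) {
        if (tv < 6 || tv > 8) {
            throw new IllegalArgumentException("Unsupported table version: " + tv);
        }
        final int layoutVersion = (tv >= 7) ? 2 : 1; // assumed mapping
        final int logVersion = (tv >= 7) ? 3 : 2;    // assumed mapping
        return new HoodieStorageFormat() {
            public int tableVersion() { return tv; }
            public int timelineLayoutVersion() { return layoutVersion; }
            public int logFormatVersion() { return logVersion; }
        };
    }
}
```

The point of the factory is that write paths pick the format bundle once, from the target table version, instead of scattering per-version branches across the timeline, write-handle, metadata-table and log-format layers.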

Timeline : Timeline already has a timeline layout version, which can be 
extended to write older and newer timelines. The ArchivedTimeline can be old 
style or LSM style, depending on the table version. 
{*}TBD{*}: whether or not completion-time-based changes can be retained, 
assuming the instant file creation timestamp. 
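
The archived-timeline choice could hang off the table version like this. The threshold (LSM style from table version 8) is an assumption for illustration:

```java
// Illustrative sketch: choose the archived-timeline layout from the table
// version, keeping the legacy layout for older table versions so that old
// readers keep working. The version threshold is an assumption.
public class ArchivedTimelineSelector {

    public enum Style { LEGACY, LSM }

    public static Style styleFor(int tableVersion) {
        return (tableVersion >= 8) ? Style.LSM : Style.LEGACY;
    }
}
```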

WriteHandle : This layer may or may not need changes, as base files don't 
really change, and the log format is already versioned (see next point). However, 
it's prudent to ensure we have a mechanism for this, since there could be 
different ways of encoding records or footers etc. (e.g. HFile k-v pairs, ...) 

Metadata table: encoding of k-v pairs, their schemas etc., and which partitions 
are supported in which table versions. *TBD*: see if any of the code around the 
recent simplification needs to be undone. 
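
Partition availability per table version could be gated on the writer side, so a writer targeting an older table version never emits metadata partitions an older reader cannot parse. The partition sets per version below are assumptions, not the actual Hudi mapping:

```java
import java.util.Set;

// Illustrative sketch: gate metadata table partitions on the table version
// the writer targets. Which partition exists at which version is assumed
// here purely for illustration.
public class MetadataPartitions {

    public static Set<String> availableFor(int tableVersion) {
        if (tableVersion >= 8) {
            return Set.of("files", "column_stats", "record_index", "secondary_index");
        }
        if (tableVersion >= 6) {
            return Set.of("files", "column_stats", "record_index");
        }
        return Set.of("files");
    }
}
```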

LogFormat Reader/Writer : the blocks and the format itself are versioned, but 
this is not yet tied to the overall table version. So we need these links so that 
the reader can, for example, decide how to read/assemble blocks during log file 
scanning. FileGroupReader abstractions may need to change as well. 
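
Tying the (already versioned) log format to the table version could look like a writer-side compatibility check. The version table below is an assumption for illustration, not the real mapping:

```java
import java.util.Map;

// Illustrative sketch: a writer-side check that the log block version being
// emitted is readable at the target table version, linking the log format
// version to the table version. The mapping below is assumed.
public class LogFormatCompat {

    // Assumed highest log block version readable at each table version.
    private static final Map<Integer, Integer> MAX_LOG_VERSION =
        Map.of(6, 2, 7, 3, 8, 3);

    public static boolean canWrite(int tableVersion, int logBlockVersion) {
        Integer max = MAX_LOG_VERSION.get(tableVersion);
        return max != null && logBlockVersion <= max;
    }
}
```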


> RFC for backwards compatible writer mode in Hudi 1.0
> ----------------------------------------------------
>
>                 Key: HUDI-8076
>                 URL: https://issues.apache.org/jira/browse/HUDI-8076
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Ethan Guo
>            Assignee: Vinoth Chandar
>            Priority: Major
>             Fix For: 1.0.0
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
