[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

Vinoth Chandar (Jira) Thu, 17 Aug 2023 07:00:06 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Vinoth Chandar updated HUDI-6242:
---------------------------------
    Description: 
This EPIC tracks changes to the Hudi storage format.
Format change is anything that changes any bits related to
 - *Timeline* : active or archived timeline contents, file names.
 - {*}Base Files{*}: file format versions, any changes to any data types, file 
footers, file names.
 - {*}Log Files{*}: Block structure, content, names.
 - {*}Metadata Table{*}: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row mappings.
 - {*}Table properties{*}: What's written to hoodie.properties.
 - *Marker files* : Can be left to the writer implementation.

h2. Change summary:

The following functionality should be supportable by the new format tech specs 
(at a minimum)

Flexibility :
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g. images, JSON, vectors, ...)
 - Easy integration of metadata for JVM and non-JVM clients

Metafields :
 - Should _recordkey have special handling for UUIDs?

Additional Info:
 - Support encoding of watermarks/event time fields as first-class citizens, 
for handling late-arriving data.
 - Position-based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers

Log :
 - Support writing updates as deletes and inserts, instead of logging them as 
updates to the base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places

Concurrency/Timeline:
 - Ability to support general-purpose multi-table transactions, especially 
between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in the face of conflicts.
 - Support long-lived instants in the timeline; break down the distinction 
between active/archived
 - Support checking of uniqueness constraints, even in the face of two 
concurrent insert transactions.
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evolution implementation.
 - TrueTime store/support for instant times
 - No more separate rollback action; make it a new state.

Metadata table :
 - Encode filegroup ID and commit time along with file metadata

Table Properties:
 - Partitioning information/indexing info

  was:
This EPIC tracks changes to the Hudi storage format.
Format change is anything that changes any bits related to
 - *Timeline* : active or archived timeline contents, file names.
 - {*}Base Files{*}: file format versions, any changes to any data types, file 
footers, file names.
 - {*}Log Files{*}: Block structure, content, names.
 - {*}Metadata Table{*}: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row mappings.
 - {*}Table properties{*}: What's written to hoodie.properties.
 - *Marker files* : Can be left to the writer implementation.

h2. Change summary:

The following functionality should be supportable by the new format tech specs 
(at a minimum)

Flexibility :
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g. images, JSON, vectors, ...)
 - Easy integration of metadata for JVM and non-JVM clients

Metafields :
 - Should _recordkey have special handling for UUIDs?

Additional Info:
 - Support encoding of watermarks/event time fields as first-class citizens, 
for handling late-arriving data.
 - Position-based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers

Log :
 - Support writing updates as deletes and inserts, instead of logging them as 
updates to the base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places

Concurrency/Timeline:
 - Ability to support general-purpose multi-table transactions, especially 
between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in the face of conflicts.
 - Support long-lived instants in the timeline; break down the distinction 
between active/archived
 - Support checking of uniqueness constraints, even in the face of two 
concurrent insert transactions.
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evolution implementation.
 - TrueTime store/support for instant times

Metadata table :
 - Encode filegroup ID and commit time along with file metadata

Table Properties:
 - Partitioning information/indexing info


> Format changes for Hudi 1.X release line
> ----------------------------------------
>
>                 Key: HUDI-6242
>                 URL: https://issues.apache.org/jira/browse/HUDI-6242
>             Project: Apache Hudi
>          Issue Type: Epic
>          Components: core
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>              Labels: hudi-umbrellas
>             Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything that changes any bits related to
>  - *Timeline* : active or archived timeline contents, file names.
>  - {*}Base Files{*}: file format versions, any changes to any data types, 
> file footers, file names.
>  - {*}Log Files{*}: Block structure, content, names.
>  - {*}Metadata Table{*}: (should we call this index table instead?) partition 
> names, number of file groups, key/value schema and metadata to MDT row 
> mappings.
>  - {*}Table properties{*}: What's written to hoodie.properties.
>  - *Marker files* : Can be left to the writer implementation.
> h2. Change summary:
> The following functionality should be supportable by the new format tech 
> specs (at a minimum)
> Flexibility :
>  - Ability to mix different types of base files within a single table or even 
> a single file group (e.g. images, JSON, vectors, ...)
>  - Easy integration of metadata for JVM and non-JVM clients
> Metafields :
>  - Should _recordkey have special handling for UUIDs?
> Additional Info:
>  - Support encoding of watermarks/event time fields as first-class citizens, 
> for handling late-arriving data.
>  - Position-based skipping of base file
>  - Additional metadata to avoid more RPCs to scan base file/log blocks.
>  - ML/Column family use-case?
>  - Support having changeset of columns in each write, other headers
> Log :
>  - Support writing updates as deletes and inserts, instead of logging them as 
> updates to the base file.
>  - CDC format is GA.
> Table organization:
>  - Support different logical partitions on the same data
>  - Storage of table spread across buckets/root folders
>  - Decouple table location from timeline, metadata. They can all be in 
> different places
> Concurrency/Timeline:
>  - Ability to support general-purpose multi-table transactions, especially 
> between data and metadata tables.
>  - Support lockless/non-blocking transactions, where writers don't block each 
> other even in the face of conflicts.
>  - Support long-lived instants in the timeline; break down the distinction 
> between active/archived
>  - Support checking of uniqueness constraints, even in the face of two 
> concurrent insert transactions.
>  - Support precise time-travel queries
>  - Support time-travel writes.
>  - Support schema history tracking and aid in schema evolution implementation.
>  - TrueTime store/support for instant times
>  - No more separate rollback action; make it a new state.
> Metadata table :
>  - Encode filegroup ID and commit time along with file metadata
> Table Properties:
>  - Partitioning information/indexing info



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
