[ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
    Description: 
This EPIC tracks changes to the Hudi storage format.

A format change is anything that changes any bits related to:

 - *Timeline*: active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*: block structure, content, names.
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings.
 - *Table properties*: what's written to hoodie.properties.
 - *Marker files*: how would we treat these?


The following functionality should be supportable by the new format tech specs 
(at a minimum) 

Flexibility : 
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g. images, JSON, vectors, ...).
 - Easy integration of metadata for JVM and non-JVM clients.

Metafields :
 - Should _recordkey be a UUID with special handling?


Additional Info:
 - Support encoding of watermarks/event time fields as first-class citizens, 
for handling late-arriving data.
 - Position-based skipping of base file.
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/column family use case?
 - Support having a changeset of columns in each write, other headers.


Log : 
 - Support writing updates as deletes and inserts, instead of logging them as 
updates to the base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of a table spread across buckets/root folders.
 - Decouple table location from timeline and metadata; they can all be in 
different places.


Concurrency/Timeline: 
 - Ability to support general-purpose multi-table transactions, especially 
between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in the face of conflicts.
 - Support for long-lived instants in the timeline; break down the distinction 
between active/archived.
 - Support checking of uniqueness constraints, even in the face of two 
concurrent insert transactions.
 - Support precise time-travel queries.
 - Support time-travel writes.
 - Support schema history tracking and aid in the schema evolution 
implementation.
 - TrueTime store/support for instant times.

Metadata table :
 - Encode filegroup ID and commit time along with file metadata


Table Properties: 
 - Partitioning information/indexing info


  was:
This EPIC tracks changes to the Hudi storage format.

A format change is anything that changes any bits related to:

 - *Timeline*: active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*: block structure, content, names.
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings.
 - *Table properties*: what's written to hoodie.properties.
 - *Marker files*: how would we treat these?


The following functionality should be supportable by the new format tech specs 
(at a minimum) 

Flexibility : 
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g. images, JSON, vectors, ...).
 - Easy integration of metadata for JVM and non-JVM clients.


Additional Info:
 - Support encoding of watermarks/event time fields as first-class citizens, 
for handling late-arriving data.
 - Position-based skipping of base file.
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/column family use case?
 - Support having a changeset of columns in each write, other headers.


Log : 
 - Support writing updates as deletes and inserts, instead of logging them as 
updates to the base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of a table spread across buckets/root folders.
 - Decouple table location from timeline and metadata; they can all be in 
different places.


Concurrency/Timeline: 
 - Ability to support general-purpose multi-table transactions, especially 
between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in the face of conflicts.
 - Support for long-lived instants in the timeline; break down the distinction 
between active/archived.
 - Support checking of uniqueness constraints, even in the face of two 
concurrent insert transactions.
 - Support precise time-travel queries.
 - Support time-travel writes.
 - Support schema history tracking and aid in the schema evolution 
implementation.
 - TrueTime store/support for instant times.

Metadata table :
 - Encode filegroup ID and commit time along with file metadata


Table Properties: 
 - Partitioning information/indexing info



> Format changes for Hudi 1.X release line
> ----------------------------------------
>
>                 Key: HUDI-6242
>                 URL: https://issues.apache.org/jira/browse/HUDI-6242
>             Project: Apache Hudi
>          Issue Type: Epic
>          Components: core
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>              Labels: hudi-umbrellas
>             Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> A format change is anything that changes any bits related to:
>  - *Timeline*: active or archived timeline contents, file names.
>  - *Base Files*: file format versions, any changes to any data types, file 
> footers, file names.
>  - *Log Files*: block structure, content, names.
>  - *Metadata Table*: (should we call this index table instead?) partition 
> names, number of file groups, key/value schema and metadata to MDT row 
> mappings.
>  - *Table properties*: what's written to hoodie.properties.
>  - *Marker files*: how would we treat these?
> The following functionality should be supportable by the new format tech 
> specs (at a minimum) 
> Flexibility : 
>  - Ability to mix different types of base files within a single table or even 
> a single file group (e.g. images, JSON, vectors, ...).
>  - Easy integration of metadata for JVM and non-JVM clients.
> Metafields :
>  - Should _recordkey be a UUID with special handling?
> Additional Info:
>  - Support encoding of watermarks/event time fields as first-class citizens, 
> for handling late-arriving data.
>  - Position-based skipping of base file.
>  - Additional metadata to avoid more RPCs to scan base file/log blocks.
>  - ML/column family use case?
>  - Support having a changeset of columns in each write, other headers.
> Log : 
>  - Support writing updates as deletes and inserts, instead of logging them as 
> updates to the base file.
>  - CDC format is GA.
> Table organization:
>  - Support different logical partitions on the same data
>  - Storage of a table spread across buckets/root folders.
>  - Decouple table location from timeline and metadata; they can all be in 
> different places.
> Concurrency/Timeline: 
>  - Ability to support general-purpose multi-table transactions, especially 
> between data and metadata tables.
>  - Support lockless/non-blocking transactions, where writers don't block each 
> other even in the face of conflicts.
>  - Support for long-lived instants in the timeline; break down the 
> distinction between active/archived.
>  - Support checking of uniqueness constraints, even in the face of two 
> concurrent insert transactions.
>  - Support precise time-travel queries.
>  - Support time-travel writes.
>  - Support schema history tracking and aid in the schema evolution 
> implementation.
>  - TrueTime store/support for instant times.
> Metadata table :
>  - Encode filegroup ID and commit time along with file metadata
> Table Properties: 
>  - Partitioning information/indexing info



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
