[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinoth Chandar updated HUDI-6242:
---------------------------------
    Description:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
 - *Timeline*: active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file footers, file names.
 - *Log Files*: Block structure, content, names.
 - *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
 - *Table properties*: What's written to hoodie.properties.

The following functionality should be supported.
 - Ability to mix different types of base files within a single table or even a single file group (e.g., images, JSON, vectors, ...)
 - Ability to support general purpose multi-table transactions, especially between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each other even in the face of conflicts.
 - Support encoding of watermarks/event time fields as a first-class citizen, for handling late arriving data.
 - Support writing updates as deletes and inserts, instead of logging as update to base file.
 - Support checking of uniqueness constraints, even in the face of two concurrent insert transactions.
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evolution implementation.
 - Easy integration of metadata for JVM and non-JVM clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position-based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in different places

  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
 - *Timeline*: active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file footers, file names.
 - *Log Files*: Block structure, content, names.
 - *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
 - *Table properties*: What's written to hoodie.properties.

The following functionality should be supported.
 - Ability to mix different types of base files within a single table or even a single file group (e.g., images, JSON, vectors, ...)
 - Ability to support general purpose multi-table transactions, especially between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each other even in the face of conflicts.
 - Support encoding of watermarks/event time fields as a first-class citizen, for handling late arriving data.
 - Pluggable implementations for write handles, timeline, metadata table.


> Format changes for Hudi 1.X release line
> ----------------------------------------
>
>                 Key: HUDI-6242
>                 URL: https://issues.apache.org/jira/browse/HUDI-6242
>             Project: Apache Hudi
>          Issue Type: Epic
>          Components: core
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>              Labels: hudi-umbrellas
>             Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything that changes any bits related to
>  - *Timeline*: active or archived timeline contents, file names.
>  - *Base Files*: file format versions, any changes to any data types, file
> footers, file names.
>  - *Log Files*: Block structure, content, names.
>  - *Metadata Table*: (should we call this index table instead?) partition
> names, number of file groups, key/value schema and metadata to MDT row
> mappings.
>  - *Table properties*: What's written to hoodie.properties.
> The following functionality should be supported.
>  - Ability to mix different types of base files within a single table or even
> a single file group (e.g., images, JSON, vectors, ...)
>  - Ability to support general purpose multi-table transactions, especially
> between data and metadata tables.
>  - Support lockless/non-blocking transactions, where writers don't block each
> other even in the face of conflicts.
>  - Support encoding of watermarks/event time fields as a first-class citizen,
> for handling late arriving data.
>  - Support writing updates as deletes and inserts, instead of logging as
> update to base file.
>  - Support checking of uniqueness constraints, even in the face of two
> concurrent insert transactions.
>  - Support precise time-travel queries
>  - Support time-travel writes.
>  - Support schema history tracking and aid in schema evolution implementation.
>  - Easy integration of metadata for JVM and non-JVM clients
>  - TrueTime store/support for instant times
>  - CDC format is GA.
>  - Support different logical partitions on the same data
>  - Position-based skipping of base file
>  - Storage of table spread across buckets/root folders
>  - Additional metadata to avoid more RPCs to scan base file/log blocks.
>  - Decouple table location from timeline, metadata. They can all be in
> different places



--
This message was sent by Atlassian Jira
(v8.20.10#820010)