[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinoth Chandar updated HUDI-6242:
---------------------------------
    Description:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
 - *Timeline* : active or archived timeline contents, file names.
 - {*}Base Files{*}: file format versions, any changes to any data types, file footers, file names.
 - {*}Log Files{*}: Block structure, content, names.
 - {*}Metadata Table{*}: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
 - {*}Table properties{*}: What's written to hoodie.properties.
 - *Marker files* : Can be left to the writer implementation.

h2. Change summary:

The following functionality should be supportable by the new format tech specs (at a minimum)

Flexibility :
 - Ability to mix different types of base files within a single table or even a single file group (e.g. images, json, vectors ...)
 - Easy integration of metadata for JVM and non-JVM clients

Metafields :
 - Should _recordkey be UUID special handling?

Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers

Log :
 - Support writing updates as deletes and inserts, instead of logging as update to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in different places

Concurrency/Timeline:
 - Ability to support general purpose multi-table transactions, esp. between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
 - Support for long lived instants in timeline, break down distinction between active/archived
 - Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evolution implementation.
 - TrueTime store/support for instant times
 - No more separate rollback action. Make it a new state.

Metadata table :
 - Encode filegroup ID and commit time along with file metadata

Table Properties:
 - Partitioning information/indexing info

  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
 - *Timeline* : active or archived timeline contents, file names.
 - {*}Base Files{*}: file format versions, any changes to any data types, file footers, file names.
 - {*}Log Files{*}: Block structure, content, names.
 - {*}Metadata Table{*}: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
 - {*}Table properties{*}: What's written to hoodie.properties.
 - *Marker files* : Can be left to the writer implementation.

h2. Change summary:

The following functionality should be supportable by the new format tech specs (at a minimum)

Flexibility :
 - Ability to mix different types of base files within a single table or even a single file group (e.g. images, json, vectors ...)
 - Easy integration of metadata for JVM and non-JVM clients

Metafields :
 - Should _recordkey be UUID special handling?

Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers

Log :
 - Support writing updates as deletes and inserts, instead of logging as update to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in different places

Concurrency/Timeline:
 - Ability to support general purpose multi-table transactions, esp. between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
 - Support for long lived instants in timeline, break down distinction between active/archived
 - Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evolution implementation.
 - TrueTime store/support for instant times

Metadata table :
 - Encode filegroup ID and commit time along with file metadata

Table Properties:
 - Partitioning information/indexing info


> Format changes for Hudi 1.X release line
> ----------------------------------------
>
>                 Key: HUDI-6242
>                 URL: https://issues.apache.org/jira/browse/HUDI-6242
>             Project: Apache Hudi
>          Issue Type: Epic
>          Components: core
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>              Labels: hudi-umbrellas
>             Fix For: 1.0.0
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)