[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
    Epic Colour: ghx-label-2 (was: ghx-label-8)

> Format changes for Hudi 1.X release line
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
> Issue Type: Epic
> Components: core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hudi-umbrellas
> Fix For: 1.0.0
>
> This EPIC tracks changes to the Hudi storage format.
>
> h2. Proposals
> Format change is anything that changes any bits related to
> * *Timeline* : active or archived timeline contents, file names.
> * {*}Base Files{*}: file format versions, any changes to any data types, file footers, file names.
> * {*}Log Files{*}: Block structure, content, names.
> * {*}Metadata Table{*}: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
> * {*}Table properties{*}: What's written to [hoodie.properties|http://hoodie.properties/].
> * *Marker files* : Can be left to the writer implementation.
>
> h2. Change summary:
> The following functionality should be supportable by the new format tech specs (at a minimum)
>
> Flexibility :
> * [Pending] Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
> * [Pending] Easy integration of metadata for JVM and non-jvm clients (parquet as MT format, HFile native APIs)
>
> Metafields :
> * [Resolved] Should _recordkey be uuid special handling?
> * Semantics of _hoodie_commit_time , with completion time changes.
>
> Additional Info:
> * Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
> * [Resolved] Position based skipping of base file
> * [Pending] Additional metadata to avoid more RPCs to scan base file/log blocks.
> * [Pending] ML/Column family use-case?
> * [Resolved] Support having changeset of columns in each write, other headers
>
> Log :
> * [No change needed] Support writing updates as deletes and inserts, instead of logging as update to base file.
> * [Pending] CDC format is GA.
>
> Table organization:
> * [Pending] Support different logical partitions on the same data
> * [Pending] RFC-60/Storage of table spread across buckets/root folders
> * [Pending] Decouple table location from timeline, metadata. They can all be in different places
>
> Concurrency/Timeline:
> * [Pending] Ability to support general purpose multi-table transactions, esp between data and metadata tables.
> * [Pending] Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
> * [Resolved] Support for long lived instants in timeline, break down distinction between active/archived
> * [Pending] Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
> * [Pending] Support precise time-travel queries
> * [Pending] Support time-travel writes.
> * [Pending] Support schema history tracking and aid in schema evol impl.
> * [Resolved] TrueTime store/support for instant times
> * [Pending] No more separate rollback action. make it a new state.
>
> Metadata table :
> * Encode filegroup ID and commit time along with file metadata
>
> Table Properties:
> * Partitioning information/indexing info
>
> Marker Files:
> * Write marker files for logs as well, based on new marker format.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
    Description:
This EPIC tracks changes to the Hudi storage format.

h2. Proposals
Format change is anything that changes any bits related to
* *Timeline* : active or archived timeline contents, file names.
* {*}Base Files{*}: file format versions, any changes to any data types, file footers, file names.
* {*}Log Files{*}: Block structure, content, names.
* {*}Metadata Table{*}: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
* {*}Table properties{*}: What's written to [hoodie.properties|http://hoodie.properties/].
* *Marker files* : Can be left to the writer implementation.

h2. Change summary:
The following functionality should be supportable by the new format tech specs (at a minimum)

Flexibility :
* [Pending] Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
* [Pending] Easy integration of metadata for JVM and non-jvm clients (parquet as MT format, HFile native APIs)

Metafields :
* [Resolved] Should _recordkey be uuid special handling?
* Semantics of _hoodie_commit_time , with completion time changes.

Additional Info:
* Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
* [Resolved] Position based skipping of base file
* [Pending] Additional metadata to avoid more RPCs to scan base file/log blocks.
* [Pending] ML/Column family use-case?
* [Resolved] Support having changeset of columns in each write, other headers

Log :
* [No change needed] Support writing updates as deletes and inserts, instead of logging as update to base file.
* [Pending] CDC format is GA.

Table organization:
* [Pending] Support different logical partitions on the same data
* [Pending] RFC-60/Storage of table spread across buckets/root folders
* [Pending] Decouple table location from timeline, metadata. They can all be in different places

Concurrency/Timeline:
* [Pending] Ability to support general purpose multi-table transactions, esp between data and metadata tables.
* [Pending] Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
* [Resolved] Support for long lived instants in timeline, break down distinction between active/archived
* [Pending] Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
* [Pending] Support precise time-travel queries
* [Pending] Support time-travel writes.
* [Pending] Support schema history tracking and aid in schema evol impl.
* [Resolved] TrueTime store/support for instant times
* [Pending] No more separate rollback action. make it a new state.

Metadata table :
* Encode filegroup ID and commit time along with file metadata

Table Properties:
* Partitioning information/indexing info

Marker Files:
* Write marker files for logs as well, based on new marker format.

  was:
This EPIC tracks changes to the Hudi storage format.

h2. Proposals
Format change is anything that changes any bits related to
* *Timeline* : active or archived timeline contents, file names.
* {*}Base Files{*}: file format versions, any changes to any data types, file footers, file names.
* {*}Log Files{*}: Block structure, content, names.
* {*}Metadata Table{*}: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
* {*}Table properties{*}: What's written to [hoodie.properties|http://hoodie.properties/].
* *Marker files* : Can be left to the writer implementation.

h2. Change summary:
The following functionality should be supportable by the new format tech specs (at a minimum)

Flexibility :
* [Pending] Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
* [Pending] Easy integration of metadata for JVM and non-jvm clients (parquet as MT format, HFile native APIs)

Metafields :
* [Resolved] Should _recordkey be uuid special handling?
* Semantics of _hoodie_commit_time , with completion time changes.

Additional Info:
* Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
* [Resolved] Position based skipping of base file
* [Pending] Additional metadata to avoid more RPCs to scan base file/log blocks.
* [Pending] ML/Column family use-case?
* [Resolved] Support having changeset of columns in each write, other headers

Log :
* [No change needed] Support writing updates as deletes and inserts, instead of logging as update to base file.
* [Pending] CDC format is GA.

Table organization:
* [Pending] Support different logical partitions on the
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
    Description:
This EPIC tracks changes to the Hudi storage format.

h2. Proposals
Format change is anything that changes any bits related to
* *Timeline* : active or archived timeline contents, file names.
* {*}Base Files{*}: file format versions, any changes to any data types, file footers, file names.
* {*}Log Files{*}: Block structure, content, names.
* {*}Metadata Table{*}: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
* {*}Table properties{*}: What's written to [hoodie.properties|http://hoodie.properties/].
* *Marker files* : Can be left to the writer implementation.

h2. Change summary:
The following functionality should be supportable by the new format tech specs (at a minimum)

Flexibility :
* [Pending] Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
* [Pending] Easy integration of metadata for JVM and non-jvm clients (parquet as MT format, HFile native APIs)

Metafields :
* [Resolved] Should _recordkey be uuid special handling?
* Semantics of _hoodie_commit_time , with completion time changes.

Additional Info:
* Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
* [Resolved] Position based skipping of base file
* [Pending] Additional metadata to avoid more RPCs to scan base file/log blocks.
* [Pending] ML/Column family use-case?
* [Resolved] Support having changeset of columns in each write, other headers

Log :
* [No change needed] Support writing updates as deletes and inserts, instead of logging as update to base file.
* [Pending] CDC format is GA.

Table organization:
* [Pending] Support different logical partitions on the same data
* [Pending] RFC-60/Storage of table spread across buckets/root folders
* [Pending] Decouple table location from timeline, metadata. They can all be in different places

Concurrency/Timeline:
* [Pending] Ability to support general purpose multi-table transactions, esp between data and metadata tables.
* [Pending] Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
* [Resolved] Support for long lived instants in timeline, break down distinction between active/archived
* [Pending] Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
* [Pending] Support precise time-travel queries
* [Pending] Support time-travel writes.
* [Pending] Support schema history tracking and aid in schema evol impl.
* [Resolved] TrueTime store/support for instant times
* [Pending] No more separate rollback action. make it a new state.

Metadata table :
* Encode filegroup ID and commit time along with file metadata

Table Properties:
* Partitioning information/indexing info

Marker Files:
* Write marker files for logs as well, based on new marker format.

  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- {*}Base Files{*}: file format versions, any changes to any data types, file footers, file names.
- {*}Log Files{*}: Block structure, content, names.
- {*}Metadata Table{*}: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- {*}Table properties{*}: What's written to hoodie.properties.
- *Marker files* : Can be left to the writer implementation.

h2. Change summary:
The following functionality should be supportable by the new format tech specs (at a minimum)

Flexibility :
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Easy integration of metadata for JVM and non-jvm clients

Metafields :
- Should _recordkey be uuid special handling?

Additional Info:
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Position based skipping of base file
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/Column family use-case?
- Support having changeset of columns in each write, other headers

Log :
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- CDC format is GA.

Table organization:
- Support different logical partitions on the same data
- Storage of table spread across buckets/root folders
- Decouple table location from timeline, metadata. They can all be in different places

Concurrency/Timeline:
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Suppo
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit updated HUDI-6242:
------------------------------
    Description:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- {*}Base Files{*}: file format versions, any changes to any data types, file footers, file names.
- {*}Log Files{*}: Block structure, content, names.
- {*}Metadata Table{*}: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- {*}Table properties{*}: What's written to hoodie.properties.
- *Marker files* : Can be left to the writer implementation.

h2. Change summary:
The following functionality should be supportable by the new format tech specs (at a minimum)

Flexibility :
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Easy integration of metadata for JVM and non-jvm clients

Metafields :
- Should _recordkey be uuid special handling?

Additional Info:
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Position based skipping of base file
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/Column family use-case?
- Support having changeset of columns in each write, other headers

Log :
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- CDC format is GA.

Table organization:
- Support different logical partitions on the same data
- Storage of table spread across buckets/root folders
- Decouple table location from timeline, metadata. They can all be in different places

Concurrency/Timeline:
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
- Support for long lived instants in timeline, break down distinction between active/archived
- Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- TrueTime store/support for instant times
- No more separate rollback action. make it a new state.

Metadata table :
- Encode filegroup ID and commit time along with file metadata

Table Properties:
- Partitioning information/indexing info

  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- {*}Base Files{*}: file format versions, any changes to any data types, file footers, file names.
- {*}Log Files{*}: Block structure, content, names.
- {*}Metadata Table{*}: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- {*}Table properties{*}: What's written to hoodie.properties.
- *Marker files* : Can be left to the writer implementation.

h2. Change summary:
The following functionality should be supportable by the new format tech specs (at a minimum)

Flexibility :
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Easy integration of metadata for JVM and non-jvm clients

Metafields :
- Should _recordkey be uuid special handling?

Additional Info:
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Position based skipping of base file
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/Column family use-case?
- Support having changeset of columns in each write, other headers

Log :
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- CDC format is GA.

Table organization:
- Support different logical partitions on the same data
- Storage of table spread across buckets/root folders
- Decouple table location from timeline, metadata. They can all be in different places

Concurrency/Timeline:
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
- Support for long lived instants in timeline, break down distinction between active/archived
- Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- TrueTime
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
    Description:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- {*}Base Files{*}: file format versions, any changes to any data types, file footers, file names.
- {*}Log Files{*}: Block structure, content, names.
- {*}Metadata Table{*}: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- {*}Table properties{*}: What's written to hoodie.properties.
- *Marker files* : Can be left to the writer implementation.

h2. Change summary:
The following functionality should be supportable by the new format tech specs (at a minimum)

Flexibility :
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Easy integration of metadata for JVM and non-jvm clients

Metafields :
- Should _recordkey be uuid special handling?

Additional Info:
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Position based skipping of base file
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/Column family use-case?
- Support having changeset of columns in each write, other headers

Log :
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- CDC format is GA.

Table organization:
- Support different logical partitions on the same data
- Storage of table spread across buckets/root folders
- Decouple table location from timeline, metadata. They can all be in different places

Concurrency/Timeline:
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
- Support for long lived instants in timeline, break down distinction between active/archived
- Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- TrueTime store/support for instant times
- No more separate rollback action. make it a new state.

Metadata table :
- Encode filegroup ID and commit time along with file metadata

Table Properties:
- Partitioning information/indexing info

  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- {*}Base Files{*}: file format versions, any changes to any data types, file footers, file names.
- {*}Log Files{*}: Block structure, content, names.
- {*}Metadata Table{*}: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- {*}Table properties{*}: What's written to hoodie.properties.
- *Marker files* : Can be left to the writer implementation.

h2. Change summary:
The following functionality should be supportable by the new format tech specs (at a minimum)

Flexibility :
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Easy integration of metadata for JVM and non-jvm clients

Metafields :
- Should _recordkey be uuid special handling?

Additional Info:
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Position based skipping of base file
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/Column family use-case?
- Support having changeset of columns in each write, other headers

Log :
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- CDC format is GA.

Table organization:
- Support different logical partitions on the same data
- Storage of table spread across buckets/root folders
- Decouple table location from timeline, metadata. They can all be in different places

Concurrency/Timeline:
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
- Support for long lived instants in timeline, break down distinction between active/archived
- Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in sc
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
    Description:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- {*}Base Files{*}: file format versions, any changes to any data types, file footers, file names.
- {*}Log Files{*}: Block structure, content, names.
- {*}Metadata Table{*}: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- {*}Table properties{*}: What's written to hoodie.properties.
- *Marker files* : Can be left to the writer implementation.

h2. Change summary:
The following functionality should be supportable by the new format tech specs (at a minimum)

Flexibility :
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Easy integration of metadata for JVM and non-jvm clients

Metafields :
- Should _recordkey be uuid special handling?

Additional Info:
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Position based skipping of base file
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/Column family use-case?
- Support having changeset of columns in each write, other headers

Log :
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- CDC format is GA.

Table organization:
- Support different logical partitions on the same data
- Storage of table spread across buckets/root folders
- Decouple table location from timeline, metadata. They can all be in different places

Concurrency/Timeline:
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
- Support for long lived instants in timeline, break down distinction between active/archived
- Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- TrueTime store/support for instant times

Metadata table :
- Encode filegroup ID and commit time along with file metadata

Table Properties:
- Partitioning information/indexing info

  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- *Base Files*: file format versions, any changes to any data types, file footers, file names.
- *Log Files*: Block structure, content, names.
- *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : Can be left to the writer implementation.

The following functionality should be supportable by the new format tech specs (at a minimum)

Flexibility :
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Easy integration of metadata for JVM and non-jvm clients

Metafields :
- Should _recordkey be uuid special handling?

Additional Info:
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Position based skipping of base file
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/Column family use-case?
- Support having changeset of columns in each write, other headers

Log :
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- CDC format is GA.

Table organization:
- Support different logical partitions on the same data
- Storage of table spread across buckets/root folders
- Decouple table location from timeline, metadata. They can all be in different places

Concurrency/Timeline:
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
- Support for long lived instants in timeline, break down distinction between active/archived
- Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- TrueTime store/support for instant times

Metadata table :
- Encode filegroup ID a
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
    Description:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- *Base Files*: file format versions, any changes to any data types, file footers, file names.
- *Log Files*: Block structure, content, names.
- *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : Can be left to the writer implementation.

The following functionality should be supportable by the new format tech specs (at a minimum)

Flexibility :
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Easy integration of metadata for JVM and non-jvm clients

Metafields :
- Should _recordkey be uuid special handling?

Additional Info:
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Position based skipping of base file
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/Column family use-case?
- Support having changeset of columns in each write, other headers

Log :
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- CDC format is GA.

Table organization:
- Support different logical partitions on the same data
- Storage of table spread across buckets/root folders
- Decouple table location from timeline, metadata. They can all be in different places

Concurrency/Timeline:
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
- Support for long lived instants in timeline, break down distinction between active/archived
- Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- TrueTime store/support for instant times

Metadata table :
- Encode filegroup ID and commit time along with file metadata

Table Properties:
- Partitioning information/indexing info

  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- *Base Files*: file format versions, any changes to any data types, file footers, file names.
- *Log Files*: Block structure, content, names.
- *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?

The following functionality should be supportable by the new format tech specs (at a minimum)

Flexibility :
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Easy integration of metadata for JVM and non-jvm clients

Metafields :
- Should _recordkey be uuid special handling?

Additional Info:
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Position based skipping of base file
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/Column family use-case?
- Support having changeset of columns in each write, other headers

Log :
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- CDC format is GA.

Table organization:
- Support different logical partitions on the same data
- Storage of table spread across buckets/root folders
- Decouple table location from timeline, metadata. They can all be in different places

Concurrency/Timeline:
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
- Support for long lived instants in timeline, break down distinction between active/archived
- Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- TrueTime store/support for instant times

Metadata table :
- Encode filegroup ID and commit time along with file metadata

Table Properti
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
Due Date: 13/Sep/23 (was: 30/Jun/23)

> Format changes for Hudi 1.X release line
>
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
> Issue Type: Epic
> Components: core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything that changes any bits related to
> - *Timeline* : active or archived timeline contents, file names.
> - *Base Files*: file format versions, any changes to any data types, file
> footers, file names.
> - *Log Files*: Block structure, content, names.
> - *Metadata Table*: (should we call this index table instead?) partition
> names, number of file groups, key/value schema and metadata to MDT row
> mappings.
> - *Table properties*: What's written to hoodie.properties.
> - *Marker files* : how would we treat these?
> The following functionality should be supportable by the new format tech
> specs (at a minimum)
> Flexibility :
> - Ability to mix different types of base files within a single table or even
> a single file group (e.g images, json, vectors ...)
> - Easy integration of metadata for JVM and non-jvm clients
> Metafields :
> - Should _recordkey be uuid special handling?
> Additional Info:
> - Support encoding of watermarks/event time fields as first class citizen,
> for handling late arriving data.
> - Position based skipping of base file
> - Additional metadata to avoid more RPCs to scan base file/log blocks.
> - ML/Column family use-case?
> - Support having changeset of columns in each write, other headers
> Log :
> - Support writing updates as deletes and inserts, instead of logging as
> update to base file.
> - CDC format is GA.
> Table organization:
> - Support different logical partitions on the same data
> - Storage of table spread across buckets/root folders
> - Decouple table location from timeline, metadata. They can all be in
> different places
> Concurrency/Timeline:
> - Ability to support general purpose multi-table transactions, esp between
> data and metadata tables.
> - Support lockless/non-blocking transactions, where writers don't block each
> other even in face of conflicts.
> - Support for long lived instants in timeline, break down distinction
> between active/archived
> - Support checking of uniqueness constraints, even in face of two concurrent
> insert transactions.
> - Support precise time-travel queries
> - Support time-travel writes.
> - Support schema history tracking and aid in schema evol impl.
> - TrueTime store/support for instant times
> Metadata table :
> - Encode filegroup ID and commit time along with file metadata
> Table Properties:
> - Partitioning information/indexing info

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
Description:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- *Base Files*: file format versions, any changes to any data types, file footers, file names.
- *Log Files*: Block structure, content, names.
- *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?

The following functionality should be supportable by the new format tech specs (at a minimum)

Flexibility :
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Easy integration of metadata for JVM and non-jvm clients

Metafields :
- Should _recordkey be uuid special handling?

Additional Info:
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Position based skipping of base file
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/Column family use-case?
- Support having changeset of columns in each write, other headers

Log :
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- CDC format is GA.

Table organization:
- Support different logical partitions on the same data
- Storage of table spread across buckets/root folders
- Decouple table location from timeline, metadata. They can all be in different places

Concurrency/Timeline:
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
- Support for long lived instants in timeline, break down distinction between active/archived
- Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- TrueTime store/support for instant times

Metadata table :
- Encode filegroup ID and commit time along with file metadata

Table Properties:
- Partitioning information/indexing info

was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- *Base Files*: file format versions, any changes to any data types, file footers, file names.
- *Log Files*: Block structure, content, names.
- *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?

The following functionality should be supportable by the new format tech specs (at a minimum)

Flexibility :
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Easy integration of metadata for JVM and non-jvm clients

Additional Info:
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Position based skipping of base file
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/Column family use-case?
- Support having changeset of columns in each write, other headers

Log :
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- CDC format is GA.

Table organization:
- Support different logical partitions on the same data
- Storage of table spread across buckets/root folders
- Decouple table location from timeline, metadata. They can all be in different places

Concurrency/Timeline:
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
- Support for long lived instants in timeline, break down distinction between active/archived
- Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- TrueTime store/support for instant times

Metadata table :
- Encode filegroup ID and commit time along with file metadata

Table Properties:
- Partitioning information/indexing info

> Format changes for Hudi
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
Description:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- *Base Files*: file format versions, any changes to any data types, file footers, file names.
- *Log Files*: Block structure, content, names.
- *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?

The following functionality should be supportable by the new format tech specs (at a minimum)

Flexibility :
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Easy integration of metadata for JVM and non-jvm clients

Additional Info:
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Position based skipping of base file
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/Column family use-case?
- Support having changeset of columns in each write, other headers

Log :
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- CDC format is GA.

Table organization:
- Support different logical partitions on the same data
- Storage of table spread across buckets/root folders
- Decouple table location from timeline, metadata. They can all be in different places

Concurrency/Timeline:
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
- Support for long lived instants in timeline, break down distinction between active/archived
- Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- TrueTime store/support for instant times

Metadata table :
- Encode filegroup ID and commit time along with file metadata

Table Properties:
- Partitioning information/indexing info

was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- *Base Files*: file format versions, any changes to any data types, file footers, file names.
- *Log Files*: Block structure, content, names.
- *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?

The following functionality should be supportable by the new format tech specs (at a minimum)

Flexibility :
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Easy integration of metadata for JVM and non-jvm clients

Additional Info:
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Position based skipping of base file
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/Column family use-case?
- Support having changeset of columns in each write, other headers

Log :
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- CDC format is GA.

Table organization:
- Support different logical partitions on the same data
- Storage of table spread across buckets/root folders
- Decouple table location from timeline, metadata. They can all be in different places

Concurrency/Timeline:
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
- Support for long lived instants in timeline, break down distinction between active/archived
- Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- TrueTime store/support for instant times

> Format changes for Hudi 1.X release line
>
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
>
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
Description:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- *Base Files*: file format versions, any changes to any data types, file footers, file names.
- *Log Files*: Block structure, content, names.
- *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?

The following functionality should be supportable by the new format tech specs (at a minimum)

Flexibility :
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Easy integration of metadata for JVM and non-jvm clients

Additional Info:
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Position based skipping of base file
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/Column family use-case?
- Support having changeset of columns in each write, other headers

Log :
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- CDC format is GA.

Table organization:
- Support different logical partitions on the same data
- Storage of table spread across buckets/root folders
- Decouple table location from timeline, metadata. They can all be in different places

Concurrency/Timeline:
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
- Support for long lived instants in timeline, break down distinction between active/archived
- Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- TrueTime store/support for instant times

was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- *Base Files*: file format versions, any changes to any data types, file footers, file names.
- *Log Files*: Block structure, content, names.
- *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?

The following functionality should be met (at a minimum)

Flexibility :
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Easy integration of metadata for JVM and non-jvm clients

Additional Info:
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Position based skipping of base file
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/Column family use-case?
- Support having changeset of columns in each write, other headers

Log :
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- CDC format is GA.

Table organization:
- Support different logical partitions on the same data
- Storage of table spread across buckets/root folders
- Decouple table location from timeline, metadata. They can all be in different places

Concurrency/Timeline:
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
- Support for long lived instants in timeline, break down distinction between active/archived
- Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- TrueTime store/support for instant times

> Format changes for Hudi 1.X release line
>
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
> Issue Type: Epic
> Components: core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hu
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
Description:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- *Base Files*: file format versions, any changes to any data types, file footers, file names.
- *Log Files*: Block structure, content, names.
- *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?

The following functionality should be met (at a minimum)

Flexibility :
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Easy integration of metadata for JVM and non-jvm clients

Additional Info:
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Position based skipping of base file
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/Column family use-case?
- Support having changeset of columns in each write, other headers

Log :
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- CDC format is GA.

Table organization:
- Support different logical partitions on the same data
- Storage of table spread across buckets/root folders
- Decouple table location from timeline, metadata. They can all be in different places

Concurrency/Timeline:
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
- Support for long lived instants in timeline, break down distinction between active/archived
- Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- TrueTime store/support for instant times

was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- *Base Files*: file format versions, any changes to any data types, file footers, file names.
- *Log Files*: Block structure, content, names.
- *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?

The following functionality should be met (at a minimum)
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
- Support for long lived instants in timeline, break down distinction between active/archived
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- Easy integration of metadata for JVM and non-jvm clients
- TrueTime store/support for instant times
- CDC format is GA.
- Support different logical partitions on the same data
- Position based skipping of base file
- Storage of table spread across buckets/root folders
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- Decouple table location from timeline, metadata. They can all be in different places
- Support having changeset of columns in each write, other headers
- ML/Column family use-case

> Format changes for Hudi 1.X release line
>
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
> Issue Type: Epic
> Components: core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
Description:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- *Base Files*: file format versions, any changes to any data types, file footers, file names.
- *Log Files*: Block structure, content, names.
- *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?

The following functionality should be met (at a minimum)
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
- Support for long lived instants in timeline, break down distinction between active/archived
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- Easy integration of metadata for JVM and non-jvm clients
- TrueTime store/support for instant times
- CDC format is GA.
- Support different logical partitions on the same data
- Position based skipping of base file
- Storage of table spread across buckets/root folders
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- Decouple table location from timeline, metadata. They can all be in different places
- Support having changeset of columns in each write, other headers
- ML/Column family use-case

was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- *Base Files*: file format versions, any changes to any data types, file footers, file names.
- *Log Files*: Block structure, content, names.
- *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?

The following functionality should be met (at a minimum)
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- Easy integration of metadata for JVM and non-jvm clients
- TrueTime store/support for instant times
- CDC format is GA.
- Support different logical partitions on the same data
- Position based skipping of base file
- Storage of table spread across buckets/root folders
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- Decouple table location from timeline, metadata. They can all be in different places
- Support having changeset of columns in each write, other headers
- ML/Column family use-case

> Format changes for Hudi 1.X release line
>
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
> Issue Type: Epic
> Components: core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything that changes any bits related to
> - *Timeline* : active or archived timeline contents, file names.
> - *Base Files*: file format versions, any changes to any data types, file
> foot
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
Description:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- *Base Files*: file format versions, any changes to any data types, file footers, file names.
- *Log Files*: Block structure, content, names.
- *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?

The following functionality should be met (at a minimum)
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- Easy integration of metadata for JVM and non-jvm clients
- TrueTime store/support for instant times
- CDC format is GA.
- Support different logical partitions on the same data
- Position based skipping of base file
- Storage of table spread across buckets/root folders
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- Decouple table location from timeline, metadata. They can all be in different places
- Support having changeset of columns in each write, other headers
- ML/Column family use-case

was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- *Base Files*: file format versions, any changes to any data types, file footers, file names.
- *Log Files*: Block structure, content, names.
- *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?

The following functionality should be met (at a minimum)
- Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
- Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
- Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
- Support writing updates as deletes and inserts, instead of logging as update to base file.
- Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- Easy integration of metadata for JVM and non-jvm clients
- TrueTime store/support for instant times
- CDC format is GA.
- Support different logical partitions on the same data
- Position based skipping of base file
- Storage of table spread across buckets/root folders
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- Decouple table location from timeline, metadata. They can all be in different places
- Support having changeset of columns in each write, other headers

> Format changes for Hudi 1.X release line
>
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
> Issue Type: Epic
> Components: core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything that changes any bits related to
> - *Timeline* : active or archived timeline contents, file names.
> - *Base Files*: file format versions, any changes to any data types, file
> footers, file names.
> - *Log Files*: Block structure, content, names.
> - *Metadata Table*: (should we call this index table
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
    Description:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file footers, file names.
 - *Log Files*: Block structure, content, names.
 - *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
 - *Table properties*: What's written to hoodie.properties.
 - *Marker files* : how would we treat these?

The following functionality should be met (at a minimum)
 - Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
 - Ability to support general purpose multi-table transactions, esp between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
 - Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
 - Support writing updates as deletes and inserts, instead of logging as update to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in different places
 - Support having changeset of columns in each write, other headers

  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file footers, file names.
 - *Log Files*: Block structure, content, names.
 - *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
 - *Table properties*: What's written to hoodie.properties.

The following functionality should be met (at a minimum)
 - Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
 - Ability to support general purpose multi-table transactions, esp between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
 - Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
 - Support writing updates as deletes and inserts, instead of logging as update to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in different places

> Format changes for Hudi 1.X release line
>
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
> Issue Type: Epic
> Components: core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything that changes any bits related to
> - *Timeline* : active or archived timeline contents, file names.
> - *Base Files*: file format versions, any changes to any data types, file
> footers, file names.
> - *Log Files*: Block structure, content, names.
> - *Metadata Table*: (should we call this index table instead?) partition
> names, number of file groups, key/value schema and metadata to MDT row
> mappings.
> - *Table properties*: What's wri
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
    Description:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file footers, file names.
 - *Log Files*: Block structure, content, names.
 - *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
 - *Table properties*: What's written to hoodie.properties.

The following functionality should be met (at a minimum)
 - Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
 - Ability to support general purpose multi-table transactions, esp between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
 - Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
 - Support writing updates as deletes and inserts, instead of logging as update to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in different places

  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file footers, file names.
 - *Log Files*: Block structure, content, names.
 - *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
 - *Table properties*: What's written to hoodie.properties.

The following functionality should be met (at a minimum)
 - Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
 - Ability to support general purpose multi-table transactions, esp between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
 - Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
 - Support writing updates as deletes and inserts, instead of logging as update to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in different places
 -

> Format changes for Hudi 1.X release line
>
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
> Issue Type: Epic
> Components: core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything that changes any bits related to
> - *Timeline* : active or archived timeline contents, file names.
> - *Base Files*: file format versions, any changes to any data types, file
> footers, file names.
> - *Log Files*: Block structure, content, names.
> - *Metadata Table*: (should we call this index table instead?) partition
> names, number of file groups, key/value schema and metadata to MDT row
> mappings.
> - *Table properties*: What's written to hoodie.properties.
> The following functionality should be met (at a minimum)
> - Ability to mix dif
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
    Description:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file footers, file names.
 - *Log Files*: Block structure, content, names.
 - *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
 - *Table properties*: What's written to hoodie.properties.

The following functionality should be met.
 - Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
 - Ability to support general purpose multi-table transactions, esp between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
 - Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
 - Support writing updates as deletes and inserts, instead of logging as update to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in different places
 -

  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file footers, file names.
 - *Log Files*: Block structure, content, names.
 - *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
 - *Table properties*: What's written to hoodie.properties.

The following functionality should be met.
 - Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
 - Ability to support general purpose multi-table transactions, esp between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
 - Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
 - Pluggable implementations for write handles, timeline, metadata table.

> Format changes for Hudi 1.X release line
>
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
> Issue Type: Epic
> Components: core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything that changes any bits related to
> - *Timeline* : active or archived timeline contents, file names.
> - *Base Files*: file format versions, any changes to any data types, file
> footers, file names.
> - *Log Files*: Block structure, content, names.
> - *Metadata Table*: (should we call this index table instead?) partition
> names, number of file groups, key/value schema and metadata to MDT row
> mappings.
> - *Table properties*: What's written to hoodie.properties.
> The following functionality should be met.
> - Ability to mix different types of base files within a single table or even
> a single file group (e.g images, json, vectors ...)
> - Ability to support general purpose multi-table transactions, esp between
> data and metadata tables.
> - Support lockless/non-blocking transactions, where writers don't block each
> other even in face of conflicts.
> - Support encoding of watermarks/event time fields as first class citizen,
> for handling late arriving data.
> - Support writing updates as deletes and inserts, instead of logging as
> update to base file.
> - Support checking of uniqueness constraints, even in face of two concurrent
> insert transactions.
> - Support precise time-travel queries
> - Support time-travel writes.
> -
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
    Description:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file footers, file names.
 - *Log Files*: Block structure, content, names.
 - *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
 - *Table properties*: What's written to hoodie.properties.

The following functionality should be met (at a minimum)
 - Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
 - Ability to support general purpose multi-table transactions, esp between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
 - Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
 - Support writing updates as deletes and inserts, instead of logging as update to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in different places
 -

  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file footers, file names.
 - *Log Files*: Block structure, content, names.
 - *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
 - *Table properties*: What's written to hoodie.properties.

The following functionality should be met.
 - Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
 - Ability to support general purpose multi-table transactions, esp between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
 - Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
 - Support writing updates as deletes and inserts, instead of logging as update to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent insert transactions.
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in different places
 -

> Format changes for Hudi 1.X release line
>
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
> Issue Type: Epic
> Components: core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything that changes any bits related to
> - *Timeline* : active or archived timeline contents, file names.
> - *Base Files*: file format versions, any changes to any data types, file
> footers, file names.
> - *Log Files*: Block structure, content, names.
> - *Metadata Table*: (should we call this index table instead?) partition
> names, number of file groups, key/value schema and metadata to MDT row
> mappings.
> - *Table properties*: What's written to hoodie.properties.
> The following functionality should be met (at a minimum)
> - Ability to mix different typ
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
    Description:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to
 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file footers, file names.
 - *Log Files*: Block structure, content, names.
 - *Metadata Table*: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
 - *Table properties*: What's written to hoodie.properties.

The following functionality should be met.
 - Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...)
 - Ability to support general purpose multi-table transactions, esp between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts.
 - Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data.
 - Pluggable implementations for write handles, timeline, metadata table.

> Format changes for Hudi 1.X release line
>
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
> Issue Type: Epic
> Components: core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything that changes any bits related to
> - *Timeline* : active or archived timeline contents, file names.
> - *Base Files*: file format versions, any changes to any data types, file
> footers, file names.
> - *Log Files*: Block structure, content, names.
> - *Metadata Table*: (should we call this index table instead?) partition
> names, number of file groups, key/value schema and metadata to MDT row
> mappings.
> - *Table properties*: What's written to hoodie.properties.
> The following functionality should be met.
> - Ability to mix different types of base files within a single table or even
> a single file group (e.g images, json, vectors ...)
> - Ability to support general purpose multi-table transactions, esp between
> data and metadata tables.
> - Support lockless/non-blocking transactions, where writers don't block each
> other even in face of conflicts.
> - Support encoding of watermarks/event time fields as first class citizen,
> for handling late arriving data.
> - Pluggable implementations for write handles, timeline, metadata table.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6242:
---------------------------------
    Labels: hudi-umbrellas (was: )

> Format changes for Hudi 1.X release line
>
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
> Issue Type: Epic
> Components: core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>