[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2024-03-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Epic Colour: ghx-label-2  (was: ghx-label-8)

> Format changes for Hudi 1.X release line
> 
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> h2. Proposals
> Format change is anything that changes any bits related to
>   * *Timeline* : active or archived timeline contents, file names.
>  * {*}Base Files{*}: file format versions, any changes to any data types, 
> file footers, file names.
>  * {*}Log Files{*}: Block structure, content, names.
>  * {*}Metadata Table{*}: (should we call this index table instead?) partition 
> names, number of file groups, key/value schema and metadata to MDT row 
> mappings.
>  * {*}Table properties{*}: What's written to 
> [hoodie.properties|http://hoodie.properties/].
>  * *Marker files* : Can be left to the writer implementation.
> h2. Change summary:
> The following functionality should be supportable by the new format tech 
> specs (at a minimum)
> Flexibility : 
>  * [Pending] Ability to mix different types of base files within a single 
> table or even a single file group (e.g images, json, vectors ...)
>  * [Pending] Easy integration of metadata for JVM and non-jvm clients 
> (parquet as MT format, HFile native APIs)
> Metafields : 
>  * [Resolved] Should _recordkey be uuid special handling?
>  * Semantics of _hoodie_commit_time , with completion time changes.
> Additional Info: 
>  * Support encoding of watermarks/event time fields as first class citizen, 
> for handling late arriving data.
>  * [Resolved] Position based skipping of base file
>  * [Pending] Additional metadata to avoid more RPCs to scan base file/log 
> blocks.
>  * [Pending] ML/Column family use-case?
>  * [Resolved] Support having changeset of columns in each write, other headers
> Log : 
>  * [No change needed] Support writing updates as deletes and inserts, instead 
> of logging as update to base file.
>  * [Pending] CDC format is GA.
> Table organization: 
>  * [Pending] Support different logical partitions on the same data
>  * [Pending] RFC-60/Storage of table spread across buckets/root folders
>  * [Pending] Decouple table location from timeline, metadata. They can all be 
> in different places
> Concurrency/Timeline: 
>   * [Pending] Ability to support general purpose multi-table transactions, 
> esp between data and metadata tables.
>  * [Pending] Support lockless/non-blocking transactions, where writers don't 
> block each other even in face of conflicts.
>  * [Resolved] Support for long lived instants in timeline, break down 
> distinction between active/archived
>  * [Pending] Support checking of uniqueness constraints, even in face of two 
> concurrent insert transactions.
>  * [Pending] Support precise time-travel queries
>  * [Pending] Support time-travel writes.
>  * [Pending] Support schema history tracking and aid in schema evol impl.
>  * [Resolved] TrueTime store/support for instant times
>  * [Pending] No more separate rollback action. make it a new state.
> Metadata table :
>   * Encode filegroup ID and commit time along with file metadata
> Table Properties:
>   * Partitioning information/indexing info
> Marker Files:
>   * Write marker files for logs as well, based on new marker format.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-09-05 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.
h2. Proposals

Format change is anything that changes any bits related to

  * *Timeline* : active or archived timeline contents, file names.
 * {*}Base Files{*}: file format versions, any changes to any data types, file 
footers, file names.
 * {*}Log Files{*}: Block structure, content, names.
 * {*}Metadata Table{*}: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row mappings.
 * {*}Table properties{*}: What's written to 
[hoodie.properties|http://hoodie.properties/].
 * *Marker files* : Can be left to the writer implementation.

h2. Change summary:

The following functionality should be supportable by the new format tech specs 
(at a minimum)
Flexibility : 
 * [Pending] Ability to mix different types of base files within a single table 
or even a single file group (e.g images, json, vectors ...)
 * [Pending] Easy integration of metadata for JVM and non-jvm clients (parquet 
as MT format, HFile native APIs)

Metafields : 
 * [Resolved] Should _recordkey be uuid special handling?
 * Semantics of _hoodie_commit_time , with completion time changes.

Additional Info: 
 * Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data.
 * [Resolved] Position based skipping of base file
 * [Pending] Additional metadata to avoid more RPCs to scan base file/log 
blocks.
 * [Pending] ML/Column family use-case?
 * [Resolved] Support having changeset of columns in each write, other headers

Log : 
 * [No change needed] Support writing updates as deletes and inserts, instead 
of logging as update to base file.
 * [Pending] CDC format is GA.

Table organization: 

 * [Pending] Support different logical partitions on the same data
 * [Pending] RFC-60/Storage of table spread across buckets/root folders
 * [Pending] Decouple table location from timeline, metadata. They can all be 
in different places

Concurrency/Timeline: 

  * [Pending] Ability to support general purpose multi-table transactions, esp 
between data and metadata tables.
 * [Pending] Support lockless/non-blocking transactions, where writers don't 
block each other even in face of conflicts.
 * [Resolved] Support for long lived instants in timeline, break down 
distinction between active/archived
 * [Pending] Support checking of uniqueness constraints, even in face of two 
concurrent insert transactions.
 * [Pending] Support precise time-travel queries
 * [Pending] Support time-travel writes.
 * [Pending] Support schema history tracking and aid in schema evol impl.
 * [Resolved] TrueTime store/support for instant times
 * [Pending] No more separate rollback action. make it a new state.

Metadata table :

  * Encode filegroup ID and commit time along with file metadata

Table Properties:

  * Partitioning information/indexing info

Marker Files:

  * Write marker files for logs as well, based on new marker format.

  was:
This EPIC tracks changes to the Hudi storage format.
h2. Proposals
Format change is anything that changes any bits related to * *Timeline* : 
active or archived timeline contents, file names.
 * {*}Base Files{*}: file format versions, any changes to any data types, file 
footers, file names.
 * {*}Log Files{*}: Block structure, content, names.
 * {*}Metadata Table{*}: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row mappings.
 * {*}Table properties{*}: What's written to 
[hoodie.properties|http://hoodie.properties/].
 * *Marker files* : Can be left to the writer implementation.

h2. Change summary:
The following functionality should be supportable by the new format tech specs 
(at a minimum)
Flexibility : * [Pending] Ability to mix different types of base files within a 
single table or even a single file group (e.g images, json, vectors ...)
 * [Pending] Easy integration of metadata for JVM and non-jvm clients (parquet 
as MT format, HFile native APIs)

Metafields : * [Resolved] Should _recordkey be uuid special handling?
 * Semantics of _hoodie_commit_time , with completion time changes.

Additional Info: * Support encoding of watermarks/event time fields as first 
class citizen, for handling late arriving data.
 * [Resolved]  Position based skipping of base file
 * [Pending] Additional metadata to avoid more RPCs to scan base file/log 
blocks.
 * [Pending] ML/Column family use-case?
 * [Resolved] Support having changeset of columns in each write, other headers

Log : * [No change needed] Support writing updates as deletes and inserts, 
instead of logging as update to base file.
 * [Pending] CDC format is GA.

Table organization: * [Pending] Support different logical partitions on the 

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-09-05 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.
h2. Proposals
Format change is anything that changes any bits related to * *Timeline* : 
active or archived timeline contents, file names.
 * {*}Base Files{*}: file format versions, any changes to any data types, file 
footers, file names.
 * {*}Log Files{*}: Block structure, content, names.
 * {*}Metadata Table{*}: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row mappings.
 * {*}Table properties{*}: What's written to 
[hoodie.properties|http://hoodie.properties/].
 * *Marker files* : Can be left to the writer implementation.

h2. Change summary:
The following functionality should be supportable by the new format tech specs 
(at a minimum)
Flexibility : * [Pending] Ability to mix different types of base files within a 
single table or even a single file group (e.g images, json, vectors ...)
 * [Pending] Easy integration of metadata for JVM and non-jvm clients (parquet 
as MT format, HFile native APIs)

Metafields : * [Resolved] Should _recordkey be uuid special handling?
 * Semantics of _hoodie_commit_time , with completion time changes.

Additional Info: * Support encoding of watermarks/event time fields as first 
class citizen, for handling late arriving data.
 * [Resolved]  Position based skipping of base file
 * [Pending] Additional metadata to avoid more RPCs to scan base file/log 
blocks.
 * [Pending] ML/Column family use-case?
 * [Resolved] Support having changeset of columns in each write, other headers

Log : * [No change needed] Support writing updates as deletes and inserts, 
instead of logging as update to base file.
 * [Pending] CDC format is GA.

Table organization: * [Pending] Support different logical partitions on the 
same data
 * [Pending] RFC-60/Storage of table spread across buckets/root folders
 * [Pending] Decouple table location from timeline, metadata. They can all be 
in different places

Concurrency/Timeline: * [Pending] Ability to support general purpose 
multi-table transactions, esp between data and metadata tables.
 * [Pending] Support lockless/non-blocking transactions, where writers don't 
block each other even in face of conflicts.
 * [Resolved] Support for long lived instants in timeline, break down 
distinction between active/archived
 * [Pending] Support checking of uniqueness constraints, even in face of two 
concurrent insert transactions.
 * [Pending] Support precise time-travel queries
 * [Pending] Support time-travel writes.
 * [Pending] Support schema history tracking and aid in schema evol impl.
 * [Resolved] TrueTime store/support for instant times
 * [Pending] No more separate rollback action. make it a new state.

Metadata table : * Encode filegroup ID and commit time along with file metadata

Table Properties: * Partitioning information/indexing info

Marker Files: * Write marker files for logs as well, based on new marker format.

  was:
This EPIC tracks changes to the Hudi storage format.
Format change is anything that changes any bits related to
 - *Timeline* : active or archived timeline contents, file names.
 - {*}Base Files{*}: file format versions, any changes to any data types, file 
footers, file names.
 - {*}Log Files{*}: Block structure, content, names.
 - {*}Metadata Table{*}: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row mappings.
 - {*}Table properties{*}: What's written to hoodie.properties.
 - *Marker files* : Can be left to the writer implementation.

h2. Change summary:

The following functionality should be supportable by the new format tech specs 
(at a minimum)

Flexibility :
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...)
 - Easy integration of metadata for JVM and non-jvm clients

Metafields :
 - Should _recordkey be uuid special handling?

Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data.
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers

Log :
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places

Concurrency/Timeline:
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Suppo

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-08-29 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-6242:
--
Description: 
This EPIC tracks changes to the Hudi storage format.
Format change is anything that changes any bits related to
 - *Timeline* : active or archived timeline contents, file names.
 - {*}Base Files{*}: file format versions, any changes to any data types, file 
footers, file names.
 - {*}Log Files{*}: Block structure, content, names.
 - {*}Metadata Table{*}: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row mappings.
 - {*}Table properties{*}: What's written to hoodie.properties.
 - *Marker files* : Can be left to the writer implementation.

h2. Change summary:

The following functionality should be supportable by the new format tech specs 
(at a minimum)

Flexibility :
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...)
 - Easy integration of metadata for JVM and non-jvm clients

Metafields :
 - Should _recordkey be uuid special handling?

Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data.
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers

Log :
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places

Concurrency/Timeline:
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts.
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions.
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - TrueTime store/support for instant times
 - No more separate rollback action. make it a new state.

Metadata table :
 - Encode filegroup ID and commit time along with file metadata

Table Properties:
 - Partitioning information/indexing info

  was:
This EPIC tracks changes to the Hudi storage format.
Format change is anything that changes any bits related to
 - *Timeline* : active or archived timeline contents, file names.
 - {*}Base Files{*}: file format versions, any changes to any data types, file 
footers, file names.
 - {*}Log Files{*}: Block structure, content, names.
 - {*}Metadata Table{*}: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row mappings.
 - {*}Table properties{*}: What's written to hoodie.properties.
 - *Marker files* : Can be left to the writer implementation.

h2. Change summary:

 

 

 

 

 

 

 

The following functionality should be supportable by the new format tech specs 
(at a minimum)

Flexibility :
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...)
 - Easy integration of metadata for JVM and non-jvm clients

Metafields :
 - Should _recordkey be uuid special handling?

Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data.
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers

Log :
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places

Concurrency/Timeline:
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts.
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions.
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - TrueTime

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-08-17 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.
Format change is anything that changes any bits related to
 - *Timeline* : active or archived timeline contents, file names.
 - {*}Base Files{*}: file format versions, any changes to any data types, file 
footers, file names.
 - {*}Log Files{*}: Block structure, content, names.
 - {*}Metadata Table{*}: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row mappings.
 - {*}Table properties{*}: What's written to hoodie.properties.
 - *Marker files* : Can be left to the writer implementation.

h2. Change summary:

 

 

 

 

 

 

 

The following functionality should be supportable by the new format tech specs 
(at a minimum)

Flexibility :
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...)
 - Easy integration of metadata for JVM and non-jvm clients

Metafields :
 - Should _recordkey be uuid special handling?

Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data.
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers

Log :
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places

Concurrency/Timeline:
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts.
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions.
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - TrueTime store/support for instant times
 - No more separate rollback action. make it a new state.

Metadata table :
 - Encode filegroup ID and commit time along with file metadata

Table Properties:
 - Partitioning information/indexing info

  was:
This EPIC tracks changes to the Hudi storage format.
Format change is anything that changes any bits related to
 - *Timeline* : active or archived timeline contents, file names.
 - {*}Base Files{*}: file format versions, any changes to any data types, file 
footers, file names.
 - {*}Log Files{*}: Block structure, content, names.
 - {*}Metadata Table{*}: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row mappings.
 - {*}Table properties{*}: What's written to hoodie.properties.
 - *Marker files* : Can be left to the writer implementation.

h2. Change summary:

 

 

 

 

 

 

 

The following functionality should be supportable by the new format tech specs 
(at a minimum)

Flexibility :
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...)
 - Easy integration of metadata for JVM and non-jvm clients

Metafields :
 - Should _recordkey be uuid special handling?

Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data.
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers

Log :
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places

Concurrency/Timeline:
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts.
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions.
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in sc

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-08-16 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.
Format change is anything that changes any bits related to
 - *Timeline* : active or archived timeline contents, file names.
 - {*}Base Files{*}: file format versions, any changes to any data types, file 
footers, file names.
 - {*}Log Files{*}: Block structure, content, names.
 - {*}Metadata Table{*}: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row mappings.
 - {*}Table properties{*}: What's written to hoodie.properties.
 - *Marker files* : Can be left to the writer implementation.

h2. Change summary:

 

 

 

 

 

 

 

The following functionality should be supportable by the new format tech specs 
(at a minimum)

Flexibility :
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...)
 - Easy integration of metadata for JVM and non-jvm clients

Metafields :
 - Should _recordkey be uuid special handling?

Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data.
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers

Log :
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places

Concurrency/Timeline:
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts.
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions.
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - TrueTime store/support for instant times

Metadata table :
 - Encode filegroup ID and commit time along with file metadata

Table Properties:
 - Partitioning information/indexing info

  was:
This EPIC tracks changes to the Hudi storage format.
Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : Can be left to the writer implementation.


The following functionality should be supportable by the new format tech specs 
(at a minimum) 

Flexibility : 
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Easy integration of metadata for JVM and non-jvm clients

Metafields :
- Should _recordkey be uuid special handling?


Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers


Log : 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places


Concurrency/Timeline: 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - TrueTime store/support for instant times

Metadata table :
 - Encode filegroup ID a

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-08-16 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.
Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : Can be left to the writer implementation.


The following functionality should be supportable by the new format tech specs 
(at a minimum) 

Flexibility : 
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Easy integration of metadata for JVM and non-jvm clients

Metafields :
- Should _recordkey be uuid special handling?


Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers


Log : 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places


Concurrency/Timeline: 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - TrueTime store/support for instant times

Metadata table :
 - Encode filegroup ID and commit time along with file metadata


Table Properties: 
 - Partitioning information/indexing info


  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?


The following functionality should be supportable by the new format tech specs 
(at a minimum) 

Flexibility : 
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Easy integration of metadata for JVM and non-jvm clients

Metafields :
- Should _recordkey be uuid special handling?


Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers


Log : 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places


Concurrency/Timeline: 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - TrueTime store/support for instant times

Metadata table :
 - Encode filegroup ID and commit time along with file metadata


Table Properti

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-08-15 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Due Date: 13/Sep/23  (was: 30/Jun/23)

> Format changes for Hudi 1.X release line
> 
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything that changes any bits related to
>  - *Timeline* : active or archived timeline contents, file names.
>  - *Base Files*: file format versions, any changes to any data types, file 
> footers, file names.
>  - *Log Files*:  Block structure, content, names. 
>  - *Metadata Table*: (should we call this index table instead?) partition 
> names, number of file groups, key/value schema and metadata to MDT row 
> mappings. 
> - *Table properties*: What's written to hoodie.properties.
> - *Marker files* : how would we treat these?
> The following functionality should be supportable by the new format tech 
> specs (at a minimum) 
> Flexibility : 
>  - Ability to mix different types of base files within a single table or even 
> a single file group (e.g images, json, vectors ...) 
>  - Easy integration of metadata for JVM and non-jvm clients
> Metafields :
> - Should _recordkey be uuid special handling?
> Additional Info:
>  - Support encoding of watermarks/event time fields as first class citizen, 
> for handling late arriving data. 
>  - Position based skipping of base file
>  - Additional metadata to avoid more RPCs to scan base file/log blocks.
>  - ML/Column family use-case?
>  - Support having changeset of columns in each write, other headers
> Log : 
>  - Support writing updates as deletes and inserts, instead of logging as 
> update to base file.
>  - CDC format is GA.
> Table organization:
>  - Support different logical partitions on the same data
>  - Storage of table spread across buckets/root folders
>  - Decouple table location from timeline, metadata. They can all be in 
> different places
> Concurrency/Timeline: 
>  - Ability to support general purpose multi-table transactions, esp between 
> data and metadata tables.
>  - Support lockless/non-blocking transactions, where writers don't block each 
> other even in face of conflicts. 
>  - Support for long lived instants in timeline, break down distinction 
> between active/archived
>  - Support checking of uniqueness constraints, even in face of two concurrent 
> insert transactions. 
>  - Support precise time-travel queries
>  - Support time-travel writes.
>  - Support schema history tracking and aid in schema evol impl.
>  - TrueTime store/support for instant times
> Metadata table :
>  - Encode filegroup ID and commit time along with file metadata
> Table Properties: 
>  - Partitioning information/indexing info



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-08-02 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?


The following functionality should be supportable by the new format tech specs 
(at a minimum) 

Flexibility : 
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Easy integration of metadata for JVM and non-jvm clients

Metafields :
- Should _recordkey be uuid special handling?


Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers


Log : 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places


Concurrency/Timeline: 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - TrueTime store/support for instant times

Metadata table :
 - Encode filegroup ID and commit time along with file metadata


Table Properties: 
 - Partitioning information/indexing info


  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?


The following functionality should be supportable by the new format tech specs 
(at a minimum) 

Flexibility : 
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Easy integration of metadata for JVM and non-jvm clients


Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers


Log : 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places


Concurrency/Timeline: 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - TrueTime store/support for instant times

Metadata table :
 - Encode filegroup ID and commit time along with file metadata


Table Properties: 
 - Partitioning information/indexing info



> Format changes for Hudi

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-07-06 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?


The following functionality should be supportable by the new format tech specs 
(at a minimum) 

Flexibility : 
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Easy integration of metadata for JVM and non-jvm clients


Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers


Log : 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places


Concurrency/Timeline: 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - TrueTime store/support for instant times

Metadata table :
 - Encode filegroup ID and commit time along with file metadata


Table Properties: 
 - Partitioning information/indexing info


  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?


The following functionality should be supportable by the new format tech specs 
(at a minimum) 

Flexibility : 
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Easy integration of metadata for JVM and non-jvm clients


Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers


Log : 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places


Concurrency/Timeline: 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - TrueTime store/support for instant times



> Format changes for Hudi 1.X release line
> 
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
>  

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-07-05 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?


The following functionality should be supportable by the new format tech specs 
(at a minimum) 

Flexibility : 
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Easy integration of metadata for JVM and non-jvm clients


Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers


Log : 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places


Concurrency/Timeline: 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - TrueTime store/support for instant times


  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?


The following functionality should be met (at a minimum) 

Flexibility : 
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Easy integration of metadata for JVM and non-jvm clients


Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers


Log : 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places


Concurrency/Timeline: 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - TrueTime store/support for instant times



> Format changes for Hudi 1.X release line
> 
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: hu

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-07-05 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?


The following functionality should be met (at a minimum) 

Flexibility : 
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Easy integration of metadata for JVM and non-jvm clients


Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers


Log : 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places


Concurrency/Timeline: 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - TrueTime store/support for instant times


  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?


The following functionality should be met (at a minimum) 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places
 - Support having changeset of columns in each write, other headers
 - ML/Column family use-case



> Format changes for Hudi 1.X release line
> 
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-07-05 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?


The following functionality should be met (at a minimum) 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places
 - Support having changeset of columns in each write, other headers
 - ML/Column family use-case


  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?


The following functionality should be met (at a minimum) 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places
 - Support having changeset of columns in each write, other headers
 - ML/Column family use-case



> Format changes for Hudi 1.X release line
> 
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything that changes any bits related to
>  - *Timeline* : active or archived timeline contents, file names.
>  - *Base Files*: file format versions, any changes to any data types, file 
> foot

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-07-05 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?


The following functionality should be met (at a minimum) 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places
 - Support having changeset of columns in each write, other headers
 - ML/Column family use-case


  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?


The following functionality should be met (at a minimum) 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places
 - Support having changeset of columns in each write, other headers



> Format changes for Hudi 1.X release line
> 
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything that changes any bits related to
>  - *Timeline* : active or archived timeline contents, file names.
>  - *Base Files*: file format versions, any changes to any data types, file 
> footers, file names.
>  - *Log Files*:  Block structure, content, names. 
>  - *Metadata Table*: (should we call this index table

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-07-05 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?


The following functionality should be met (at a minimum) 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places
 - Support having changeset of columns in each write, other headers


  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.


The following functionality should be met (at a minimum) 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places


> Format changes for Hudi 1.X release line
> 
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything that changes any bits related to
>  - *Timeline* : active or archived timeline contents, file names.
>  - *Base Files*: file format versions, any changes to any data types, file 
> footers, file names.
>  - *Log Files*:  Block structure, content, names. 
>  - *Metadata Table*: (should we call this index table instead?) partition 
> names, number of file groups, key/value schema and metadata to MDT row 
> mappings. 
> - *Table properties*: What's wri

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-07-05 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.


The following functionality should be met (at a minimum) 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places

  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.


The following functionality should be met (at a minimum) 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places
 - 


> Format changes for Hudi 1.X release line
> 
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything that changes any bits related to
>  - *Timeline* : active or archived timeline contents, file names.
>  - *Base Files*: file format versions, any changes to any data types, file 
> footers, file names.
>  - *Log Files*:  Block structure, content, names. 
>  - *Metadata Table*: (should we call this index table instead?) partition 
> names, number of file groups, key/value schema and metadata to MDT row 
> mappings. 
> - *Table properties*: What's written to hoodie.properties.
> The following functionality should be met (at a minimum) 
>  - Ability to mix dif

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-07-05 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.


The following functionality should be met. 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places
 - 

  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.


The following functionality should be met. 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Pluggable implementations for write handles, timeline, metadata table.



> Format changes for Hudi 1.X release line
> 
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything that changes any bits related to
>  - *Timeline* : active or archived timeline contents, file names.
>  - *Base Files*: file format versions, any changes to any data types, file 
> footers, file names.
>  - *Log Files*:  Block structure, content, names. 
>  - *Metadata Table*: (should we call this index table instead?) partition 
> names, number of file groups, key/value schema and metadata to MDT row 
> mappings. 
> - *Table properties*: What's written to hoodie.properties.
> The following functionality should be met. 
>  - Ability to mix different types of base files within a single table or even 
> a single file group (e.g images, json, vectors ...) 
>  - Ability to support general purpose multi-table transactions, esp between 
> data and metadata tables.
>  - Support lockless/non-blocking transactions, where writers don't block each 
> other even in face of conflicts. 
>  - Support encoding of watermarks/event time fields as first class citizen, 
> for handling late arriving data. 
>  - Support writing updates as deletes and inserts, instead of logging as 
> update to base file.
>  - Support checking of uniqueness constraints, even in face of two concurrent 
> insert transactions. 
>  - Support precise time-travel queries
>  - Support time-travel writes.
>  - 

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-07-05 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.


The following functionality should be met (at a minimum) 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places
 - 

  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.


The following functionality should be met. 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places
 - 


> Format changes for Hudi 1.X release line
> 
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything that changes any bits related to
>  - *Timeline* : active or archived timeline contents, file names.
>  - *Base Files*: file format versions, any changes to any data types, file 
> footers, file names.
>  - *Log Files*:  Block structure, content, names. 
>  - *Metadata Table*: (should we call this index table instead?) partition 
> names, number of file groups, key/value schema and metadata to MDT row 
> mappings. 
> - *Table properties*: What's written to hoodie.properties.
> The following functionality should be met (at a minimum) 
>  - Ability to mix different typ

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-07-05 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.


The following functionality should be met. 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Pluggable implementations for write handles, timeline, metadata table.


> Format changes for Hudi 1.X release line
> 
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything that changes any bits related to
>  - *Timeline* : active or archived timeline contents, file names.
>  - *Base Files*: file format versions, any changes to any data types, file 
> footers, file names.
>  - *Log Files*:  Block structure, content, names. 
>  - *Metadata Table*: (should we call this index table instead?) partition 
> names, number of file groups, key/value schema and metadata to MDT row 
> mappings. 
> - *Table properties*: What's written to hoodie.properties.
> The following functionality should be met. 
>  - Ability to mix different types of base files within a single table or even 
> a single file group (e.g images, json, vectors ...) 
>  - Ability to support general purpose multi-table transactions, esp between 
> data and metadata tables.
>  - Support lockless/non-blocking transactions, where writers don't block each 
> other even in face of conflicts. 
>  - Support encoding of watermarks/event time fields as first class citizen, 
> for handling late arriving data. 
>  - Pluggable implementations for write handles, timeline, metadata table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-05-19 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Labels: hudi-umbrellas  (was: )

> Format changes for Hudi 1.X release line
> 
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)