[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2024-03-25 Thread Vinoth Chandar (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Epic Colour: ghx-label-2  (was: ghx-label-8)

> Format changes for Hudi 1.X release line
> 
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
> Issue Type: Epic
> Components: core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> h2. Proposals
> A format change is anything that changes any bits related to:
>  * *Timeline*: active or archived timeline contents, file names.
>  * {*}Base Files{*}: file format versions, any changes to any data types, 
> file footers, file names.
>  * {*}Log Files{*}: Block structure, content, names.
>  * {*}Metadata Table{*}: (should we call this index table instead?) partition 
> names, number of file groups, key/value schema and metadata to MDT row 
> mappings.
>  * {*}Table properties{*}: What's written to hoodie.properties.
>  * *Marker files* : Can be left to the writer implementation.
> h2. Change summary:
> The following functionality should be supportable by the new format tech 
> specs (at a minimum)
> Flexibility : 
>  * [Pending] Ability to mix different types of base files within a single
> table or even a single file group (e.g., images, JSON, vectors ...)
>  * [Pending] Easy integration of metadata for JVM and non-JVM clients
> (Parquet as MT format, HFile native APIs)
> Metafields : 
>  * [Resolved] Should _recordkey be uuid special handling?
>  * Semantics of _hoodie_commit_time , with completion time changes.
> Additional Info: 
>  * Support encoding of watermarks/event time fields as first class citizen, 
> for handling late arriving data.
>  * [Resolved] Position based skipping of base file
>  * [Pending] Additional metadata to avoid more RPCs to scan base file/log 
> blocks.
>  * [Pending] ML/Column family use-case?
>  * [Resolved] Support having changeset of columns in each write, other headers
> Log : 
>  * [No change needed] Support writing updates as deletes and inserts, instead 
> of logging as update to base file.
>  * [Pending] CDC format is GA.
> Table organization: 
>  * [Pending] Support different logical partitions on the same data
>  * [Pending] RFC-60/Storage of table spread across buckets/root folders
>  * [Pending] Decouple table location from timeline, metadata. They can all be 
> in different places
> Concurrency/Timeline: 
>  * [Pending] Ability to support general-purpose multi-table transactions,
> especially between data and metadata tables.
>  * [Pending] Support lockless/non-blocking transactions, where writers don't 
> block each other even in face of conflicts.
>  * [Resolved] Support for long lived instants in timeline, break down 
> distinction between active/archived
>  * [Pending] Support checking of uniqueness constraints, even in face of two 
> concurrent insert transactions.
>  * [Pending] Support precise time-travel queries
>  * [Pending] Support time-travel writes.
>  * [Pending] Support schema history tracking and aid in schema evolution
> implementation.
>  * [Resolved] TrueTime store/support for instant times
>  * [Pending] No more separate rollback action; make it a new state.
> Metadata table :
>   * Encode filegroup ID and commit time along with file metadata
> Table Properties:
>   * Partitioning information/indexing info
> Marker Files:
>   * Write marker files for logs as well, based on new marker format.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-09-05 Thread Vinoth Chandar (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.
h2. Proposals

Format change is anything that changes any bits related to

  * *Timeline* : active or archived timeline contents, file names.
 * {*}Base Files{*}: file format versions, any changes to any data types, file 
footers, file names.
 * {*}Log Files{*}: Block structure, content, names.
 * {*}Metadata Table{*}: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row mappings.
 * {*}Table properties{*}: What's written to hoodie.properties.
 * *Marker files* : Can be left to the writer implementation.

h2. Change summary:

The following functionality should be supportable by the new format tech specs 
(at a minimum)
Flexibility : 
 * [Pending] Ability to mix different types of base files within a single table 
or even a single file group (e.g images, json, vectors ...)
 * [Pending] Easy integration of metadata for JVM and non-jvm clients (parquet 
as MT format, HFile native APIs)

Metafields : 
 * [Resolved] Should _recordkey be uuid special handling?
 * Semantics of _hoodie_commit_time , with completion time changes.

Additional Info: 
 * Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data.
 * [Resolved] Position based skipping of base file
 * [Pending] Additional metadata to avoid more RPCs to scan base file/log 
blocks.
 * [Pending] ML/Column family use-case?
 * [Resolved] Support having changeset of columns in each write, other headers

Log : 
 * [No change needed] Support writing updates as deletes and inserts, instead 
of logging as update to base file.
 * [Pending] CDC format is GA.

Table organization: 

 * [Pending] Support different logical partitions on the same data
 * [Pending] RFC-60/Storage of table spread across buckets/root folders
 * [Pending] Decouple table location from timeline, metadata. They can all be 
in different places

Concurrency/Timeline: 

  * [Pending] Ability to support general purpose multi-table transactions, esp 
between data and metadata tables.
 * [Pending] Support lockless/non-blocking transactions, where writers don't 
block each other even in face of conflicts.
 * [Resolved] Support for long lived instants in timeline, break down 
distinction between active/archived
 * [Pending] Support checking of uniqueness constraints, even in face of two 
concurrent insert transactions.
 * [Pending] Support precise time-travel queries
 * [Pending] Support time-travel writes.
 * [Pending] Support schema history tracking and aid in schema evolution
implementation.
 * [Resolved] TrueTime store/support for instant times
 * [Pending] No more separate rollback action; make it a new state.

Metadata table :

  * Encode filegroup ID and commit time along with file metadata

Table Properties:

  * Partitioning information/indexing info

Marker Files:

  * Write marker files for logs as well, based on new marker format.

  was:
This EPIC tracks changes to the Hudi storage format.
h2. Proposals
Format change is anything that changes any bits related to * *Timeline* : 
active or archived timeline contents, file names.
 * {*}Base Files{*}: file format versions, any changes to any data types, file 
footers, file names.
 * {*}Log Files{*}: Block structure, content, names.
 * {*}Metadata Table{*}: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row mappings.
 * {*}Table properties{*}: What's written to hoodie.properties.
 * *Marker files* : Can be left to the writer implementation.

h2. Change summary:
The following functionality should be supportable by the new format tech specs 
(at a minimum)
Flexibility : * [Pending] Ability to mix different types of base files within a 
single table or even a single file group (e.g images, json, vectors ...)
 * [Pending] Easy integration of metadata for JVM and non-jvm clients (parquet 
as MT format, HFile native APIs)

Metafields : * [Resolved] Should _recordkey be uuid special handling?
 * Semantics of _hoodie_commit_time , with completion time changes.

Additional Info: * Support encoding of watermarks/event time fields as first 
class citizen, for handling late arriving data.
 * [Resolved]  Position based skipping of base file
 * [Pending] Additional metadata to avoid more RPCs to scan base file/log 
blocks.
 * [Pending] ML/Column family use-case?
 * [Resolved] Support having changeset of columns in each write, other headers

Log : * [No change needed] Support writing updates as deletes and inserts, 
instead of logging as update to base file.
 * [Pending] CDC format is GA.

Table organization: * [Pending] Support different logical partitions on the

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-09-05 Thread Vinoth Chandar (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.
h2. Proposals
Format change is anything that changes any bits related to * *Timeline* : 
active or archived timeline contents, file names.
 * {*}Base Files{*}: file format versions, any changes to any data types, file 
footers, file names.
 * {*}Log Files{*}: Block structure, content, names.
 * {*}Metadata Table{*}: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row mappings.
 * {*}Table properties{*}: What's written to hoodie.properties.
 * *Marker files* : Can be left to the writer implementation.

h2. Change summary:
The following functionality should be supportable by the new format tech specs 
(at a minimum)
Flexibility : * [Pending] Ability to mix different types of base files within a 
single table or even a single file group (e.g images, json, vectors ...)
 * [Pending] Easy integration of metadata for JVM and non-jvm clients (parquet 
as MT format, HFile native APIs)

Metafields : * [Resolved] Should _recordkey be uuid special handling?
 * Semantics of _hoodie_commit_time , with completion time changes.

Additional Info: * Support encoding of watermarks/event time fields as first 
class citizen, for handling late arriving data.
 * [Resolved]  Position based skipping of base file
 * [Pending] Additional metadata to avoid more RPCs to scan base file/log 
blocks.
 * [Pending] ML/Column family use-case?
 * [Resolved] Support having changeset of columns in each write, other headers

Log : * [No change needed] Support writing updates as deletes and inserts, 
instead of logging as update to base file.
 * [Pending] CDC format is GA.

Table organization: * [Pending] Support different logical partitions on the 
same data
 * [Pending] RFC-60/Storage of table spread across buckets/root folders
 * [Pending] Decouple table location from timeline, metadata. They can all be 
in different places

Concurrency/Timeline: * [Pending] Ability to support general purpose 
multi-table transactions, esp between data and metadata tables.
 * [Pending] Support lockless/non-blocking transactions, where writers don't 
block each other even in face of conflicts.
 * [Resolved] Support for long lived instants in timeline, break down 
distinction between active/archived
 * [Pending] Support checking of uniqueness constraints, even in face of two 
concurrent insert transactions.
 * [Pending] Support precise time-travel queries
 * [Pending] Support time-travel writes.
 * [Pending] Support schema history tracking and aid in schema evol impl.
 * [Resolved] TrueTime store/support for instant times
 * [Pending] No more separate rollback action. make it a new state.

Metadata table : * Encode filegroup ID and commit time along with file metadata

Table Properties: * Partitioning information/indexing info

Marker Files: * Write marker files for logs as well, based on new marker format.

  was:
This EPIC tracks changes to the Hudi storage format.
Format change is anything that changes any bits related to
 - *Timeline* : active or archived timeline contents, file names.
 - {*}Base Files{*}: file format versions, any changes to any data types, file 
footers, file names.
 - {*}Log Files{*}: Block structure, content, names.
 - {*}Metadata Table{*}: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row mappings.
 - {*}Table properties{*}: What's written to hoodie.properties.
 - *Marker files* : Can be left to the writer implementation.

h2. Change summary:

The following functionality should be supportable by the new format tech specs 
(at a minimum)

Flexibility :
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...)
 - Easy integration of metadata for JVM and non-jvm clients

Metafields :
 - Should _recordkey be uuid special handling?

Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data.
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers

Log :
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places

Concurrency/Timeline:
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Suppo

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-08-29 Thread Sagar Sumit (Jira)

[
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Sagar Sumit updated HUDI-6242:
--
Description:
This EPIC tracks changes to the Hudi storage format.
Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- {*}Base Files{*}: file format versions, any changes to any data types, file
footers, file names.
- {*}Log Files{*}: Block structure, content, names.
- {*}Metadata Table{*}: (should we call this index table instead?) partition
names, number of file groups, key/value schema and metadata to MDT row mappings.
- {*}Table properties{*}: What's written to hoodie.properties.
- *Marker files* : Can be left to the writer implementation.

h2. Change summary:

The following functionality should be supportable by the new format tech specs
(at a minimum)

Flexibility :
- Ability to mix different types of base files within a single table or even a
single file group (e.g images, json, vectors ...)
- Easy integration of metadata for JVM and non-jvm clients

Metafields :
- Should _recordkey be uuid special handling?

Additional Info:
- Support encoding of watermarks/event time fields as first class citizen, for
handling late arriving data.
- Position based skipping of base file
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/Column family use-case?
- Support having changeset of columns in each write, other headers

Log :
- Support writing updates as deletes and inserts, instead of logging as update
to base file.
- CDC format is GA.

Table organization:
- Support different logical partitions on the same data
- Storage of table spread across buckets/root folders
- Decouple table location from timeline, metadata. They can all be in
different places

Concurrency/Timeline:
- Ability to support general purpose multi-table transactions, esp between
data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each
other even in face of conflicts.
- Support for long lived instants in timeline, break down distinction between
active/archived
- Support checking of uniqueness constraints, even in face of two concurrent
insert transactions.
- Support precise time-travel queries
- Support time-travel writes.
- Support schema history tracking and aid in schema evol impl.
- TrueTime store/support for instant times
- No more separate rollback action. make it a new state.

Metadata table :
- Encode filegroup ID and commit time along with file metadata

Table Properties:
- Partitioning information/indexing info

was:
This EPIC tracks changes to the Hudi storage format.
Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- {*}Base Files{*}: file format versions, any changes to any data types, file
footers, file names.
- {*}Log Files{*}: Block structure, content, names.
- {*}Metadata Table{*}: (should we call this index table instead?) partition
names, number of file groups, key/value schema and metadata to MDT row mappings.
- {*}Table properties{*}: What's written to hoodie.properties.
- *Marker files* : Can be left to the writer implementation.

h2. Change summary:

The following functionality should be supportable by the new format tech specs
(at a minimum)

Metafields :
- Should _recordkey be uuid special handling?

Log :
- Support writing updates as deletes and inserts, instead of logging as update
to base file.
- CDC format is GA.

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-08-17 Thread Vinoth Chandar (Jira)

[
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Vinoth Chandar updated HUDI-6242:
-
Description:
This EPIC tracks changes to the Hudi storage format.
Format change is anything that changes any bits related to
- *Timeline* : active or archived timeline contents, file names.
- {*}Base Files{*}: file format versions, any changes to any data types, file
footers, file names.
- {*}Log Files{*}: Block structure, content, names.
- {*}Metadata Table{*}: (should we call this index table instead?) partition
names, number of file groups, key/value schema and metadata to MDT row mappings.
- {*}Table properties{*}: What's written to hoodie.properties.
- *Marker files* : Can be left to the writer implementation.

h2. Change summary:

The following functionality should be supportable by the new format tech specs
(at a minimum)

Metafields :
- Should _recordkey be uuid special handling?

Log :
- Support writing updates as deletes and inserts, instead of logging as update
to base file.
- CDC format is GA.

Metadata table :
- Encode filegroup ID and commit time along with file metadata

Table Properties:
- Partitioning information/indexing info

h2. Change summary:

The following functionality should be supportable by the new format tech specs
(at a minimum)

Metafields :
- Should _recordkey be uuid special handling?

Log :
- Support writing updates as deletes and inserts, instead of logging as update
to base file.
- CDC format is GA.

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-08-16 Thread Vinoth Chandar (Jira)

[
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

h2. Change summary:

The following functionality should be supportable by the new format tech specs
(at a minimum)

Metafields :
- Should _recordkey be uuid special handling?

Log :
- Support writing updates as deletes and inserts, instead of logging as update
to base file.
- CDC format is GA.

Metadata table :
- Encode filegroup ID and commit time along with file metadata

Table Properties:
- Partitioning information/indexing info

was:
This EPIC tracks changes to the Hudi storage format.
Format change is anything that changes any bits related to

- *Timeline* : active or archived timeline contents, file names.
- *Base Files*: file format versions, any changes to any data types, file
footers, file names.
- *Log Files*: Block structure, content, names.
- *Metadata Table*: (should we call this index table instead?) partition
names, number of file groups, key/value schema and metadata to MDT row
mappings.
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : Can be left to the writer implementation.

The following functionality should be supportable by the new format tech specs
(at a minimum)

Metafields :
- Should _recordkey be uuid special handling?

Log :
- Support writing updates as deletes and inserts, instead of logging as update
to base file.
- CDC format is GA.

Metadata table :
- Encode filegroup ID a

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-08-16 Thread Vinoth Chandar (Jira)

[
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Vinoth Chandar updated HUDI-6242:
-
Description:
This EPIC tracks changes to the Hudi storage format.
Format change is anything that changes any bits related to

The following functionality should be supportable by the new format tech specs
(at a minimum)

Metafields :
- Should _recordkey be uuid special handling?

Log :
- Support writing updates as deletes and inserts, instead of logging as update
to base file.
- CDC format is GA.

Metadata table :
- Encode filegroup ID and commit time along with file metadata

Table Properties:
- Partitioning information/indexing info

was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

The following functionality should be supportable by the new format tech specs
(at a minimum)

Metafields :
- Should _recordkey be uuid special handling?

Log :
- Support writing updates as deletes and inserts, instead of logging as update
to base file.
- CDC format is GA.

Metadata table :
- Encode filegroup ID and commit time along with file metadata

Table Properti

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-08-15 Thread Vinoth Chandar (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Due Date: 13/Sep/23  (was: 30/Jun/23)

> Format changes for Hudi 1.X release line
> 
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
> Issue Type: Epic
> Components: core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> Format change is anything that changes any bits related to
>  - *Timeline* : active or archived timeline contents, file names.
>  - *Base Files*: file format versions, any changes to any data types, file 
> footers, file names.
>  - *Log Files*:  Block structure, content, names. 
>  - *Metadata Table*: (should we call this index table instead?) partition 
> names, number of file groups, key/value schema and metadata to MDT row 
> mappings. 
> - *Table properties*: What's written to hoodie.properties.
> - *Marker files* : how would we treat these?
> The following functionality should be supportable by the new format tech 
> specs (at a minimum) 
> Flexibility : 
>  - Ability to mix different types of base files within a single table or even 
> a single file group (e.g images, json, vectors ...) 
>  - Easy integration of metadata for JVM and non-jvm clients
> Metafields :
> - Should _recordkey be uuid special handling?
> Additional Info:
>  - Support encoding of watermarks/event time fields as first class citizen, 
> for handling late arriving data. 
>  - Position based skipping of base file
>  - Additional metadata to avoid more RPCs to scan base file/log blocks.
>  - ML/Column family use-case?
>  - Support having changeset of columns in each write, other headers
> Log : 
>  - Support writing updates as deletes and inserts, instead of logging as 
> update to base file.
>  - CDC format is GA.
> Table organization:
>  - Support different logical partitions on the same data
>  - Storage of table spread across buckets/root folders
>  - Decouple table location from timeline, metadata. They can all be in 
> different places
> Concurrency/Timeline: 
>  - Ability to support general purpose multi-table transactions, esp between 
> data and metadata tables.
>  - Support lockless/non-blocking transactions, where writers don't block each 
> other even in face of conflicts. 
>  - Support for long lived instants in timeline, break down distinction 
> between active/archived
>  - Support checking of uniqueness constraints, even in face of two concurrent 
> insert transactions. 
>  - Support precise time-travel queries
>  - Support time-travel writes.
>  - Support schema history tracking and aid in schema evol impl.
>  - TrueTime store/support for instant times
> Metadata table :
>  - Encode filegroup ID and commit time along with file metadata
> Table Properties: 
>  - Partitioning information/indexing info




[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-08-02 Thread Vinoth Chandar (Jira)

[
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Vinoth Chandar updated HUDI-6242:
-
Description:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

The following functionality should be supportable by the new format tech specs
(at a minimum)

Metafields :
- Should _recordkey be uuid special handling?

Log :
- Support writing updates as deletes and inserts, instead of logging as update
to base file.
- CDC format is GA.

Metadata table :
- Encode filegroup ID and commit time along with file metadata

Table Properties:
- Partitioning information/indexing info

was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

The following functionality should be supportable by the new format tech specs
(at a minimum)

Log :
- Support writing updates as deletes and inserts, instead of logging as update
to base file.
- CDC format is GA.

Metadata table :
- Encode filegroup ID and commit time along with file metadata

Table Properties:
- Partitioning information/indexing info

> Format changes for Hudi

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-07-06 Thread Vinoth Chandar (Jira)

[
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Vinoth Chandar updated HUDI-6242:
-
Description:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

The following functionality should be supportable by the new format tech specs
(at a minimum)

Log :
- Support writing updates as deletes and inserts, instead of logging as update
to base file.
- CDC format is GA.

Metadata table :
- Encode filegroup ID and commit time along with file metadata

Table Properties:
- Partitioning information/indexing info

was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

The following functionality should be supportable by the new format tech specs
(at a minimum)

Log :
- Support writing updates as deletes and inserts, instead of logging as update
to base file.
- CDC format is GA.

> Format changes for Hudi 1.X release line
>
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
>

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-07-05 Thread Vinoth Chandar (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?


The following functionality should be supportable by the new format tech specs 
(at a minimum) 

Flexibility : 
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Easy integration of metadata for JVM and non-jvm clients


Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers


Log : 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places


Concurrency/Timeline: 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - TrueTime store/support for instant times


  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?


The following functionality should be met (at a minimum) 

Flexibility : 
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Easy integration of metadata for JVM and non-jvm clients


Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers


Log : 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places


Concurrency/Timeline: 
 - Ability to support general-purpose multi-table transactions, especially 
between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in the face of conflicts. 
 - Support long-lived instants in the timeline; break down the distinction 
between active/archived.
 - Support checking of uniqueness constraints, even in the face of two 
concurrent insert transactions. 
 - Support precise time-travel queries.
 - Support time-travel writes.
 - Support schema history tracking and aid the schema evolution implementation.
 - TrueTime store/support for instant times.




[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-07-05 Thread Vinoth Chandar (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
 - *Table properties*: What's written to hoodie.properties.
 - *Marker files* : how would we treat these?


The following functionality should be met (at a minimum) 

Flexibility : 
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g., images, JSON, vectors, ...)
 - Easy integration of metadata for JVM and non-JVM clients


Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers


Log : 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places


Concurrency/Timeline: 
 - Ability to support general-purpose multi-table transactions, especially 
between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in the face of conflicts. 
 - Support long-lived instants in the timeline; break down the distinction 
between active/archived.
 - Support checking of uniqueness constraints, even in the face of two 
concurrent insert transactions. 
 - Support precise time-travel queries.
 - Support time-travel writes.
 - Support schema history tracking and aid the schema evolution implementation.
 - TrueTime store/support for instant times.


  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
 - *Table properties*: What's written to hoodie.properties.
 - *Marker files* : how would we treat these?


The following functionality should be met (at a minimum) 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places
 - Support having changeset of columns in each write, other headers
 - ML/Column family use-case




[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-07-05 Thread Vinoth Chandar (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
 - *Table properties*: What's written to hoodie.properties.
 - *Marker files* : how would we treat these?


The following functionality should be met (at a minimum) 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places
 - Support having changeset of columns in each write, other headers
 - ML/Column family use-case


  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
 - *Table properties*: What's written to hoodie.properties.
 - *Marker files* : how would we treat these?


The following functionality should be met (at a minimum) 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places
 - Support having changeset of columns in each write, other headers
 - ML/Column family use-case




[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-07-05 Thread Vinoth Chandar (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
 - *Table properties*: What's written to hoodie.properties.
 - *Marker files* : how would we treat these?


The following functionality should be met (at a minimum) 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places
 - Support having changeset of columns in each write, other headers
 - ML/Column family use-case


  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
 - *Table properties*: What's written to hoodie.properties.
 - *Marker files* : how would we treat these?


The following functionality should be met (at a minimum) 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places
 - Support having changeset of columns in each write, other headers




[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-07-05 Thread Vinoth Chandar (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
 - *Table properties*: What's written to hoodie.properties.
 - *Marker files* : how would we treat these?


The following functionality should be met (at a minimum) 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places
 - Support having changeset of columns in each write, other headers


  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
 - *Table properties*: What's written to hoodie.properties.


The following functionality should be met (at a minimum) 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places



[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-07-05 Thread Vinoth Chandar (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
 - *Table properties*: What's written to hoodie.properties.


The following functionality should be met (at a minimum) 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places

  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
 - *Table properties*: What's written to hoodie.properties.


The following functionality should be met (at a minimum) 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places
 - 



[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-07-05 Thread Vinoth Chandar (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
 - *Table properties*: What's written to hoodie.properties.


The following functionality should be met. 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - Easy integration of metadata for JVM and non-jvm clients
 - TrueTime store/support for instant times
 - CDC format is GA.
 - Support different logical partitions on the same data
 - Position based skipping of base file
 - Storage of table spread across buckets/root folders
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline, metadata. They can all be in 
different places
 - 

  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
 - *Table properties*: What's written to hoodie.properties.


The following functionality should be met. 

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Pluggable implementations for write handles, timeline, metadata table.




[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-07-05 Thread Vinoth Chandar (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
 - *Table properties*: What's written to hoodie.properties.


The following functionality should be supported (at a minimum):

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g., images, JSON, vectors, ...).
 - Ability to support general-purpose multi-table transactions, especially 
between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in the face of conflicts.
 - Support encoding of watermarks/event-time fields as first-class citizens, 
for handling late-arriving data.
 - Support writing updates as deletes and inserts, instead of logging them as 
updates to the base file.
 - Support checking of uniqueness constraints, even in the face of two 
concurrent insert transactions.
 - Support precise time-travel queries.
 - Support time-travel writes.
 - Support schema history tracking and aid in the schema evolution 
implementation.
 - Easy integration of metadata for JVM and non-JVM clients.
 - TrueTime store/support for instant times.
 - CDC format is GA.
 - Support different logical partitions on the same data.
 - Position-based skipping of base files.
 - Storage of a table spread across buckets/root folders.
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline and metadata; they can all be in 
different places.

  was:
This EPIC tracks changes to the Hudi storage format.

A format change is anything that changes any bits related to:

 - *Timeline*: active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*: block structure, content, names.
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings.
 - *Table properties*: what's written to hoodie.properties.


The following functionality should be supported:

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g., images, JSON, vectors, ...).
 - Ability to support general-purpose multi-table transactions, especially 
between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in the face of conflicts.
 - Support encoding of watermarks/event-time fields as first-class citizens, 
for handling late-arriving data.
 - Support writing updates as deletes and inserts, instead of logging them as 
updates to the base file.
 - Support checking of uniqueness constraints, even in the face of two 
concurrent insert transactions.
 - Support precise time-travel queries.
 - Support time-travel writes.
 - Support schema history tracking and aid in the schema evolution 
implementation.
 - Easy integration of metadata for JVM and non-JVM clients.
 - TrueTime store/support for instant times.
 - CDC format is GA.
 - Support different logical partitions on the same data.
 - Position-based skipping of base files.
 - Storage of a table spread across buckets/root folders.
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - Decouple table location from timeline and metadata; they can all be in 
different places.


> Format changes for Hudi 1.X release line
> 
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
> Issue Type: Epic
> Components: core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> A format change is anything that changes any bits related to:
>  - *Timeline*: active or archived timeline contents, file names.
>  - *Base Files*: file format versions, any changes to any data types, file 
> footers, file names.
>  - *Log Files*: block structure, content, names.
>  - *Metadata Table*: (should we call this index table instead?) partition 
> names, number of file groups, key/value schema and metadata to MDT row 
> mappings.
>  - *Table properties*: what's written to hoodie.properties.
> The following functionality should be supported (at a minimum):
>  - Ability to mix different typ

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-07-05 Thread Vinoth Chandar (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

A format change is anything that changes any bits related to:

 - *Timeline*: active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*: block structure, content, names.
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings.
 - *Table properties*: what's written to hoodie.properties.


The following functionality should be supported:

 - Ability to mix different types of base files within a single table or even a 
single file group (e.g., images, JSON, vectors, ...).
 - Ability to support general-purpose multi-table transactions, especially 
between data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in the face of conflicts.
 - Support encoding of watermarks/event-time fields as first-class citizens, 
for handling late-arriving data.
 - Pluggable implementations for write handles, timeline, metadata table.
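
To make the watermark/event-time item above concrete, here is a minimal 
sketch of how a writer might flag late-arriving records. It is not Hudi 
code and not part of this proposal: the names Record, WatermarkTracker and 
allowed_lateness are hypothetical, and the rule used (watermark = max event 
time seen minus a fixed allowed lateness) is just one common convention.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    key: str
    event_time: int  # hypothetical field: epoch seconds when the event occurred


class WatermarkTracker:
    """Tracks a watermark as (max event time seen) - (allowed lateness)."""

    def __init__(self, allowed_lateness: int) -> None:
        self.allowed_lateness = allowed_lateness
        self.max_event_time: Optional[int] = None

    @property
    def watermark(self) -> Optional[int]:
        # No watermark exists until at least one record has been observed.
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_lateness

    def observe(self, record: Record) -> bool:
        """Return True if the record is older than the current watermark.

        The watermark only moves forward: it advances when a record carries
        a newer event time than any seen so far.
        """
        wm = self.watermark
        late = wm is not None and record.event_time < wm
        if self.max_event_time is None or record.event_time > self.max_event_time:
            self.max_event_time = record.event_time
        return late
```

A caller would feed each incoming record to observe() and route the ones 
reported as late to whatever late-data handling the table is configured 
for. Where the watermark itself would be persisted in the format is the 
open question this item tracks.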


> Format changes for Hudi 1.X release line
> 
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
> Issue Type: Epic
> Components: core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>
> This EPIC tracks changes to the Hudi storage format.
> A format change is anything that changes any bits related to:
>  - *Timeline*: active or archived timeline contents, file names.
>  - *Base Files*: file format versions, any changes to any data types, file 
> footers, file names.
>  - *Log Files*: block structure, content, names.
>  - *Metadata Table*: (should we call this index table instead?) partition 
> names, number of file groups, key/value schema and metadata to MDT row 
> mappings.
>  - *Table properties*: what's written to hoodie.properties.
> The following functionality should be supported:
>  - Ability to mix different types of base files within a single table or even 
> a single file group (e.g., images, JSON, vectors, ...).
>  - Ability to support general-purpose multi-table transactions, especially 
> between data and metadata tables.
>  - Support lockless/non-blocking transactions, where writers don't block each 
> other even in the face of conflicts.
>  - Support encoding of watermarks/event-time fields as first-class citizens, 
> for handling late-arriving data.
>  - Pluggable implementations for write handles, timeline, metadata table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-05-19 Thread Vinoth Chandar (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Labels: hudi-umbrellas  (was: )

> Format changes for Hudi 1.X release line
> 
>
> Key: HUDI-6242
> URL: https://issues.apache.org/jira/browse/HUDI-6242
> Project: Apache Hudi
> Issue Type: Epic
> Components: core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Labels: hudi-umbrellas
> Fix For: 1.0.0
>
>




