(hudi) branch master updated: [HUDI-7979] Adjusting defaults with spillable map memory (#11611)

2024-07-11 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 9ee96655094 [HUDI-7979] Adjusting defaults with spillable map memory (#11611)
9ee96655094 is described below

commit 9ee9665509451c58736b37daa686510fbd1ad536
Author: Sivabalan Narayanan 
AuthorDate: Thu Jul 11 22:11:19 2024 -0700

[HUDI-7979] Adjusting defaults with spillable map memory (#11611)
---
 .../apache/hudi/config/TestHoodieWriteConfig.java  | 21 +++
 .../common/config/SerializableConfiguration.java   | 69 ++
 .../table/view/FileSystemViewStorageConfig.java| 11 ++--
 .../org/apache/hudi/exception/HoodieException.java | 48 +++
 .../apache/hudi/exception/HoodieIOException.java   | 44 ++
 .../HoodieIncompatibleSchemaException.java | 33 +++
 6 files changed, 221 insertions(+), 5 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/config/TestHoodieWriteConfig.java b/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/config/TestHoodieWriteConfig.java
index 00c9bfbd163..b39ac241407 100644
--- a/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/config/TestHoodieWriteConfig.java
+++ b/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/config/TestHoodieWriteConfig.java
@@ -30,6 +30,7 @@ import org.apache.hudi.common.model.HoodieTableType;
 import org.apache.hudi.common.model.WriteConcurrencyMode;
 import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.common.table.marker.MarkerType;
+import org.apache.hudi.common.table.view.FileSystemViewStorageConfig;
 import org.apache.hudi.config.HoodieWriteConfig.Builder;
 import org.apache.hudi.index.HoodieIndex;
 import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
@@ -553,6 +554,26 @@ public class TestHoodieWriteConfig {
 "Non-blocking concurrency control requires the MOR table with simple 
bucket index");
   }
 
+  @Test
+  public void testFileSystemViewStorageConfigDefaults() {
+    HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder().withPath("/tmp").build();
+    assertEquals(FileSystemViewStorageConfig.SPILLABLE_MEMORY.defaultValue() * FileSystemViewStorageConfig.BOOTSTRAP_BASE_FILE_MEM_FRACTION.defaultValue(),
+        writeConfig.getViewStorageConfig().getMaxMemoryForBootstrapBaseFile());
+    assertEquals(FileSystemViewStorageConfig.SPILLABLE_MEMORY.defaultValue() * FileSystemViewStorageConfig.SPILLABLE_COMPACTION_MEM_FRACTION.defaultValue(),
+        writeConfig.getViewStorageConfig().getMaxMemoryForPendingCompaction());
+    assertEquals(FileSystemViewStorageConfig.SPILLABLE_MEMORY.defaultValue() * FileSystemViewStorageConfig.SPILLABLE_LOG_COMPACTION_MEM_FRACTION.defaultValue(),
+        writeConfig.getViewStorageConfig().getMaxMemoryForPendingLogCompaction());
+    assertEquals(FileSystemViewStorageConfig.SPILLABLE_MEMORY.defaultValue() * FileSystemViewStorageConfig.SPILLABLE_CLUSTERING_MEM_FRACTION.defaultValue(),
+        writeConfig.getViewStorageConfig().getMaxMemoryForPendingClusteringFileGroups());
+    assertEquals(FileSystemViewStorageConfig.SPILLABLE_MEMORY.defaultValue() * FileSystemViewStorageConfig.SPILLABLE_REPLACED_MEM_FRACTION.defaultValue(),
+        writeConfig.getViewStorageConfig().getMaxMemoryForReplacedFileGroups());
+    assertEquals(FileSystemViewStorageConfig.SPILLABLE_MEMORY.defaultValue() - writeConfig.getViewStorageConfig().getMaxMemoryForBootstrapBaseFile()
+        - writeConfig.getViewStorageConfig().getMaxMemoryForPendingCompaction() - writeConfig.getViewStorageConfig().getMaxMemoryForPendingLogCompaction()
+        - writeConfig.getViewStorageConfig().getMaxMemoryForPendingClusteringFileGroups()
+        - writeConfig.getViewStorageConfig().getMaxMemoryForReplacedFileGroups(),
+        writeConfig.getViewStorageConfig().getMaxMemoryForFileGroupMap());
+  }
+
   private HoodieWriteConfig createWriteConfig(Map<String, String> configs) {
 final Properties properties = new Properties();
 configs.forEach(properties::setProperty);
diff --git a/hudi-common/src/main/java/org/apache/hudi/common/config/SerializableConfiguration.java b/hudi-common/src/main/java/org/apache/hudi/common/config/SerializableConfiguration.java
new file mode 100644
index 00000000000..23a22e01822
--- /dev/null
+++ b/hudi-common/src/main/java/org/apache/hudi/common/config/SerializableConfiguration.java
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance

[jira] [Assigned] (HUDI-7972) Add fallback for deletion vector in 0.16.x reader while reading 1.x tables

2024-07-11 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan reassigned HUDI-7972:
-----------------------------------------

Assignee: sivabalan narayanan

> Add fallback for deletion vector in 0.16.x reader while reading 1.x tables
> --
>
> Key: HUDI-7972
> URL: https://issues.apache.org/jira/browse/HUDI-7972
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: reader-core
>Reporter: sivabalan narayanan
>    Assignee: sivabalan narayanan
>Priority: Major
>  Labels: 1.0-migration
> Fix For: 1.0.0
>
>
> If 0.16.x reader is used to read a 1.x table with deletion vector, we should 
> fallback to using key based merges instead of position based merges. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7971) Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader

2024-07-10 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7971:
--------------------------------------
Description: 
Let's ensure the 1.x reader is fully compatible with reading any 0.14.x to 0.16.x tables.

 

Readers :  1.x
 # Spark SQL
 # Spark Datasource
 # Trino/Presto
 # Hive
 # Flink

Writer: 0.16

Table State:
 * COW
 ** few write commits 
 ** Pending clustering
 ** Completed Clustering
 ** Failed writes with no rollbacks
 ** Insert overwrite table/partition
 ** Savepoint for Time-travel query

 * MOR
 ** Same as COW
 ** Pending and completed async compaction (with log-files and no base file)
 ** Custom Payloads (for MOR snapshot queries) (e.g. SQL Expression Payload)
 ** Log block formats - DELETE, rollback block

Other knobs:
 # Metadata enabled/disabled (all combinations)
 # Column Stats enabled/disabled and data-skipping enabled/disabled
 # RLI enabled with eq/IN queries

 # Non-Partitioned dataset (all combinations)
 # CDC Reads 
 # Incremental Reads
 # Time-travel query

 

What to test?
 # Query Results Correctness
 # Performance : See the benefit of 
 # Partition Pruning
 # Metadata table - col stats, RLI

 

Corner Case Testing:

 
 # Schema Evolution with different file-groups having different generations of schema
 # Dynamic Partition Pruning
 # Does column projection work correctly when reading log files?

  was:
Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x 
tables 

 

Readers :  1.x
 # Spark SQL
 # Spark Datasource
 # Trino/Presto
 # Hive
 # Flink

Writer: 0.16

Table State:
 * COW
 ** few write commits 
 ** Pending clustering
 ** Completed Clustering
 ** Failed writes with no rollbacks
 ** Insert overwrite table/partition
 ** Savepoint for Time-travel query

 * MOR
 ** Same as COW
 ** Pending and completed async compaction (with log-files and no base file)
 ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
 ** Log block formats - DELETE, rollback block

Other knobs:
 # Metadata enabled/disabled
 # Column Stats enabled/disabled and data-skipping enabled/disabled
 # RLI enabled with eq/IN queries

 # Non-Partitioned dataset
 # CDC Reads 
 # Incremental Reads
 # Time-travel query

 

What to test ?
 # Query Results Correctness
 # Performance : See the benefit of 
 # Partition Pruning
 # Metadata  table - col stats, RLI,

 

Corner Case Testing:

 
 # Schema Evolution with different file-groups having different generation of 
schema
 # Dynamic Partition Pruning
 # Does Column Projection work correctly for log files reading 


> Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader 
> -
>
> Key: HUDI-7971
> URL: https://issues.apache.org/jira/browse/HUDI-7971
> Project: Apache Hudi
>  Issue Type: Sub-task
>    Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 1.0.0
>
>
> Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x 
> tables 
>  
> Readers :  1.x
>  # Spark SQL
>  # Spark Datasource
>  # Trino/Presto
>  # Hive
>  # Flink
> Writer: 0.16
> Table State:
>  * COW
>  ** few write commits 
>  ** Pending clustering
>  ** Completed Clustering
>  ** Failed writes with no rollbacks
>  ** Insert overwrite table/partition
>  ** Savepoint for Time-travel query
>  * MOR
>  ** Same as COW
>  ** Pending and completed async compaction (with log-files and no base file)
>  ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
>  ** Log block formats - DELETE, rollback block
> Other knobs:
>  # Metadata enabled/disabled (all combinations)
>  # Column Stats enabled/disabled and data-skipping enabled/disabled
>  # RLI enabled with eq/IN queries
>  # Non-Partitioned dataset (all combinations)
>  # CDC Reads 
>  # Incremental Reads
>  # Time-travel query
>  
> What to test ?
>  # Query Results Correctness
>  # Performance : See the benefit of 
>  # Partition Pruning
>  # Metadata  table - col stats, RLI,
>  
> Corner Case Testing:
>  
>  # Schema Evolution with different file-groups having different generation of 
> schema
>  # Dynamic Partition Pruning
>  # Does Column Projection work correctly for log files reading 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7971) Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader

2024-07-10 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7971:
--------------------------------------
Description: 
Let's ensure the 1.x reader is fully compatible with reading any 0.14.x to 0.16.x tables.

 

Readers :  1.x
 # Spark SQL
 # Spark Datasource
 # Trino/Presto
 # Hive
 # Flink

Writer: 0.16

Table State:
 * COW
 ** few write commits 
 ** Pending clustering
 ** Completed Clustering
 ** Failed writes with no rollbacks
 ** Insert overwrite table/partition
 ** Savepoint for Time-travel query

 * MOR
 ** Same as COW
 ** Pending and completed async compaction (with log-files and no base file)
 ** Custom Payloads (for MOR snapshot queries) (e.g. SQL Expression Payload)
 ** Log block formats - DELETE, rollback block

Other knobs:
 # Metadata enabled/disabled
 # Column Stats enabled/disabled and data-skipping enabled/disabled
 # RLI enabled with eq/IN queries

 # Non-Partitioned dataset
 # CDC Reads 
 # Incremental Reads
 # Time-travel query

 

What to test?
 # Query Results Correctness
 # Performance : See the benefit of 
 # Partition Pruning
 # Metadata table - col stats, RLI

 

Corner Case Testing:

 
 # Schema Evolution with different file-groups having different generations of schema
 # Dynamic Partition Pruning
 # Does column projection work correctly when reading log files?

  was:
Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x 
tables 

 

Readers :  1.x
 # Spark SQL
 # Spark Datasource
 # Trino/Presto
 # Hive
 # Flink

Writer: 0.16

Table State:
 * COW
 ** few write commits 
 ** Pending clustering
 ** Completed Clustering
 ** Failed writes with no rollbacks
 ** Insert overwrite table/partition
 ** Savepoint for Time-travel query

 * MOR
 ** Same as COW
 ** Pending and completed async compaction (with log-files and no base file)
 ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
 ** Rollback formats - DELETE, rollback block

Other knobs:
 # Metadata enabled/disabled
 # Column Stats enabled/disabled and data-skipping enabled/disabled
 # RLI enabled with eq/IN queries

 # Non-Partitioned dataset
 # CDC Reads 
 # Incremental Reads
 # Time-travel query

 

What to test ?
 # Query Results Correctness
 # Performance : See the benefit of 
 # Partition Pruning
 # Metadata  table - col stats, RLI,

 

Corner Case Testing:

 
 # Schema Evolution with different file-groups having different generation of 
schema
 # Dynamic Partition Pruning
 # Does Column Projection work correctly for log files reading 


> Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader 
> -
>
> Key: HUDI-7971
> URL: https://issues.apache.org/jira/browse/HUDI-7971
> Project: Apache Hudi
>  Issue Type: Sub-task
>    Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 1.0.0
>
>
> Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x 
> tables 
>  
> Readers :  1.x
>  # Spark SQL
>  # Spark Datasource
>  # Trino/Presto
>  # Hive
>  # Flink
> Writer: 0.16
> Table State:
>  * COW
>  ** few write commits 
>  ** Pending clustering
>  ** Completed Clustering
>  ** Failed writes with no rollbacks
>  ** Insert overwrite table/partition
>  ** Savepoint for Time-travel query
>  * MOR
>  ** Same as COW
>  ** Pending and completed async compaction (with log-files and no base file)
>  ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
>  ** Log block formats - DELETE, rollback block
> Other knobs:
>  # Metadata enabled/disabled
>  # Column Stats enabled/disabled and data-skipping enabled/disabled
>  # RLI enabled with eq/IN queries
>  # Non-Partitioned dataset
>  # CDC Reads 
>  # Incremental Reads
>  # Time-travel query
>  
> What to test ?
>  # Query Results Correctness
>  # Performance : See the benefit of 
>  # Partition Pruning
>  # Metadata  table - col stats, RLI,
>  
> Corner Case Testing:
>  
>  # Schema Evolution with different file-groups having different generation of 
> schema
>  # Dynamic Partition Pruning
>  # Does Column Projection work correctly for log files reading 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch asf-site updated: [HUDI-7978][DOCS] Add a note on field ordering in partitioned by clause of create sql (#11610)

2024-07-10 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 488438fbb6a [HUDI-7978][DOCS] Add a note on field ordering in partitioned by clause of create sql (#11610)
488438fbb6a is described below

commit 488438fbb6ae2f8dfcc9257016c66a38c0352171
Author: Sagar Sumit 
AuthorDate: Thu Jul 11 08:38:42 2024 +0530

    [HUDI-7978][DOCS] Add a note on field ordering in partitioned by clause of create sql (#11610)
---
 website/docs/sql_ddl.md| 5 -
 website/versioned_docs/version-0.11.0/quick-start-guide.md | 7 +++
 website/versioned_docs/version-0.11.1/quick-start-guide.md | 7 +++
 website/versioned_docs/version-0.12.0/quick-start-guide.md | 7 +++
 website/versioned_docs/version-0.12.1/quick-start-guide.md | 7 +++
 website/versioned_docs/version-0.12.2/quick-start-guide.md | 7 +++
 website/versioned_docs/version-0.12.3/quick-start-guide.md | 7 +++
 website/versioned_docs/version-0.13.0/quick-start-guide.md | 7 +++
 website/versioned_docs/version-0.13.1/quick-start-guide.md | 7 +++
 website/versioned_docs/version-0.14.0/sql_ddl.md   | 5 -
 website/versioned_docs/version-0.14.1/sql_ddl.md   | 5 -
 website/versioned_docs/version-0.15.0/sql_ddl.md   | 5 -
 12 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/website/docs/sql_ddl.md b/website/docs/sql_ddl.md
index 61e7d33cd7f..a85d8a7bb04 100644
--- a/website/docs/sql_ddl.md
+++ b/website/docs/sql_ddl.md
@@ -67,7 +67,10 @@ PARTITIONED BY (dt);
 ```
 
 :::note
-You can also create a table partitioned by multiple fields by supplying comma-separated field names. For, e.g., "partitioned by dt, hh"
+You can also create a table partitioned by multiple fields by supplying comma-separated field names.
+When creating a table partitioned by multiple fields, ensure that you specify the columns in the `PARTITIONED BY` clause 
+in the same order as they appear in the `CREATE TABLE` schema. For example, for the above table, the partition fields 
+should be specified as `PARTITIONED BY (dt, hh)`.
 :::
 
 ### Create table with record keys and ordering fields
diff --git a/website/versioned_docs/version-0.11.0/quick-start-guide.md b/website/versioned_docs/version-0.11.0/quick-start-guide.md
index 9f670871f6a..35336d3f4d4 100644
--- a/website/versioned_docs/version-0.11.0/quick-start-guide.md
+++ b/website/versioned_docs/version-0.11.0/quick-start-guide.md
@@ -298,6 +298,13 @@ partitioned by (dt, hh)
 location '/tmp/hudi/hudi_cow_pt_tbl';
 ```
 
+:::note
+You can also create a table partitioned by multiple fields by supplying comma-separated field names.
+When creating a table partitioned by multiple fields, ensure that you specify the columns in the `PARTITIONED BY` clause
+in the same order as they appear in the `CREATE TABLE` schema. For example, for the above table, the partition fields
+should be specified as `PARTITIONED BY (dt, hh)`.
+:::
+
 **Create Table for an existing Hudi Table**
 
 We can create a table on an existing hudi table(created with spark-shell or deltastreamer). This is useful to
diff --git a/website/versioned_docs/version-0.11.1/quick-start-guide.md b/website/versioned_docs/version-0.11.1/quick-start-guide.md
index d45b535ef42..d0c32790d5a 100644
--- a/website/versioned_docs/version-0.11.1/quick-start-guide.md
+++ b/website/versioned_docs/version-0.11.1/quick-start-guide.md
@@ -296,6 +296,13 @@ partitioned by (dt, hh)
 location '/tmp/hudi/hudi_cow_pt_tbl';
 ```
 
+:::note
+You can also create a table partitioned by multiple fields by supplying comma-separated field names.
+When creating a table partitioned by multiple fields, ensure that you specify the columns in the `PARTITIONED BY` clause
+in the same order as they appear in the `CREATE TABLE` schema. For example, for the above table, the partition fields
+should be specified as `PARTITIONED BY (dt, hh)`.
+:::
+
 **Create Table for an existing Hudi Table**
 
 We can create a table on an existing hudi table(created with spark-shell or deltastreamer). This is useful to
diff --git a/website/versioned_docs/version-0.12.0/quick-start-guide.md b/website/versioned_docs/version-0.12.0/quick-start-guide.md
index aac9a9bd048..9fc3a0414f5 100644
--- a/website/versioned_docs/version-0.12.0/quick-start-guide.md
+++ b/website/versioned_docs/version-0.12.0/quick-start-guide.md
@@ -322,6 +322,13 @@ partitioned by (dt, hh)
 location '/tmp/hudi/hudi_cow_pt_tbl';
 ```
 
+:::note
+You can also create a table partitioned by multiple fields by supplying comma-separated field names.
+When creating a table partitioned by multiple fields, ensure that you specify the columns in the `PARTITIONED BY` clause
+in the same order as they appear in the `CREATE TABLE` schema. For example, for the a
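
The ordering rule in the docs note above, made concrete as a minimal hedged sketch (the table and column names are illustrative, and it assumes a local Spark session with the Hudi bundle on the classpath, as in the quick-start guides):

import org.apache.spark.sql.SparkSession;

// Illustrative sketch only, not part of the commit: shows the PARTITIONED BY
// ordering rule from the docs note above. Table/column names are made up.
public class PartitionOrderingSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("partition-ordering-sketch")
        .master("local[1]")
        .getOrCreate();

    // The schema declares dt before hh, so the PARTITIONED BY clause lists
    // the partition fields in the same order: (dt, hh).
    spark.sql("CREATE TABLE hudi_pt_tbl (id BIGINT, name STRING, dt STRING, hh STRING) "
        + "USING hudi PARTITIONED BY (dt, hh) LOCATION '/tmp/hudi/hudi_pt_tbl'");

    // Listing (hh, dt) instead would invert the order relative to the schema,
    // which is exactly what the docs note warns against.
    spark.stop();
  }
}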

(hudi) branch master updated: [HUDI-7921] Fixing file system view closures in MDT (#11496)

2024-07-10 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 3789840be3d [HUDI-7921] Fixing file system view closures in MDT (#11496)
3789840be3d is described below

commit 3789840be3d041cbcfc6b24786740210e4e6d6ac
Author: Sivabalan Narayanan 
AuthorDate: Wed Jul 10 19:25:41 2024 -0700

[HUDI-7921] Fixing file system view closures in MDT (#11496)
---
 .../metadata/HoodieBackedTableMetadataWriter.java  |  55 ++--
 .../common/testutils/HoodieMetadataTestTable.java  |   6 +
 .../java/org/apache/hudi/table/TestCleaner.java| 326 +++--
 .../table/functional/TestCleanPlanExecutor.java| 325 ++--
 .../hudi/testutils/HoodieCleanerTestBase.java  |  31 +-
 .../hudi/metadata/HoodieBackedTableMetadata.java   |   4 +
 .../hudi/metadata/HoodieTableMetadataUtil.java |  48 +--
 .../hudi/common/testutils/HoodieTestTable.java |   8 +-
 8 files changed, 440 insertions(+), 363 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index 89d21e79b22..c38a68e37cf 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -1081,9 +1081,8 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableM
   private HoodieData<HoodieRecord> getFunctionalIndexUpdates(HoodieCommitMetadata commitMetadata, String indexPartition, String instantTime) throws Exception {
     HoodieIndexDefinition indexDefinition = getFunctionalIndexDefinition(indexPartition);
     List<Pair<String, FileSlice>> partitionFileSlicePairs = new ArrayList<>();
-    HoodieTableFileSystemView fsView = HoodieTableMetadataUtil.getFileSystemView(dataMetaClient);
     commitMetadata.getPartitionToWriteStats().forEach((dataPartition, value) -> {
-      List<FileSlice> fileSlices = getPartitionLatestFileSlicesIncludingInflight(dataMetaClient, Option.ofNullable(fsView), dataPartition);
+      List<FileSlice> fileSlices = getPartitionLatestFileSlicesIncludingInflight(dataMetaClient, Option.empty(), dataPartition);
       fileSlices.forEach(fileSlice -> {
         // Filter log files for the instant time and add to this partition fileSlice pairs
         List<HoodieLogFile> logFilesForInstant = fileSlice.getLogFiles()
@@ -1411,35 +1410,35 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableM
       HoodieData<HoodieRecord>> partitionRecordsMap) {
     // The result set
     HoodieData<HoodieRecord> allPartitionRecords = engineContext.emptyHoodieData();
+    try (HoodieTableFileSystemView fsView = HoodieTableMetadataUtil.getFileSystemView(metadataMetaClient)) {
+      for (Map.Entry<MetadataPartitionType, HoodieData<HoodieRecord>> entry : partitionRecordsMap.entrySet()) {
+        final String partitionName = HoodieIndexUtils.getPartitionNameFromPartitionType(entry.getKey(), dataMetaClient, dataWriteConfig.getIndexingConfig().getIndexName());
+        HoodieData<HoodieRecord> records = entry.getValue();
+
+        List<FileSlice> fileSlices =
+            HoodieTableMetadataUtil.getPartitionLatestFileSlices(metadataMetaClient, Option.ofNullable(fsView), partitionName);
+        if (fileSlices.isEmpty()) {
+          // scheduling of INDEX only initializes the file group and not add commit
+          // so if there are no committed file slices, look for inflight slices
+          fileSlices = getPartitionLatestFileSlicesIncludingInflight(metadataMetaClient, Option.ofNullable(fsView), partitionName);
+        }
+        final int fileGroupCount = fileSlices.size();
+        ValidationUtils.checkArgument(fileGroupCount > 0, String.format("FileGroup count for MDT partition %s should be >0", partitionName));
+
+        List<FileSlice> finalFileSlices = fileSlices;
+        HoodieData<HoodieRecord> rddSinglePartitionRecords = records.map(r -> {
+          FileSlice slice = finalFileSlices.get(HoodieTableMetadataUtil.mapRecordKeyToFileGroupIndex(r.getRecordKey(),
+              fileGroupCount));
+          r.unseal();
+          r.setCurrentLocation(new HoodieRecordLocation(slice.getBaseInstantTime(), slice.getFileId()));
+          r.seal();
+          return r;
+        });
 
-    HoodieTableFileSystemView fsView = HoodieTableMetadataUtil.getFileSystemView(metadataMetaClient);
-    for (Map.Entry<MetadataPartitionType, HoodieData<HoodieRecord>> entry : partitionRecordsMap.entrySet()) {
-      final String partitionName = HoodieIndexUtils.getPartitionNameFromPartitionType(entry.getKey(), dataMetaClient, dataWriteConfig.getIndexingConfig().getIndexName());
-      HoodieData<HoodieRecord> records = entry.getValue();
-
-      List<FileSlice> fileSlices =
-          HoodieTableMetadataUtil.getPartitionLatestFileSlices(me
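
The pattern behind this fix, as a minimal hedged sketch (the FileSystemView interface below is a stand-in, not Hudi's class; only the AutoCloseable contract matters here):

// Illustrative only: the resource-leak pattern this commit addresses.
public class ViewLifecycleSketch {

  interface FileSystemView extends AutoCloseable {
    void load(String partition);
    @Override
    void close(); // narrowed: no checked exception
  }

  // Leaky: if load() throws, close() is never reached and the view leaks.
  static void leaky(FileSystemView view, String... partitions) {
    for (String p : partitions) {
      view.load(p);
    }
    view.close();
  }

  // Safe: try-with-resources closes the view on success and on failure.
  static void safe(FileSystemView view, String... partitions) {
    try (FileSystemView v = view) {
      for (String p : partitions) {
        v.load(p);
      }
    }
  }

  public static void main(String[] args) {
    FileSystemView view = new FileSystemView() {
      @Override public void load(String partition) { System.out.println("load " + partition); }
      @Override public void close() { System.out.println("closed"); }
    };
    safe(view, "2024/07/10", "2024/07/11"); // prints both loads, then "closed"
  }
}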

[jira] [Created] (HUDI-7979) Fix out of the box defaults with spillable memory configs

2024-07-10 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7979:
--------------------------------------

 Summary: Fix out of the box defaults with spillable memory configs 
 Key: HUDI-7979
 URL: https://issues.apache.org/jira/browse/HUDI-7979
 Project: Apache Hudi
  Issue Type: Improvement
  Components: reader-core, writer-core
Reporter: sivabalan narayanan


Looks like we are very conservative w.r.t. the memory configs used for the spillable-map-based FSV. 

 
For example, we are only allocating 15 MB out of the box to file groups when using the spillable-map-based FSV:

public long getMaxMemoryForFileGroupMap() {
  long totalMemory = getLong(SPILLABLE_MEMORY);
  return totalMemory - getMaxMemoryForPendingCompaction() - getMaxMemoryForBootstrapBaseFile();
}
 
SPILLABLE_MEMORY defaults to 100 MB.
getMaxMemoryForPendingCompaction = 80% of 100 MB = 80 MB.
getMaxMemoryForBootstrapBaseFile = 5% of 100 MB = 5 MB.
So, overall, out of the box we are allocating only 15 MB for getMaxMemoryForFileGroupMap.
ref: [https://github.com/apache/hudi/blob/bb0621edee97507cf2460e8cb57b5307510b917e/hudi-[…]/apache/hudi/common/table/view/FileSystemViewStorageConfig.java|https://github.com/apache/hudi/blob/bb0621edee97507cf2460e8cb57b5307510b917e/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java#L224]
Wondering whether we even need 80% for the pending compaction tracker in our FSV. I am thinking of making it 15%, so that we can give more memory to the actual file groups. We may not have a lot of pending compactions for a given table. 
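
A quick sketch of the arithmetic above (the fractions are written out as literals for illustration; the real defaults live in FileSystemViewStorageConfig):

// Illustrative only: reproduces the default-budget arithmetic described above.
public class SpillableMemoryBudgetSketch {
  public static void main(String[] args) {
    long spillableMemory = 100L * 1024 * 1024; // SPILLABLE_MEMORY default: 100 MB
    double pendingCompactionFraction = 0.8;    // 80% of the budget
    double bootstrapBaseFileFraction = 0.05;   // 5% of the budget

    long pendingCompaction = (long) (spillableMemory * pendingCompactionFraction);
    long bootstrapBaseFile = (long) (spillableMemory * bootstrapBaseFileFraction);
    long fileGroupMap = spillableMemory - pendingCompaction - bootstrapBaseFile;

    // Prints 15728640 bytes, i.e. the 15 MB remainder left for actual file
    // groups out of the box -- the imbalance HUDI-7979 proposes to fix.
    System.out.println("file group map budget = " + fileGroupMap + " bytes");
  }
}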



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7971) Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader

2024-07-10 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7971:
--------------------------------------
Description: 
Let's ensure the 1.x reader is fully compatible with reading any 0.14.x to 0.16.x tables.

 

Readers :  1.x
 # Spark SQL
 # Spark Datasource
 # Trino/Presto
 # Hive
 # Flink

Writer: 0.16

Table State:
 * COW
 ** few write commits 
 ** Pending clustering
 ** Completed Clustering
 ** Failed writes with no rollbacks
 ** Insert overwrite table/partition
 ** Savepoint for Time-travel query

 * MOR
 ** Same as COW
 ** Pending and completed async compaction (with log-files and no base file)
 ** Custom Payloads (for MOR snapshot queries) (e.g. SQL Expression Payload)
 ** Rollback formats - DELETE, rollback block

Other knobs:
 # Metadata enabled/disabled
 # Column Stats enabled/disabled and data-skipping enabled/disabled
 # RLI enabled with eq/IN queries

 # Non-Partitioned dataset
 # CDC Reads 
 # Incremental Reads
 # Time-travel query

 

What to test?
 # Query Results Correctness
 # Performance : See the benefit of 
 # Partition Pruning
 # Metadata table - col stats, RLI

 

Corner Case Testing:

 
 # Schema Evolution with different file-groups having different generations of schema
 # Dynamic Partition Pruning
 # Does column projection work correctly when reading log files?

  was:
Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x 
tables 

 

Readers :  1.x
 # Spark SQL
 # Spark Datasource
 # Trino/Presto
 # Hive
 # Flink

Writer: 0.16

Table State:
 * COW
 * Pending clustering
 * Completed Clustering
 * Failed writes with no rollbacks
 * Insert overwrite table/partition
 * Savepoint for Time-travel query


 * MOR
 * Same as COW
 * Pending and completed async compaction (with log-files and no base file)
 * Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
 * Rollback formats - DELETE, rollback block

Other knobs:
 # Metadata enabled/disabled
 # Column Stats enabled/disabled and data-skipping enabled/disabled
 # RLI enabled with eq/IN queries


 # Non-Partitioned dataset
 # CDC Reads 
 # Incremental Reads
 # Time-travel query

 

What to test ?
 # Query Results Correctness
 # Performance : See the benefit of 
 # Partition Pruning
 # Metadata  table - col stats, RLI,

 

Corner Case Testing:

 
 # Schema Evolution with different file-groups having different generation of 
schema
 # Dynamic Partition Pruning
 # Does Column Projection work correctly for log files reading 


> Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader 
> -
>
> Key: HUDI-7971
> URL: https://issues.apache.org/jira/browse/HUDI-7971
> Project: Apache Hudi
>  Issue Type: Sub-task
>    Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 1.0.0
>
>
> Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x 
> tables 
>  
> Readers :  1.x
>  # Spark SQL
>  # Spark Datasource
>  # Trino/Presto
>  # Hive
>  # Flink
> Writer: 0.16
> Table State:
>  * COW
>  ** few write commits 
>  ** Pending clustering
>  ** Completed Clustering
>  ** Failed writes with no rollbacks
>  ** Insert overwrite table/partition
>  ** Savepoint for Time-travel query
>  * MOR
>  ** Same as COW
>  ** Pending and completed async compaction (with log-files and no base file)
>  ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
>  ** Rollback formats - DELETE, rollback block
> Other knobs:
>  # Metadata enabled/disabled
>  # Column Stats enabled/disabled and data-skipping enabled/disabled
>  # RLI enabled with eq/IN queries
>  # Non-Partitioned dataset
>  # CDC Reads 
>  # Incremental Reads
>  # Time-travel query
>  
> What to test ?
>  # Query Results Correctness
>  # Performance : See the benefit of 
>  # Partition Pruning
>  # Metadata  table - col stats, RLI,
>  
> Corner Case Testing:
>  
>  # Schema Evolution with different file-groups having different generation of 
> schema
>  # Dynamic Partition Pruning
>  # Does Column Projection work correctly for log files reading 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7866) Pull commit metadata changes in bridge release.

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7866:
--------------------------------------
Epic Link: (was: HUDI-7856)

> Pull commit metadata changes in bridge release.
> ---
>
> Key: HUDI-7866
> URL: https://issues.apache.org/jira/browse/HUDI-7866
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Sagar Sumit
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.16.0, 1.0.0
>
>
> In 1.0.0, we changed some commit metadata to be written in avro. The scope 
> this task is to ensure that bridge release should be able to read commit 
> metadata written by 1.0.0. 
>  
> Scope could be lot more. 
> We could try to parse commit metadata at lot of adhoc places like compaction 
> planning, clean execution etc. So, we need to ensure we account for both 
> formats (json and avro) with 0.16.0 reader since we do not know if commit 
> metadata is from 0.16.0 or from 1.0. 
>  
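
One plausible shape for that dual-format handling, as a hedged sketch (parseJson/parseAvro are hypothetical hooks, not Hudi methods, and sniffing the first byte is an assumption of this sketch, not necessarily how Hudi decides):

import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Illustrative only: a reader that does not know whether commit metadata on
// disk is JSON (0.16.0-style) or Avro (1.0-style) can sniff the payload and
// dispatch. The two parser hooks are hypothetical, not real Hudi methods.
public class CommitMetadataSniffSketch<T> {
  private final Function<String, T> parseJson;
  private final Function<byte[], T> parseAvro;

  public CommitMetadataSniffSketch(Function<String, T> parseJson, Function<byte[], T> parseAvro) {
    this.parseJson = parseJson;
    this.parseAvro = parseAvro;
  }

  public T parse(byte[] bytes) {
    // JSON commit metadata starts with '{' (possibly after whitespace);
    // anything else is treated here as the Avro-serialized format.
    for (byte b : bytes) {
      if (Character.isWhitespace(b)) {
        continue;
      }
      return b == '{'
          ? parseJson.apply(new String(bytes, StandardCharsets.UTF_8))
          : parseAvro.apply(bytes);
    }
    throw new IllegalArgumentException("empty commit metadata payload");
  }

  public static void main(String[] args) {
    CommitMetadataSniffSketch<String> sniffer = new CommitMetadataSniffSketch<>(
        json -> "json (" + json.length() + " chars)",
        avro -> "avro (" + avro.length + " bytes)");
    System.out.println(sniffer.parse("{\"operationType\":\"UPSERT\"}".getBytes(StandardCharsets.UTF_8)));
    System.out.println(sniffer.parse(new byte[] {'O', 'b', 'j', 1})); // Avro container magic
  }
}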



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7866) Pull commit metadata changes in bridge release.

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan reassigned HUDI-7866:
-----------------------------------------

Assignee: Balaji Varadarajan  (was: sivabalan narayanan)

> Pull commit metadata changes in bridge release.
> ---
>
> Key: HUDI-7866
> URL: https://issues.apache.org/jira/browse/HUDI-7866
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Sagar Sumit
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.16.0, 1.0.0
>
>
> In 1.0.0, we changed some commit metadata to be written in avro. The scope 
> this task is to ensure that bridge release should be able to read commit 
> metadata written by 1.0.0. 
>  
> Scope could be lot more. 
> We could try to parse commit metadata at lot of adhoc places like compaction 
> planning, clean execution etc. So, we need to ensure we account for both 
> formats (json and avro) with 0.16.0 reader since we do not know if commit 
> metadata is from 0.16.0 or from 1.0. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7866) Pull commit metadata changes in bridge release.

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7866:
--------------------------------------
Parent: HUDI-7882
Issue Type: Sub-task  (was: Task)

> Pull commit metadata changes in bridge release.
> ---
>
> Key: HUDI-7866
> URL: https://issues.apache.org/jira/browse/HUDI-7866
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Sagar Sumit
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.16.0, 1.0.0
>
>
> In 1.0.0, we changed some commit metadata to be written in avro. The scope 
> this task is to ensure that bridge release should be able to read commit 
> metadata written by 1.0.0. 
>  
> Scope could be lot more. 
> We could try to parse commit metadata at lot of adhoc places like compaction 
> planning, clean execution etc. So, we need to ensure we account for both 
> formats (json and avro) with 0.16.0 reader since we do not know if commit 
> metadata is from 0.16.0 or from 1.0. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7973) Add table property to track list of columns being indexed in col stats

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7973:
--------------------------------------
Fix Version/s: 1.0.0

> Add table property to track list of columns being indexed in col stats 
> ---
>
> Key: HUDI-7973
> URL: https://issues.apache.org/jira/browse/HUDI-7973
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: metadata
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 1.0.0
>
>
> we need to add a new table property to track what cols are being indexed in 
> col stats. 
> If not for table property, could be a aux folder or somewhere. but we need to 
> store this state somewhere. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7973) Add table property to track list of columns being indexed in col stats

2024-07-09 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7973:
--------------------------------------

 Summary: Add table property to track list of columns being indexed in col stats 
 Key: HUDI-7973
 URL: https://issues.apache.org/jira/browse/HUDI-7973
 Project: Apache Hudi
  Issue Type: Improvement
  Components: metadata
Reporter: sivabalan narayanan


We need to add a new table property to track which columns are being indexed in column stats. 

If not a table property, it could be an aux folder or somewhere else, but we need to store this state somewhere. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7973) Add table property to track list of columns being indexed in col stats

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7973:
--------------------------------------
Epic Link: (was: HUDI-7856)

> Add table property to track list of columns being indexed in col stats 
> ---
>
> Key: HUDI-7973
> URL: https://issues.apache.org/jira/browse/HUDI-7973
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: metadata
>Reporter: sivabalan narayanan
>Priority: Major
>
> we need to add a new table property to track what cols are being indexed in 
> col stats. 
> If not for table property, could be a aux folder or somewhere. but we need to 
> store this state somewhere. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7973) Add table property to track list of columns being indexed in col stats

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7973:
--------------------------------------
Epic Link: HUDI-7856

> Add table property to track list of columns being indexed in col stats 
> ---
>
> Key: HUDI-7973
> URL: https://issues.apache.org/jira/browse/HUDI-7973
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Priority: Major
>
> we need to add a new table property to track what cols are being indexed in 
> col stats. 
> If not for table property, could be a aux folder or somewhere. but we need to 
> store this state somewhere. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7973) Add table property to track list of columns being indexed in col stats

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7973:
--------------------------------------
Parent: HUDI-7882
Issue Type: Sub-task  (was: Improvement)

> Add table property to track list of columns being indexed in col stats 
> ---
>
> Key: HUDI-7973
> URL: https://issues.apache.org/jira/browse/HUDI-7973
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: metadata
>Reporter: sivabalan narayanan
>Priority: Major
>
> we need to add a new table property to track what cols are being indexed in 
> col stats. 
> If not for table property, could be a aux folder or somewhere. but we need to 
> store this state somewhere. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7888) Throw meaningful error when reading partial update or DV written in 1.x from 0.16.0 reader

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7888:
--------------------------------------
Fix Version/s: 1.0.0

> Throw meaningful error when reading partial update or DV written in 1.x from 
> 0.16.0 reader
> --
>
> Key: HUDI-7888
> URL: https://issues.apache.org/jira/browse/HUDI-7888
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: reader-core
>    Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Major
> Fix For: 1.0.0
>
>
> If 0.16.x reader is used to read 1.x table having partial updates/merges 
> enabled, we need to throw meaningful error to end user. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7888) Throw meaningful error when reading partial update or DV written in 1.x from 0.16.0 reader

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7888:
--------------------------------------
Description: 
If the 0.16.x reader is used to read a 1.x table with partial updates/merges enabled, we need to throw a meaningful error to the end user. 

 

 

  was:
We wanted to support reading 1.x tables in 0.16.0 reader.   

 

If 1.x table does not have any new features enabled which are backwards 
incompatible we are good. If not, if someone has enabled partial update feature 
or deletion vector support, we should parse and throw a meaningful error from 
0.16.0 reader. Lets also comb for any other additional features in 1.x and 
throw meaningful error. 

 


> Throw meaningful error when reading partial update or DV written in 1.x from 
> 0.16.0 reader
> --
>
> Key: HUDI-7888
> URL: https://issues.apache.org/jira/browse/HUDI-7888
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: reader-core
>    Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Major
>
> If 0.16.x reader is used to read 1.x table having partial updates/merges 
> enabled, we need to throw meaningful error to end user. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7972) Add fallback for deletion vector in 0.16.x reader while reading 1.x tables

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7972:
--------------------------------------
Parent: HUDI-7882
Issue Type: Sub-task  (was: Improvement)

> Add fallback for deletion vector in 0.16.x reader while reading 1.x tables
> --
>
> Key: HUDI-7972
> URL: https://issues.apache.org/jira/browse/HUDI-7972
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: reader-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: 1.0-migration
> Fix For: 1.0.0
>
>
> If 0.16.x reader is used to read a 1.x table with deletion vector, we should 
> fallback to using key based merges instead of position based merges. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7972) Add fallback for deletion vector in 0.16.x reader while reading 1.x tables

2024-07-09 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7972:
--------------------------------------

 Summary: Add fallback for deletion vector in 0.16.x reader while reading 1.x tables
 Key: HUDI-7972
 URL: https://issues.apache.org/jira/browse/HUDI-7972
 Project: Apache Hudi
  Issue Type: Improvement
  Components: reader-core
Reporter: sivabalan narayanan


If the 0.16.x reader is used to read a 1.x table with deletion vectors, we should fall back to key-based merges instead of position-based merges. 
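
A minimal sketch of that fallback decision (every name below is hypothetical; this is not Hudi's API, just the shape of the check):

// Illustrative only: how a 0.16.x-style reader might choose a merge strategy
// when it cannot interpret the position-based deletion vectors written by 1.x.
public class MergeFallbackSketch {

  enum MergeStrategy { POSITION_BASED, KEY_BASED }

  // Position-based merging needs the reader to understand record positions;
  // an older reader should degrade to record-key merging rather than fail.
  static MergeStrategy select(boolean tableHasDeletionVectors, boolean readerSupportsPositions) {
    if (tableHasDeletionVectors && !readerSupportsPositions) {
      return MergeStrategy.KEY_BASED; // the fallback proposed in HUDI-7972
    }
    return tableHasDeletionVectors ? MergeStrategy.POSITION_BASED : MergeStrategy.KEY_BASED;
  }

  public static void main(String[] args) {
    // A 0.16.x reader (no position support) on a 1.x table with deletion vectors:
    System.out.println(select(true, false)); // KEY_BASED
  }
}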

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7865) Pull table properties changes in bridge release

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7865:
--------------------------------------
Parent: HUDI-7882
Issue Type: Sub-task  (was: Task)

> Pull table properties changes in bridge release
> ---
>
> Key: HUDI-7865
> URL: https://issues.apache.org/jira/browse/HUDI-7865
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Sagar Sumit
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.16.0, 1.0.0
>
>
> In 1.0.0, we changed some table properties to have nums as value instead of 
> classnames and then added infer functions. The scope of this task is to 
> ensure that bridge release should be able to read hoodie.properties written 
> by 1.0.0.
> a. Payload enum change reference - 
> [https://github.com/apache/hudi/pull/9590/files]
> b. hoodie.record.merge.mode : ref links : #9894, #11439. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7888) Throw meaningful error when reading partial update or DV written in 1.x from 0.16.0 reader

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7888:
--------------------------------------
Epic Link: (was: HUDI-7856)

> Throw meaningful error when reading partial update or DV written in 1.x from 
> 0.16.0 reader
> --
>
> Key: HUDI-7888
> URL: https://issues.apache.org/jira/browse/HUDI-7888
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: reader-core
>    Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Major
>
> We wanted to support reading 1.x tables in 0.16.0 reader.   
>  
> If 1.x table does not have any new features enabled which are backwards 
> incompatible we are good. If not, if someone has enabled partial update 
> feature or deletion vector support, we should parse and throw a meaningful 
> error from 0.16.0 reader. Lets also comb for any other additional features in 
> 1.x and throw meaningful error. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7865) Pull table properties changes in bridge release

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7865:
--------------------------------------
Epic Link: (was: HUDI-7856)

> Pull table properties changes in bridge release
> ---
>
> Key: HUDI-7865
> URL: https://issues.apache.org/jira/browse/HUDI-7865
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Sagar Sumit
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.16.0, 1.0.0
>
>
> In 1.0.0, we changed some table properties to have nums as value instead of 
> classnames and then added infer functions. The scope of this task is to 
> ensure that bridge release should be able to read hoodie.properties written 
> by 1.0.0.
> a. Payload enum change reference - 
> [https://github.com/apache/hudi/pull/9590/files]
> b. hoodie.record.merge.mode : ref links : #9894, #11439. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7888) Throw meaningful error when reading partial update or DV written in 1.x from 0.16.0 reader

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7888:
--------------------------------------
Parent: HUDI-7882
Issue Type: Sub-task  (was: Improvement)

> Throw meaningful error when reading partial update or DV written in 1.x from 
> 0.16.0 reader
> --
>
> Key: HUDI-7888
> URL: https://issues.apache.org/jira/browse/HUDI-7888
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: reader-core
>    Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Major
>
> We wanted to support reading 1.x tables in 0.16.0 reader.   
>  
> If 1.x table does not have any new features enabled which are backwards 
> incompatible we are good. If not, if someone has enabled partial update 
> feature or deletion vector support, we should parse and throw a meaningful 
> error from 0.16.0 reader. Lets also comb for any other additional features in 
> 1.x and throw meaningful error. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7972) Add fallback for deletion vector in 0.16.x reader while reading 1.x tables

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7972:
--------------------------------------
Fix Version/s: 1.0.0

> Add fallback for deletion vector in 0.16.x reader while reading 1.x tables
> --
>
> Key: HUDI-7972
> URL: https://issues.apache.org/jira/browse/HUDI-7972
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: 1.0-migration
> Fix For: 1.0.0
>
>
> If 0.16.x reader is used to read a 1.x table with deletion vector, we should 
> fallback to using key based merges instead of position based merges. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7972) Add fallback for deletion vector in 0.16.x reader while reading 1.x tables

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7972:
--------------------------------------
Labels: 1.0-migration  (was: )

> Add fallback for deletion vector in 0.16.x reader while reading 1.x tables
> --
>
> Key: HUDI-7972
> URL: https://issues.apache.org/jira/browse/HUDI-7972
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: 1.0-migration
>
> If 0.16.x reader is used to read a 1.x table with deletion vector, we should 
> fallback to using key based merges instead of position based merges. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7887) Any log format header types changes need to be ported to 0.16.0 from 1.x

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7887:
--------------------------------------
Epic Link: (was: HUDI-7856)

> Any log format header types changes need to be ported to 0.16.0 from 1.x
> 
>
> Key: HUDI-7887
> URL: https://issues.apache.org/jira/browse/HUDI-7887
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: reader-core
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Major
>
> We wanted to support reading 1.x tables in 0.16.0 reader.   
>  
> Port any new log header metadata types introduced in 1.x to 0.16.0 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7887) Any log format header types changes need to be ported to 0.16.0 from 1.x

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7887:
--------------------------------------
Parent: HUDI-7882
Issue Type: Sub-task  (was: Improvement)

> Any log format header types changes need to be ported to 0.16.0 from 1.x
> 
>
> Key: HUDI-7887
> URL: https://issues.apache.org/jira/browse/HUDI-7887
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: reader-core
>Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Major
>
> We wanted to support reading 1.x tables in 0.16.0 reader.   
>  
> Port any new log header metadata types introduced in 1.x to 0.16.0 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7886) Make metadata payload from 1.x readable in 0.16.0

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7886:
--------------------------------------
Parent: HUDI-7882
Issue Type: Sub-task  (was: Improvement)

> Make metadata payload from 1.x readable in 0.16.0
> -
>
> Key: HUDI-7886
> URL: https://issues.apache.org/jira/browse/HUDI-7886
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Lokesh Jain
>Priority: Major
>
> We wanted to support reading 1.x tables in 0.16.0 reader.   
>  
> So, lets port over all metadata payload schema changes to 0.16.0 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7886) Make metadata payload from 1.x readable in 0.16.0

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7886:
--------------------------------------
Epic Link: (was: HUDI-7856)

> Make metadata payload from 1.x readable in 0.16.0
> -
>
> Key: HUDI-7886
> URL: https://issues.apache.org/jira/browse/HUDI-7886
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Lokesh Jain
>Priority: Major
>
> We wanted to support reading 1.x tables in 0.16.0 reader.   
>  
> So, lets port over all metadata payload schema changes to 0.16.0 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7971) Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7971:
--------------------------------------
Epic Link: (was: HUDI-7856)

> Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader 
> -
>
> Key: HUDI-7971
> URL: https://issues.apache.org/jira/browse/HUDI-7971
> Project: Apache Hudi
>  Issue Type: Sub-task
>    Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 1.0.0
>
>
> Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x 
> tables 
>  
> Readers :  1.x
>  # Spark SQL
>  # Spark Datasource
>  # Trino/Presto
>  # Hive
>  # Flink
> Writer: 0.16
> Table State:
>  * COW
>  * Pending clustering
>  * Completed Clustering
>  * Failed writes with no rollbacks
>  * Insert overwrite table/partition
>  * Savepoint for Time-travel query
>  * MOR
>  * Same as COW
>  * Pending and completed async compaction (with log-files and no base file)
>  * Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
>  * Rollback formats - DELETE, rollback block
> Other knobs:
>  # Metadata enabled/disabled
>  # Column Stats enabled/disabled and data-skipping enabled/disabled
>  # RLI enabled with eq/IN queries
>  # Non-Partitioned dataset
>  # CDC Reads 
>  # Incremental Reads
>  # Time-travel query
>  
> What to test ?
>  # Query Results Correctness
>  # Performance : See the benefit of 
>  # Partition Pruning
>  # Metadata  table - col stats, RLI,
>  
> Corner Case Testing:
>  
>  # Schema Evolution with different file-groups having different generation of 
> schema
>  # Dynamic Partition Pruning
>  # Does Column Projection work correctly for log files reading 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7971) Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader

2024-07-09 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7971:
--------------------------------------
Parent: HUDI-7882
Issue Type: Sub-task  (was: Test)

> Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader 
> -
>
> Key: HUDI-7971
> URL: https://issues.apache.org/jira/browse/HUDI-7971
> Project: Apache Hudi
>  Issue Type: Sub-task
>    Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 1.0.0
>
>
> Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x 
> tables 
>  
> Readers :  1.x
>  # Spark SQL
>  # Spark Datasource
>  # Trino/Presto
>  # Hive
>  # Flink
> Writer: 0.16
> Table State:
>  * COW
>  * Pending clustering
>  * Completed Clustering
>  * Failed writes with no rollbacks
>  * Insert overwrite table/partition
>  * Savepoint for Time-travel query
>  * MOR
>  * Same as COW
>  * Pending and completed async compaction (with log-files and no base file)
>  * Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
>  * Rollback formats - DELETE, rollback block
> Other knobs:
>  # Metadata enabled/disabled
>  # Column Stats enabled/disabled and data-skipping enabled/disabled
>  # RLI enabled with eq/IN queries
>  # Non-Partitioned dataset
>  # CDC Reads 
>  # Incremental Reads
>  # Time-travel query
>  
> What to test ?
>  # Query Results Correctness
>  # Performance : See the benefit of 
>  # Partition Pruning
>  # Metadata  table - col stats, RLI,
>  
> Corner Case Testing:
>  
>  # Schema Evolution with different file-groups having different generation of 
> schema
>  # Dynamic Partition Pruning
>  # Does Column Projection work correctly for log files reading 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7971) Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader

2024-07-09 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7971:
--------------------------------------

 Summary: Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader 
 Key: HUDI-7971
 URL: https://issues.apache.org/jira/browse/HUDI-7971
 Project: Apache Hudi
  Issue Type: Test
Reporter: sivabalan narayanan


Let's ensure the 1.x reader is fully compatible with reading any 0.14.x to 0.16.x tables.

 

Readers :  1.x
 # Spark SQL
 # Spark Datasource
 # Trino/Presto
 # Hive
 # Flink

Writer: 0.16

Table State:
 * COW
 * Pending clustering
 * Completed Clustering
 * Failed writes with no rollbacks
 * Insert overwrite table/partition
 * Savepoint for Time-travel query


 * MOR
 * Same as COW
 * Pending and completed async compaction (with log-files and no base file)
 * Custom Payloads (for MOR snapshot queries) (e.g. SQL Expression Payload)
 * Rollback formats - DELETE, rollback block

Other knobs:
 # Metadata enabled/disabled
 # Column Stats enabled/disabled and data-skipping enabled/disabled
 # RLI enabled with eq/IN queries


 # Non-Partitioned dataset
 # CDC Reads 
 # Incremental Reads
 # Time-travel query

 

What to test?
 # Query Results Correctness
 # Performance : See the benefit of 
 # Partition Pruning
 # Metadata table - col stats, RLI

 

Corner Case Testing:

 
 # Schema Evolution with different file-groups having different generations of 
the schema
 # Dynamic Partition Pruning
 # Does Column Projection work correctly for log file reads?
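
As a concrete illustration of one cell in the matrix above (1.x Spark 
Datasource reader, time-travel query, metadata table and data skipping 
enabled), a minimal check might look like the sketch below. This is a 
sketch only, not part of the ticket: the table path and instant time are 
placeholders, and it assumes the standard hoodie read options.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReaderCompatCheck {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-1x-reader-compat")
        .getOrCreate();

    // One matrix cell: Spark Datasource read with the metadata table and
    // data skipping enabled, as of a fixed instant (time-travel query).
    Dataset<Row> df = spark.read().format("hudi")
        .option("hoodie.metadata.enabled", "true")       // knob: metadata on/off
        .option("hoodie.enable.data.skipping", "true")   // knob: data skipping on/off
        .option("as.of.instant", "20240701000000000")    // placeholder instant time
        .load("/tmp/hudi_table_written_by_0_16");        // placeholder table path

    // Correctness check: compare the count against the snapshot the
    // 0.16.x writer is known to have produced.
    System.out.println("rows = " + df.count());
  }
}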



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7971) Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader

2024-07-09 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7971:
--
Fix Version/s: 1.0.0

> Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader 
> -
>
> Key: HUDI-7971
> URL: https://issues.apache.org/jira/browse/HUDI-7971
> Project: Apache Hudi
>  Issue Type: Test
>    Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 1.0.0
>
>
> Let's ensure the 1.x reader is fully compatible w/ reading any of 0.14.x to 
> 0.16.x tables. 
>  
> Readers: 1.x
>  # Spark SQL
>  # Spark Datasource
>  # Trino/Presto
>  # Hive
>  # Flink
> Writer: 0.16
> Table State:
>  * COW
>  * Pending clustering
>  * Completed Clustering
>  * Failed writes with no rollbacks
>  * Insert overwrite table/partition
>  * Savepoint for Time-travel query
>  * MOR
>  * Same as COW
>  * Pending and completed async compaction (with log-files and no base file)
>  # Custom Payloads (for MOR snapshot queries) (e.g. SQL Expression Payload)
>  * Rollback formats - DELETE, rollback block
> Other knobs:
>  # Metadata enabled/disabled
>  # Column Stats enabled/disabled and data-skipping enabled/disabled
>  # RLI enabled with eq/IN queries
>  # Non-Partitioned dataset
>  # CDC Reads 
>  # Incremental Reads
>  # Time-travel query
>  
> What to test?
>  # Query Results Correctness
>  # Performance: see the benefit of the items below
>  # Partition Pruning
>  # Metadata table - col stats, RLI
>  
> Corner Case Testing:
>  
>  # Schema Evolution with different file-groups having different generations 
> of the schema
>  # Dynamic Partition Pruning
>  # Does Column Projection work correctly for log file reads?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7882) Umbrella ticket for 1.x tables and 0.16.x compatibility

2024-07-09 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7882:
--
Description: 
We have 4 major goals w/ this umbrella ticket. 

a. 1.x reader should be capable of reading any of 0.14.x to 0.16.x tables for 
all query types. 

b. 0.16.x should be capable of reading 1.x tables for most features

c. Upgrade 0.16.x to 1.x 

d. Downgrade 1.x to 0.16.0. 

 

 

We want to support reading 1.x tables in the 0.16.0 release, so this umbrella 
ticket tracks all the required changes.

 

RFC in progress: [https://github.com/apache/hudi/pull/11514] 

 

Changes required to be ported: 
0. Creating 0.16.0 branch

0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 

 

1. Timeline 

1.a Hoodie instant parsing should be able to read 1.x instants. 
https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 

1.b Commit metadata parsing is able to handle both json and avro formats. Scope 
might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 Siva. 
(See the format-detection sketch after this message.)
1.c HoodieDefaultTimeline able to read both timelines based on table version.  
https://issues.apache.org/jira/browse/HUDI-7884 Siva.

1.d Reading LSM timeline using 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7890 Siva. 

1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901

 

2. Table property changes 

2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
https://issues.apache.org/jira/browse/HUDI-7865 LJ

 

3. MDT table changes

3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ

3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 
LJ

 

4. Log format changes

4.a All metadata header types porting 
https://issues.apache.org/jira/browse/HUDI-7887 Jon

4.b Meaningful error for incompatible features from 1.x 
https://issues.apache.org/jira/browse/HUDI-7888 Jon

 

5. Log file slice or grouping detection compatibility 

 

6. Tests 

6.a Tests to validate that 1.x tables can be read w/ 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 

 

7. Doc changes 

7.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
https://issues.apache.org/jira/browse/HUDI-7889 

  was:
We want to support reading 1.x tables in the 0.16.0 release, so this umbrella 
ticket tracks all the required changes.

 

RFC in progress: [https://github.com/apache/hudi/pull/11514] 

 

Changes required to be ported: 
0. Creating 0.16.0 branch

0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 

 

1. Timeline 

1.a Hoodie instant parsing should be able to read 1.x instants. 
https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 

1.b Commit metadata parsing is able to handle both json and avro formats. Scope 
might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  Siva.
1.c HoodieDefaultTimeline able to read both timelines based on table version.  
https://issues.apache.org/jira/browse/HUDI-7884 Siva.

1.d Reading LSM timeline using 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7890 Siva. 

1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901

 

2. Table property changes 

2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
https://issues.apache.org/jira/browse/HUDI-7865 LJ

 

3. MDT table changes

3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ

3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 
LJ

 

4. Log format changes

4.a All metadata header types porting 
https://issues.apache.org/jira/browse/HUDI-7887 Jon

4.b Meaningful error for incompatible features from 1.x 
https://issues.apache.org/jira/browse/HUDI-7888 Jon

 

5. Log file slice or grouping detection compatibility 

 

6. Tests 

6.a Tests to validate that 1.x tables can be read w/ 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 

 

7. Doc changes 

7.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
https://issues.apache.org/jira/browse/HUDI-7889 


> Umbrella ticket for 1.x tables and 0.16.x compatibility
> ---
>
> Key: HUDI-7882
> URL: https://issues.apache.org/jira/browse/HUDI-7882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: sivabalan narayanan
>    Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> We have 4 major goals w/ this umbrella ticket. 
> a. 1.x reader should be capable of reading any of 0.14.x to 0.16.x tables for 
> all query types. 
> b. 0.16.x should be capable of reading 1.x tables for most features
> c. Upgrade 0.16.x to 1.x 
> d. Downgrade 1.x to 0.16.0. 
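
For item 1.b above, a bridge reader first has to detect which generation of 
commit metadata it is looking at before parsing. A minimal sketch of such a 
format sniff follows, assuming 1.x commit metadata is written as an Avro 
container file while 0.x wrote JSON (detect() is a hypothetical helper, not 
the actual parser):

import java.nio.charset.StandardCharsets;

public class CommitMetadataFormat {
  enum Format { JSON, AVRO }

  static Format detect(byte[] content) {
    // Avro object container files start with the 4-byte magic "Obj\u0001";
    // JSON commit metadata starts with '{' (possibly after whitespace).
    if (content.length >= 4
        && content[0] == 'O' && content[1] == 'b' && content[2] == 'j' && content[3] == 1) {
      return Format.AVRO;
    }
    String head = new String(content, 0, Math.min(content.length, 16),
        StandardCharsets.UTF_8).trim();
    return head.startsWith("{") ? Format.JSON : Format.AVRO;
  }
}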

[jira] [Updated] (HUDI-7882) Umbrella ticket for 1.x tables and 0.16.x compatibility

2024-07-09 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7882:
--
Summary: Umbrella ticket for 1.x tables and 0.16.x compatibility  (was: 
Umbrella ticket 1.x tables and 0.16.x compatibility)

> Umbrella ticket for 1.x tables and 0.16.x compatibility
> ---
>
> Key: HUDI-7882
> URL: https://issues.apache.org/jira/browse/HUDI-7882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: sivabalan narayanan
>    Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> We want to support reading 1.x tables in the 0.16.0 release, so this umbrella 
> ticket tracks all the required changes.
>  
> RFC in progress: [https://github.com/apache/hudi/pull/11514] 
>  
> Changes required to be ported: 
> 0. Creating 0.16.0 branch
> 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 
>  
> 1. Timeline 
> 1.a Hoodie instant parsing should be able to read 1.x instants. 
> https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 
> 1.b Commit metadata parsing is able to handle both json and avro formats. 
> Scope might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  
> Siva.
> 1.c HoodieDefaultTimeline able to read both timelines based on table version. 
>  https://issues.apache.org/jira/browse/HUDI-7884 Siva.
> 1.d Reading LSM timeline using 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7890 Siva. 
> 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901
>  
> 2. Table property changes 
> 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
> https://issues.apache.org/jira/browse/HUDI-7865 LJ
>  
> 3. MDT table changes
> 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ
> 3.b MDT payload schema changes. 
> https://issues.apache.org/jira/browse/HUDI-7886 LJ
>  
> 4. Log format changes
> 4.a All metadata header types porting 
> https://issues.apache.org/jira/browse/HUDI-7887 Jon
> 4.b Meaningful error for incompatible features from 1.x 
> https://issues.apache.org/jira/browse/HUDI-7888 Jon
>  
> 5. Log file slice or grouping detection compatibility 
>  
> 6. Tests 
> 6.a Tests to validate that 1.x tables can be read w/ 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 
>  
> 7. Doc changes 
> 7.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
> https://issues.apache.org/jira/browse/HUDI-7889 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7882) Umbrella ticket 1.x tables and 0.16.x compatibility

2024-07-09 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7882:
--
Summary: Umbrella ticket 1.x tables and 0.16.x compatibility  (was: 
Umbrella ticket to track all changes required to support reading 1.x tables 
with 0.16.0 )

> Umbrella ticket 1.x tables and 0.16.x compatibility
> ---
>
> Key: HUDI-7882
> URL: https://issues.apache.org/jira/browse/HUDI-7882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: sivabalan narayanan
>    Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> We want to support reading 1.x tables in the 0.16.0 release, so this umbrella 
> ticket tracks all the required changes.
>  
> RFC in progress: [https://github.com/apache/hudi/pull/11514] 
>  
> Changes required to be ported: 
> 0. Creating 0.16.0 branch
> 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 
>  
> 1. Timeline 
> 1.a Hoodie instant parsing should be able to read 1.x instants. 
> https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 
> 1.b Commit metadata parsing is able to handle both json and avro formats. 
> Scope might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  
> Siva.
> 1.c HoodieDefaultTimeline able to read both timelines based on table version. 
>  https://issues.apache.org/jira/browse/HUDI-7884 Siva.
> 1.d Reading LSM timeline using 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7890 Siva. 
> 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901
>  
> 2. Table property changes 
> 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
> https://issues.apache.org/jira/browse/HUDI-7865 LJ
>  
> 3. MDT table changes
> 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ
> 3.b MDT payload schema changes. 
> https://issues.apache.org/jira/browse/HUDI-7886 LJ
>  
> 4. Log format changes
> 4.a All metadata header types porting 
> https://issues.apache.org/jira/browse/HUDI-7887 Jon
> 4.b Meaningful error for incompatible features from 1.x 
> https://issues.apache.org/jira/browse/HUDI-7888 Jon
>  
> 5. Log file slice or grouping detection compatibility 
>  
> 6. Tests 
> 6.a Tests to validate that 1.x tables can be read w/ 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 
>  
> 7. Doc changes 
> 7.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
> https://issues.apache.org/jira/browse/HUDI-7889 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7865) Pull table properties changes in bridge release

2024-07-09 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7865:
-

Assignee: Balaji Varadarajan  (was: Lokesh Jain)

> Pull table properties changes in bridge release
> ---
>
> Key: HUDI-7865
> URL: https://issues.apache.org/jira/browse/HUDI-7865
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.16.0, 1.0.0
>
>
> In 1.0.0, we changed some table properties to have enums as values instead of 
> classnames, and then added infer functions. The scope of this task is to 
> ensure that the bridge release is able to read hoodie.properties written 
> by 1.0.0.
> a. Payload enum change reference - 
> [https://github.com/apache/hudi/pull/9590/files]
> b. hoodie.record.merge.mode : ref links : #9894, #11439. 
>  
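
As an illustration of what the bridge release has to tolerate, the sketch 
below accepts both the 0.x classname value and a 1.0-style enum value for 
the payload config. This is a minimal sketch, not the HoodieTableConfig 
API; the enum-to-classname mapping shown is illustrative only.

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class PayloadConfigCompat {
  static String resolvePayloadClass(Properties tableProps) {
    String v = tableProps.getProperty("hoodie.compaction.payload.class", "");
    if (v.contains(".")) {
      return v; // 0.x style: value is already a fully qualified classname
    }
    // 1.0 style: value is an enum name; map it back to a classname
    // (illustrative mapping, not the actual infer function)
    switch (v) {
      case "OVERWRITE_WITH_LATEST":
        return "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload";
      default:
        throw new IllegalArgumentException("Unknown payload type: " + v);
    }
  }

  public static void main(String[] args) throws IOException {
    Properties props = new Properties();
    try (FileInputStream in = new FileInputStream(args[0] + "/.hoodie/hoodie.properties")) {
      props.load(in);
    }
    System.out.println(resolvePayloadClass(props));
  }
}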



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7865) Pull table properties changes in bridge release

2024-07-09 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7865:
--
Description: 
In 1.0.0, we changed some table properties to have enums as values instead of 
classnames, and then added infer functions. The scope of this task is to ensure 
that the bridge release is able to read hoodie.properties written by 1.0.0.
a. Changes made to table properties in 1.0.0 - 
[https://github.com/apache/hudi/pull/9590/files]

b. hoodie.record.merge.mode : ref links : #9894, #11439. 

 

Let's also check for any new table property changes in 1.x compared to 0.16.x, 
and discuss w/ the team how to go about it. 

  was:
In 1.0.0, we changed some table properties to have enums as values instead of 
classnames, and then added infer functions. The scope of this task is to ensure 
that the bridge release is able to read hoodie.properties written by 1.0.0.
Changes made to table properties in 1.0.0 - 
[https://github.com/apache/hudi/pull/9590/files]

 

Let's also check for any new table property changes in 1.x compared to 0.16.x, 
and discuss w/ the team how to go about it. 


> Pull table properties changes in bridge release
> ---
>
> Key: HUDI-7865
> URL: https://issues.apache.org/jira/browse/HUDI-7865
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
> Fix For: 0.16.0, 1.0.0
>
>
> In 1.0.0, we changed some table properties to have enums as values instead of 
> classnames, and then added infer functions. The scope of this task is to 
> ensure that the bridge release is able to read hoodie.properties written 
> by 1.0.0.
> a. Changes made to table properties in 1.0.0 - 
> [https://github.com/apache/hudi/pull/9590/files]
> b. hoodie.record.merge.mode : ref links : #9894, #11439. 
>  
> Let's also check for any new table property changes in 1.x compared to 0.16.x, 
> and discuss w/ the team how to go about it. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7865) Pull table properties changes in bridge release

2024-07-09 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7865:
--
Description: 
In 1.0.0, we changed some table properties to have enums as values instead of 
classnames, and then added infer functions. The scope of this task is to ensure 
that the bridge release is able to read hoodie.properties written by 1.0.0.
a. Payload enum change reference - 
[https://github.com/apache/hudi/pull/9590/files]

b. hoodie.record.merge.mode : ref links : #9894, #11439. 

 

  was:
In 1.0.0, we changed some table properties to have enums as values instead of 
classnames, and then added infer functions. The scope of this task is to ensure 
that the bridge release is able to read hoodie.properties written by 1.0.0.
a. Changes made to table properties in 1.0.0 - 
[https://github.com/apache/hudi/pull/9590/files]

b. hoodie.record.merge.mode : ref links : #9894, #11439. 

 

Let's also check for any new table property changes in 1.x compared to 0.16.x, 
and discuss w/ the team how to go about it. 


> Pull table properties changes in bridge release
> ---
>
> Key: HUDI-7865
> URL: https://issues.apache.org/jira/browse/HUDI-7865
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Lokesh Jain
>Priority: Major
> Fix For: 0.16.0, 1.0.0
>
>
> In 1.0.0, we changed some table properties to have enums as values instead of 
> classnames, and then added infer functions. The scope of this task is to 
> ensure that the bridge release is able to read hoodie.properties written 
> by 1.0.0.
> a. Payload enum change reference - 
> [https://github.com/apache/hudi/pull/9590/files]
> b. hoodie.record.merge.mode : ref links : #9894, #11439. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7968) RFC for robust handling of spark task failures and retries

2024-07-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7968:
--
Status: In Progress  (was: Open)

> RFC for robust handling of spark task failures and retries
> --
>
> Key: HUDI-7968
> URL: https://issues.apache.org/jira/browse/HUDI-7968
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>    Assignee: sivabalan narayanan
>Priority: Major
>  Labels: RobustWrites, pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-07-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7507:
--
Status: Patch Available  (was: In Progress)

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Krishen Bhan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
> Attachments: Flowchart (1).png, Flowchart.png
>
>
> *Scenarios:*
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requested plan in the same transaction, there can be a scenario where 
>  # Job 1 starts, chooses timestamp (x), Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is an ingestion commit and Job 1 is a compaction/log compaction, 
> then Job 1 can run before Job 2 and create a compaction plan for all instant 
> times (up to (x)) that doesn’t include instant time (x-1). Later Job 2 
> will create instant time (x-1), but the timeline will be in a corrupted state 
> since the compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant by Job 2 
> at (x-1) to complete. This causes Job2 to be "skipped" by clean.
>  ** If the completed commit files include some sort of "checkpointing" with 
> another "downstream job" performing incremental reads on this dataset (such 
> as Hoodie Streamer/DeltaSync) then there may be incorrect behavior, such as 
> the incremental reader skipping some completed commits (that have a smaller 
> instant timestamp than latest completed commit but were created after).
> [Edit] I added a diagram to visualize the issue, specifically the second 
> scenario with clean
> !Flowchart (1).png!
> *Proposed approach:*
> One way this can be resolved is by combining the operations of generating 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, executing the following steps whenever any instant 
> (commit, table service, etc) is scheduled
> Approach A
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp (that is 
> greater than C)
>  # Release table lock
> Unfortunately (A) has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan even 
> if it's an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time of an operation, 
> and this restriction would break any public APIs that allow this and would 
> require deprecating those APIs.
>  
> An alternate approach is to have every operation abort creating a .requested 
> file unless it has the latest timestamp. Specifically, for any instant type, 
> whenever an operation is about to create a .requested plan on timeline, it 
> should take the table lock and assert that there are no other instants on 
> timeline that are greater than it that could cause a conflict. If that 
> assertion fails, then throw a retry-able conflict resolution exception.
> Specifically, the following steps should be followed whenever any instant 
> (commit, table service, etc) is scheduled
> Approach B
>  # Acquire table lock. Assume that the desired instant time C and requested 
> file plan metadata have already been created, regardless of whether it was 
> before this step or right after acquiring the table lock.
>  # If there are any instants on the timeline that are greater than C 
> (regardless of their operation type or state) then release the table lock 
> and throw an exception
>  # Create requested plan on timeline (as usual)
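
A minimal sketch of Approach B follows, under stated assumptions: the lock, 
the timeline view, and the requested-file write below are placeholders, not 
the actual Hudi client API.

import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

class ScheduleConflictException extends RuntimeException {
  ScheduleConflictException(String msg) { super(msg); }
}

public class ApproachB {
  private final ReentrantLock tableLock = new ReentrantLock();

  /** Atomically publish a .requested plan only if no greater instant exists. */
  void createRequested(String instantTime, byte[] plan, List<String> activeInstants) {
    tableLock.lock();
    try {
      // Step 2: abort if any instant on the timeline is greater than ours,
      // regardless of its operation type or state.
      for (String other : activeInstants) {
        if (other.compareTo(instantTime) > 0) {
          throw new ScheduleConflictException(
              "Instant " + other + " > " + instantTime + "; retry with a new timestamp");
        }
      }
      // Step 3: safe to publish the requested plan on the timeline.
      writeRequestedFile(instantTime, plan); // placeholder for the actual write
    } finally {
      tableLock.unlock();
    }
  }

  private void writeRequestedFile(String instantTime, byte[] plan) { /* elided */ }
}

Assuming fixed-width numeric instant times, the lexicographic comparison above 
matches chronological order.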

[jira] [Assigned] (HUDI-7967) Robust handling of spark task failures and retries

2024-07-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7967:
-

Assignee: sivabalan narayanan

> Robust handling of spark task failures and retries 
> ---
>
> Key: HUDI-7967
> URL: https://issues.apache.org/jira/browse/HUDI-7967
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: reader-core, writer-core
>Reporter: sivabalan narayanan
>    Assignee: sivabalan narayanan
>Priority: Major
>  Labels: RobustWrites, pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7503) Concurrent executions of table service plan should not corrupt dataset

2024-07-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7503:
--
Status: Patch Available  (was: In Progress)

> Concurrent executions of table service plan should not corrupt dataset
> --
>
> Key: HUDI-7503
> URL: https://issues.apache.org/jira/browse/HUDI-7503
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction, table-service
>Reporter: Krishen Bhan
>    Assignee: sivabalan narayanan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> Some external workflow schedulers can accidentally misbehave and 
> schedule duplicate executions of the same compaction plan. We need a way to 
> guard against this inside Hudi (vs the user taking a lock externally). In such 
> a world, 2 instances of the job concurrently call 
> `org.apache.hudi.client.BaseHoodieTableServiceClient#compact` on the same 
> compaction instant. 
> This happens because one writer might execute the instant and create an inflight, 
> while the other writer sees the inflight and tries to roll it back before 
> re-attempting to execute it (since it assumes said inflight was a 
> previously failed compaction attempt).
> This logic should be updated such that only one writer will actually execute 
> the compaction plan at a time (and the others will fail/abort).
> One approach is to use a transaction (base table lock) in conjunction with 
> heartbeating, to ensure that the writer triggers a heartbeat before executing 
> compaction, and any concurrent writers will use the heartbeat to check whether 
> the compaction is currently being executed by another writer. Specifically, 
> the compact API should execute the following steps
>  # Get the instant to compact C (as usual)
>  # Start a transaction
>  # Check if C has an active heartbeat; if so, finish the transaction and throw 
> exception
>  # Start a heartbeat for C (this will implicitly re-start the heartbeat if it 
> has been started before by another job)
>  # Finish transaction
>  # Run the existing compact API logic on C 
>  # If execution succeeds, clean up the heartbeat file. If it fails, do nothing 
> (as the heartbeat will anyway be automatically expired later).
> Note that this approach only holds the table lock temporarily, when 
> checking/starting the heartbeat
> Also, this flow can be applied to execution of clean plans and other table 
> services
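
A hedged sketch of that guard follows (the names below are placeholders, not 
the BaseHoodieTableServiceClient API): the table lock is held only for the 
check-and-start of the heartbeat, so a long-running compaction never blocks 
other writers.

public class GuardedCompaction {
  interface Heartbeat {
    boolean isActive(String instant);
    void start(String instant);  // implicitly restarts if started before
    void stop(String instant);
  }

  private final Object tableLock = new Object();
  private final Heartbeat heartbeat;

  GuardedCompaction(Heartbeat heartbeat) { this.heartbeat = heartbeat; }

  void compact(String instant, Runnable compactionLogic) {
    synchronized (tableLock) {                  // steps 2-5: short transaction
      if (heartbeat.isActive(instant)) {
        throw new IllegalStateException(
            "Compaction " + instant + " is already being executed by another writer");
      }
      heartbeat.start(instant);
    }
    compactionLogic.run();                      // step 6: run outside the lock
    heartbeat.stop(instant);                    // step 7: clean up on success only;
                                                // on failure the heartbeat just expires
  }
}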



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7968) RFC for robust handling of spark task failures and retries

2024-07-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7968:
--
Sprint: 2024/06/17-30

> RFC for robust handling of spark task failures and retries
> --
>
> Key: HUDI-7968
> URL: https://issues.apache.org/jira/browse/HUDI-7968
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>    Assignee: sivabalan narayanan
>Priority: Major
>  Labels: RobustWrites, pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7961) Optimize UpsertPartitioner for prepped write operations

2024-07-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7961:
--
Status: In Progress  (was: Open)

> Optimize UpsertPartitioner for prepped write operations
> ---
>
> Key: HUDI-7961
> URL: https://issues.apache.org/jira/browse/HUDI-7961
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>    Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> We have avg record size calculation etc. in UpsertPartitioner, which does not 
> make sense for prepped write operations. Also, w/ MDT, we can optimize 
> these. 
>  
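
For illustration, the optimization amounts to something like the sketch below 
(the enum and method names are placeholders, not the actual UpsertPartitioner 
internals): prepped records already carry their file locations, so the 
timeline sampling that estimates average record size can be skipped entirely.

public class PreppedAwareSizing {
  enum WriteOperationType { UPSERT, UPSERT_PREPPED, INSERT, INSERT_PREPPED }

  /** Prepped operations already know their target file groups. */
  static boolean needsRecordSizeEstimate(WriteOperationType op) {
    return op != WriteOperationType.UPSERT_PREPPED
        && op != WriteOperationType.INSERT_PREPPED;
  }

  static long averageRecordSize(WriteOperationType op) {
    if (!needsRecordSizeEstimate(op)) {
      // Buckets come from the records' existing file locations; skip the
      // (potentially expensive) commit-metadata sampling entirely.
      return 0L;
    }
    return estimateFromCommitMetadata();
  }

  static long estimateFromCommitMetadata() {
    return 1024L; // placeholder for the real timeline sampling
  }
}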



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7644) Add record key info with RLI validation in MDT Validator

2024-07-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7644.
-
Resolution: Fixed

> Add record key info with RLI validation in MDT Validator
> 
>
> Key: HUDI-7644
> URL: https://issues.apache.org/jira/browse/HUDI-7644
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata, tests-ci
>Reporter: sivabalan narayanan
>    Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.16.0, 1.0.0
>
>
> Add record key info with RLI validation in MDT Validator



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7961) Optimize UpsertPartitioner for prepped write operations

2024-07-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7961:
--
Status: Patch Available  (was: In Progress)

> Optimize UpsertPartitioner for prepped write operations
> ---
>
> Key: HUDI-7961
> URL: https://issues.apache.org/jira/browse/HUDI-7961
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>    Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> We have avg record size calculation etc. in UpsertPartitioner, which does not 
> make sense for prepped write operations. Also, w/ MDT, we can optimize 
> these. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7961) Optimize UpsertPartitioner for prepped write operations

2024-07-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7961:
--
Fix Version/s: 1.0.0

> Optimize UpsertPartitioner for prepped write operations
> ---
>
> Key: HUDI-7961
> URL: https://issues.apache.org/jira/browse/HUDI-7961
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>    Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> We have avg record size calculation etc. in UpsertPartitioner, which does not 
> make sense for prepped write operations. Also, w/ MDT, we can optimize 
> these. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7961) Optimize UpsertPartitioner for prepped write operations

2024-07-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7961:
-

Assignee: sivabalan narayanan

> Optimize UpsertPartitioner for prepped write operations
> ---
>
> Key: HUDI-7961
> URL: https://issues.apache.org/jira/browse/HUDI-7961
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>    Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> We have avg record size calculation etc. in UpsertPartitioner, which does not 
> make sense for prepped write operations. Also, w/ MDT, we can optimize 
> these. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7968) RFC for robust handling of spark task failures and retries

2024-07-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7968:
-

Assignee: sivabalan narayanan

> RFC for robust handling of spark task failures and retries
> --
>
> Key: HUDI-7968
> URL: https://issues.apache.org/jira/browse/HUDI-7968
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>    Assignee: sivabalan narayanan
>Priority: Major
>  Labels: RobustWrites, pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7967) Robust handling of spark task failures and retries

2024-07-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7967:
--
Labels: RobustWrites  (was: )

> Robust handling of spark task failures and retries 
> ---
>
> Key: HUDI-7967
> URL: https://issues.apache.org/jira/browse/HUDI-7967
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: reader-core, writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: RobustWrites
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7968) RFC for robust handling of spark task failures and retries

2024-07-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7968:
--
Labels: RobustWrites  (was: )

> RFC for robust handling of spark task failures and retries
> --
>
> Key: HUDI-7968
> URL: https://issues.apache.org/jira/browse/HUDI-7968
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: RobustWrites
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7968) RFC for robust handling of spark task failures and retries

2024-07-08 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7968:
-

 Summary: RFC for robust handling of spark task failures and retries
 Key: HUDI-7968
 URL: https://issues.apache.org/jira/browse/HUDI-7968
 Project: Apache Hudi
  Issue Type: Improvement
  Components: writer-core
Reporter: sivabalan narayanan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6416) Completion Markers for handling spark retries

2024-07-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-6416:
--
Epic Link: HUDI-7967

> Completion Markers for handling spark retries
> -
>
> Key: HUDI-6416
> URL: https://issues.apache.org/jira/browse/HUDI-6416
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Balajee Nagasubramaniam
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> During spark stage retries, the spark driver may have all the information to 
> reconcile the commit and proceed with next steps, while a stray executor may 
> still be writing to a data file and complete later (before the JVM exit). 
> Extra files left on the dataset, excluded from the reconcile-commit step, could 
> show up as a data quality issue for query engines, with duplicate records.
> This change brings completion markers, which try to prevent the dataset from 
> experiencing data quality issues in such corner case scenarios.
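
A minimal sketch of the idea, under stated assumptions (the ".complete" 
marker suffix and the reconcile helper are hypothetical, not the actual 
marker mechanism): the driver keeps only data files whose writing task also 
published a completion marker, so output from stray executors is excluded 
deterministically.

import java.util.Set;
import java.util.stream.Collectors;

public class CompletionMarkers {
  /** Keep only files whose task also produced a completion marker. */
  static Set<String> reconcile(Set<String> dataFiles, Set<String> completionMarkers) {
    return dataFiles.stream()
        .filter(f -> completionMarkers.contains(f + ".complete"))
        .collect(Collectors.toSet());
  }
}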



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7967) Robust handling of spark task failures and retries

2024-07-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7967:
--
Summary: Robust handling of spark task failures and retries   (was: RFC for 
Robust handling of spark task failures and retries )

> Robust handling of spark task failures and retries 
> ---
>
> Key: HUDI-7967
> URL: https://issues.apache.org/jira/browse/HUDI-7967
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: reader-core, writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7967) RFC for Robust handling of spark task failures and retries

2024-07-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7967:
--
Issue Type: Epic  (was: Improvement)

> RFC for Robust handling of spark task failures and retries 
> ---
>
> Key: HUDI-7967
> URL: https://issues.apache.org/jira/browse/HUDI-7967
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: reader-core, writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7967) RFC for Robust handling of spark task failures and retries

2024-07-08 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7967:
-

 Summary: RFC for Robust handling of spark task failures and 
retries 
 Key: HUDI-7967
 URL: https://issues.apache.org/jira/browse/HUDI-7967
 Project: Apache Hudi
  Issue Type: Improvement
  Components: reader-core, writer-core
Reporter: sivabalan narayanan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch master updated: Revert "[HUDI-7709] ClassCastException while reading the data using `TimestampBasedKeyGenerator` (#11501)" (#11586)

2024-07-07 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 26ac119ee25 Revert "[HUDI-7709] ClassCastException while reading the 
data using `TimestampBasedKeyGenerator` (#11501)" (#11586)
26ac119ee25 is described below

commit 26ac119ee25f03ff079bb396b5f397ee1264c406
Author: Sagar Sumit 
AuthorDate: Mon Jul 8 10:40:34 2024 +0530

Revert "[HUDI-7709] ClassCastException while reading the data using 
`TimestampBasedKeyGenerator` (#11501)" (#11586)

This reverts commit ae1ee05ab8c2bd732e57bee11c8748926b05ec4b.
---
 .../org/apache/hudi/BaseHoodieTableFileIndex.java  |  24 +---
 .../hudi/common/table/HoodieTableConfig.java   |   2 -
 .../main/scala/org/apache/hudi/DefaultSource.scala |   3 +
 .../TestSparkSqlWithTimestampKeyGenerator.scala| 148 -
 4 files changed, 9 insertions(+), 168 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java 
b/hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java
index 9cdf1adf971..5a0fd79fcc4 100644
--- a/hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java
+++ b/hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java
@@ -19,7 +19,6 @@
 package org.apache.hudi;
 
 import org.apache.hudi.common.config.HoodieMetadataConfig;
-import org.apache.hudi.common.config.TimestampKeyGeneratorConfig;
 import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.engine.HoodieEngineContext;
 import org.apache.hudi.common.fs.FSUtils;
@@ -27,7 +26,6 @@ import org.apache.hudi.common.model.BaseFile;
 import org.apache.hudi.common.model.FileSlice;
 import org.apache.hudi.common.model.HoodieLogFile;
 import org.apache.hudi.common.model.HoodieTableQueryType;
-import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.common.table.timeline.HoodieInstant;
 import org.apache.hudi.common.table.timeline.HoodieTimeline;
@@ -42,7 +40,6 @@ import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieIOException;
 import org.apache.hudi.expression.Expression;
 import org.apache.hudi.internal.schema.Types;
-import org.apache.hudi.keygen.constant.KeyGeneratorType;
 import org.apache.hudi.metadata.HoodieTableMetadata;
 import org.apache.hudi.metadata.HoodieTableMetadataUtil;
 import org.apache.hudi.storage.HoodieStorage;
@@ -356,22 +353,13 @@ public abstract class BaseHoodieTableFileIndex implements 
AutoCloseable {
   }
 
   private Object[] parsePartitionColumnValues(String[] partitionColumns, 
String partitionPath) {
-HoodieTableConfig tableConfig = metaClient.getTableConfig();
-Object[] partitionColumnValues;
-if (null != tableConfig.getKeyGeneratorClassName()
-&& 
tableConfig.getKeyGeneratorClassName().equals(KeyGeneratorType.TIMESTAMP.getClassName())
-&& 
tableConfig.propsMap().get(TimestampKeyGeneratorConfig.TIMESTAMP_TYPE_FIELD.key()).matches("SCALAR|UNIX_TIMESTAMP|EPOCHMILLISECONDS"))
 {
-  // For TIMESTAMP key generator when TYPE is SCALAR, UNIX_TIMESTAMP or 
EPOCHMILLISECONDS,
-  // we couldn't reconstruct initial partition column values from 
partition paths due to lost data after formatting in most cases
-  partitionColumnValues = new Object[partitionColumns.length];
-} else {
-  partitionColumnValues = doParsePartitionColumnValues(partitionColumns, 
partitionPath);
-  if (shouldListLazily && partitionColumnValues.length != 
partitionColumns.length) {
-throw new HoodieException("Failed to parse partition column values 
from the partition-path:"
-+ " likely non-encoded slashes being used in partition column's 
values. You can try to"
-+ " work this around by switching listing mode to eager");
-  }
+Object[] partitionColumnValues = 
doParsePartitionColumnValues(partitionColumns, partitionPath);
+if (shouldListLazily && partitionColumnValues.length != 
partitionColumns.length) {
+  throw new HoodieException("Failed to parse partition column values from 
the partition-path:"
+  + " likely non-encoded slashes being used in partition column's 
values. You can try to"
+  + " work this around by switching listing mode to eager");
 }
+
 return partitionColumnValues;
   }
 
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java 
b/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
index 6053278d831..117b64ba29d 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig

(hudi) branch master updated: [HUDI-7966] Handle NPE from AvroSchemaUtils.createNewSchemaFromFieldsWithReference (#11585)

2024-07-07 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 6d01bcfa6a2 [HUDI-7966] Handle NPE from 
AvroSchemaUtils.createNewSchemaFromFieldsWithReference (#11585)
6d01bcfa6a2 is described below

commit 6d01bcfa6a26fb890d64802f6e4a16debb0bb11f
Author: Sagar Sumit 
AuthorDate: Mon Jul 8 10:38:23 2024 +0530

[HUDI-7966] Handle NPE from 
AvroSchemaUtils.createNewSchemaFromFieldsWithReference (#11585)
---
 .../java/org/apache/hudi/avro/AvroSchemaUtils.java | 15 +++-
 .../read/HoodieFileGroupReaderSchemaHandler.java   |  8 ++---
 .../org/apache/hudi/avro/TestAvroSchemaUtils.java  | 40 ++
 3 files changed, 56 insertions(+), 7 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java 
b/hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java
index 642624e0ed3..59917f6c652 100644
--- a/hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java
@@ -27,6 +27,8 @@ import org.apache.hudi.exception.SchemaCompatibilityException;
 
 import org.apache.avro.Schema;
 import org.apache.avro.SchemaCompatibility;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 import java.util.ArrayDeque;
 import java.util.ArrayList;
@@ -46,6 +48,8 @@ import static 
org.apache.hudi.common.util.ValidationUtils.checkState;
  */
 public class AvroSchemaUtils {
 
+  private static final Logger LOG = 
LoggerFactory.getLogger(AvroSchemaUtils.class);
+
   private AvroSchemaUtils() {}
 
   /**
@@ -302,8 +306,17 @@ public class AvroSchemaUtils {
* @return schema with fields from fields, and metadata from schema
*/
   public static Schema createNewSchemaFromFieldsWithReference(Schema schema, 
List fields) {
+if (schema == null) {
+  throw new IllegalArgumentException("Schema must not be null");
+}
 Schema newSchema = Schema.createRecord(schema.getName(), schema.getDoc(), 
schema.getNamespace(), schema.isError());
-for (Map.Entry prop : schema.getObjectProps().entrySet()) {
+Map schemaProps = Collections.emptyMap();
+try {
+  schemaProps = schema.getObjectProps();
+} catch (Exception e) {
+  LOG.warn("Error while getting object properties from schema: {}", 
schema, e);
+}
+for (Map.Entry prop : schemaProps.entrySet()) {
   newSchema.addProp(prop.getKey(), prop.getValue());
 }
 newSchema.setFields(fields);
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReaderSchemaHandler.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReaderSchemaHandler.java
index 85d0cef2893..8a962fdc2a7 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReaderSchemaHandler.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReaderSchemaHandler.java
@@ -40,6 +40,7 @@ import java.util.stream.Collectors;
 import java.util.stream.Stream;
 
 import static 
org.apache.hudi.avro.AvroSchemaUtils.appendFieldsToSchemaDedupNested;
+import static 
org.apache.hudi.avro.AvroSchemaUtils.createNewSchemaFromFieldsWithReference;
 import static org.apache.hudi.avro.AvroSchemaUtils.findNestedField;
 
 /**
@@ -178,11 +179,6 @@ public class HoodieFileGroupReaderSchemaHandler {
   Schema.Field curr = fields.get(i);
   fields.set(i, new Schema.Field(curr.name(), curr.schema(), curr.doc(), 
curr.defaultVal()));
 }
-Schema newSchema = Schema.createRecord(dataSchema.getName(), 
dataSchema.getDoc(), dataSchema.getNamespace(), dataSchema.isError());
-for (Map.Entry prop : 
dataSchema.getObjectProps().entrySet()) {
-  newSchema.addProp(prop.getKey(), prop.getValue());
-}
-newSchema.setFields(fields);
-return newSchema;
+return createNewSchemaFromFieldsWithReference(dataSchema, fields);
   }
 }
diff --git 
a/hudi-common/src/test/java/org/apache/hudi/avro/TestAvroSchemaUtils.java 
b/hudi-common/src/test/java/org/apache/hudi/avro/TestAvroSchemaUtils.java
index 37cea0d93a9..a11338d4017 100644
--- a/hudi-common/src/test/java/org/apache/hudi/avro/TestAvroSchemaUtils.java
+++ b/hudi-common/src/test/java/org/apache/hudi/avro/TestAvroSchemaUtils.java
@@ -115,6 +115,46 @@ public class TestAvroSchemaUtils {
   + "  ]\n"
   + "}\n";
 
+  @Test
+  public void testCreateNewSchemaFromFieldsWithReference_NullSchema() {
+// This test should throw an IllegalArgumentException
+assertThrows(IllegalArgumentException.class, () -> 
AvroSchemaUtils.createNewSchemaFromFieldsWithReference(null, 
Collections.emptyList()));
+  }
+
+  @Test
+  public void testCreateNewSchemaFromFieldsWithReference_NullObjectProps() {
+// Create a schema without any object properties

[jira] [Created] (HUDI-7963) Avoid generating RLI records when disabled w/ MDT

2024-07-05 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7963:
-

 Summary: Avoid generating RLI records when disabled w/ MDT
 Key: HUDI-7963
 URL: https://issues.apache.org/jira/browse/HUDI-7963
 Project: Apache Hudi
  Issue Type: Improvement
  Components: metadata
Reporter: sivabalan narayanan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7961) Optimize UpsertPartitioner for prepped write operations

2024-07-05 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7961:
-

 Summary: Optimize UpsertPartitioner for prepped write operations
 Key: HUDI-7961
 URL: https://issues.apache.org/jira/browse/HUDI-7961
 Project: Apache Hudi
  Issue Type: Improvement
  Components: writer-core
Reporter: sivabalan narayanan


We have avg record size calculation etc. in UpsertPartitioner, which does not 
make sense for prepped write operations. Also, w/ MDT, we can optimize these. 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0

2024-07-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7882:
--
Status: Patch Available  (was: In Progress)

> Umbrella ticket to track all changes required to support reading 1.x tables 
> with 0.16.0 
> 
>
> Key: HUDI-7882
> URL: https://issues.apache.org/jira/browse/HUDI-7882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>    Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> We want to support reading 1.x tables in the 0.16.0 release, so this umbrella 
> ticket tracks all the required changes.
>  
> RFC in progress: [https://github.com/apache/hudi/pull/11514] 
>  
> Changes required to be ported: 
> 0. Creating 0.16.0 branch
> 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 
>  
> 1. Timeline 
> 1.a Hoodie instant parsing should be able to read 1.x instants. 
> https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 
> 1.b Commit metadata parsing is able to handle both json and avro formats. 
> Scope might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  
> Siva.
> 1.c HoodieDefaultTimeline able to read both timelines based on table version. 
>  https://issues.apache.org/jira/browse/HUDI-7884 Siva.
> 1.d Reading LSM timeline using 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7890 Siva. 
> 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901
>  
> 2. Table property changes 
> 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
> https://issues.apache.org/jira/browse/HUDI-7865 LJ
>  
> 3. MDT table changes
> 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ
> 3.b MDT payload schema changes. 
> https://issues.apache.org/jira/browse/HUDI-7886 LJ
>  
> 4. Log format changes
> 4.a All metadata header types porting 
> https://issues.apache.org/jira/browse/HUDI-7887 Jon
> 4.b Meaningful error for incompatible features from 1.x 
> https://issues.apache.org/jira/browse/HUDI-7888 Jon
>  
> 5. Log file slice or grouping detection compatibility 
>  
> 6. Tests 
> 6.a Tests to validate that 1.x tables can be read w/ 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 
>  
> 7. Doc changes 
> 7.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
> https://issues.apache.org/jira/browse/HUDI-7889 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-07-02 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7507:
--
Status: In Progress  (was: Open)

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Krishen Bhan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
> Attachments: Flowchart (1).png, Flowchart.png
>
>
> *Scenarios:*
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requested plan in the same transaction, there can be a scenario where 
>  # Job 1 starts, chooses timestamp (x), Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is an ingestion commit and Job 1 is a compaction/log compaction, 
> then Job 1 can run before Job 2 and create a compaction plan for all instant 
> times (up to (x)) that doesn’t include instant time (x-1). Later Job 2 
> will create instant time (x-1), but the timeline will be in a corrupted state 
> since the compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant by Job 2 
> at (x-1) to complete. This causes Job2 to be "skipped" by clean.
>  ** If the completed commit files include some sort of "checkpointing" with 
> another "downstream job" performing incremental reads on this dataset (such 
> as Hoodie Streamer/DeltaSync) then there may be incorrect behavior, such as 
> the incremental reader skipping some completed commits (that have a smaller 
> instant timestamp than latest completed commit but were created after).
> [Edit] I added a diagram to visualize the issue, specifically the second 
> scenario with clean
> !Flowchart (1).png!
> *Proposed approach:*
> One way this can be resolved is by combining the operations of generating 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, executing the following steps whenever any instant 
> (commit, table service, etc) is scheduled
> Approach A
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp (that is 
> greater than C)
>  # Release table lock
> Unfortunately (A) has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan even 
> if it's an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time of an operation, 
> and this restriction would break any public APIs that allow this and would 
> require deprecating those APIs.
>  
> An alternate approach is to have every operation abort creating a .requested 
> file unless it has the latest timestamp. Specifically, for any instant type, 
> whenever an operation is about to create a .requested plan on timeline, it 
> should take the table lock and assert that there are no other instants on 
> timeline that are greater than it that could cause a conflict. If that 
> assertion fails, then throw a retry-able conflict resolution exception.
> Specifically, the following steps should be followed whenever any instant 
> (commit, table service, etc) is scheduled
> Approach B
>  # Acquire table lock. Assume that the desired instant time C and requested 
> file plan metadata have already been created, regardless of whether it was 
> before this step or right after acquiring the table lock.
>  # If there are any instants on the timeline that are greater than C 
> (regardless of their operation type or state) then release the table lock 
> and throw an exception
>  # Create requested plan on timeline (As usual)

[jira] [Updated] (HUDI-6416) Completion Markers for handling spark retries

2024-07-02 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-6416:
--
Status: In Progress  (was: Open)

> Completion Markers for handling spark retries
> -
>
> Key: HUDI-6416
> URL: https://issues.apache.org/jira/browse/HUDI-6416
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Balajee Nagasubramaniam
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> During spark stage retries, the spark driver may have all the information to 
> reconcile the commit and proceed with next steps, while a stray executor may 
> still be writing to a data file and complete later (before the JVM exit). 
> Extra files left on the dataset, excluded from the reconcile-commit step, could 
> show up as a data quality issue for query engines, with duplicate records.
> This change brings completion markers, which try to prevent the dataset from 
> experiencing data quality issues in such corner case scenarios.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0

2024-07-01 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7882:
--
Description: 
We wanted to support reading 1.x tables in 0.16.0 release. So, creating this 
umbrella ticket to track all of them.

 

RFC in progress: [https://github.com/apache/hudi/pull/11514] 

 

Changes required to be ported: 
0. Creating 0.16.0 branch

0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 

 

1. Timeline 

1.a Hoodie instant parsing should be able to read 1.x instants. 
https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 

1.b Commit metadata parsing is able to handle both json and avro formats. Scope 
might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  Siva.
1.c HoodieDefaultTimeline able to read both timelines based on table version.  
https://issues.apache.org/jira/browse/HUDI-7884 Siva.

1.d Reading LSM timeline using 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7890 Siva. 

1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901

 

2. Table property changes 

2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
https://issues.apache.org/jira/browse/HUDI-7865 LJ

 

3. MDT table changes

3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ

3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 
LJ

 

4. Log format changes

4.a All metadata header types porting 
https://issues.apache.org/jira/browse/HUDI-7887 Jon

4.b Meaningful error for incompatible features from 1.x 
https://issues.apache.org/jira/browse/HUDI-7888 Jon

 

5. Log file slice or grouping detection compatibility 

 

6. Tests 

6.a Tests to validate that 1.x tables can be read w/ 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 

 

7. Doc changes 

7.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
https://issues.apache.org/jira/browse/HUDI-7889 

  was:
We wanted to support reading 1.x tables in 0.16.0 release. So, creating this 
umbrella ticket to track all of them.

 

Changes required to be ported: 
0. Creating 0.16.0 branch

0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 

 

1. Timeline 

1.a Hoodie instant parsing should be able to read 1.x instants. 
https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 

1.b Commit metadata parsing is able to handle both json and avro formats. Scope 
might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  Siva.
1.c HoodieDefaultTimeline able to read both timelines based on table version.  
https://issues.apache.org/jira/browse/HUDI-7884 Siva.

1.d Reading LSM timeline using 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7890 Siva. 

1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901

 

2. Table property changes 

2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
https://issues.apache.org/jira/browse/HUDI-7865 LJ

 

3. MDT table changes

3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ

3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 
LJ

 

4. Log format changes

4.a All metadata header types porting 
https://issues.apache.org/jira/browse/HUDI-7887 Jon

4.b Meaningful error for incompatible features from 1.x 
https://issues.apache.org/jira/browse/HUDI-7888 Jon

 

5. Log file slice or grouping detection compatibility 

 

6. Tests 

6.a Tests to validate that 1.x tables can be read w/ 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 

 

7. Doc changes 

7.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
https://issues.apache.org/jira/browse/HUDI-7889 


> Umbrella ticket to track all changes required to support reading 1.x tables 
> with 0.16.0 
> 
>
> Key: HUDI-7882
> URL: https://issues.apache.org/jira/browse/HUDI-7882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>    Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.16.0, 1.0.0
>
>
> We wanted to support reading 1.x tables in 0.16.0 release. So, creating this 
> umbrella ticket to track all of them.
>  
> RFC in progress: [https://github.com/apache/hudi/pull/11514] 
>  
> Changes required to be ported: 
> 0. Creating 0.16.0 branch
> 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 
>  
> 1. Timeline 
> 1.a Hoodie instant parsing should be able to read 1.x instants. 
> https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 
> 1.b Commit metadata parsing is able to handle both json and avro formats. 
> Sco

[jira] [Updated] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-07-01 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7507:
--
Sprint: 2024/06/17-30

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Krishen Bhan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
> Attachments: Flowchart (1).png, Flowchart.png
>
>
> *Scenarios:*
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requested plan in the same transaction, there can be a scenario where:
>  # Job 1 starts and chooses timestamp (x); Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is an ingestion commit and Job 1 is a compaction/log compaction, 
> then if Job 1 runs before Job 2, it can create a compaction plan for all instant 
> times (up to (x)) that doesn't include instant time (x-1). Later, Job 2 
> will create instant time (x-1), but the timeline will be in a corrupted state 
> since the compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job 2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant by Job 2 
> at (x-1) to complete. This causes Job 2 to be "skipped" by clean.
>  ** If the completed commit files include some sort of "checkpointing", with 
> another "downstream job" performing incremental reads on this dataset (such 
> as Hoodie Streamer/DeltaSync), then there may be incorrect behavior, such as 
> the incremental reader skipping some completed commits (that have a smaller 
> instant timestamp than the latest completed commit but were created after it).
> [Edit] I added a diagram to visualize the issue, specifically the second 
> scenario with clean
> !Flowchart (1).png!
> *Proposed approach:*
> One way this can be resolved is by combining the operations of generating an 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, execute the following steps whenever any instant 
> (commit, table service, etc.) is scheduled:
> Approach A
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp (that is 
> greater than C)
>  # Release table lock
> Unfortunately (A) has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan even 
> if it's an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time for an operation; 
> this restriction would break any public APIs that allow it and would 
> require deprecating those APIs.
>  
> An alternate approach is to have every operation abort creating a .requested 
> file unless it has the latest timestamp. Specifically, for any instant type, 
> whenever an operation is about to create a .requested plan on the timeline, it 
> should take the table lock and assert that there is no other instant on the 
> timeline greater than it that could cause a conflict. If that 
> assertion fails, then throw a retry-able conflict resolution exception.
> The following steps should be followed whenever any instant 
> (commit, table service, etc.) is scheduled:
> Approach B
>  # Acquire table lock. Assume that the desired instant time C and requested 
> file plan metadata have already been created, regardless of whether that was 
> before this step or right after acquiring the table lock.
>  # If there are any instants on the timeline that are greater than C 
> (regardless of their operation type or state) then release the table lock 
> and throw an exception
>  # Create requested plan on timeline (As usual)
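
A minimal sketch of Approach A, assuming stand-in types (the lock, timeline 
list, and numeric timestamp format below are illustrative placeholders, not 
Hudi's actual client API). The point is that the timestamp is generated and the 
.requested file created inside one critical section, so no concurrent writer 
can later pick a smaller instant time:

import java.util.List;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

// Sketch only: generate the instant time and create the requested file atomically.
class ApproachAScheduler {
  private final ReentrantLock tableLock = new ReentrantLock(); // stands in for the table-level lock
  private final List<String> activeInstantTimes;               // stands in for the active timeline

  ApproachAScheduler(List<String> activeInstantTimes) {
    this.activeInstantTimes = activeInstantTimes;
  }

  String schedule(Consumer<String> createPlanAndRequestedFile) {
    tableLock.lock();                                          // 1. acquire table lock
    try {
      String latest = activeInstantTimes.stream()              // 2. latest instant C, completed or not
          .max(String::compareTo).orElse("0");
      String newInstant = timestampAfter(latest);              //    generate a timestamp after C
      createPlanAndRequestedFile.accept(newInstant);           // 3. plan + .requested, still under the lock
      return newInstant;
    } finally {
      tableLock.unlock();                                      // 4. release table lock
    }
  }

  private static String timestampAfter(String latest) {
    // Stand-in for Hudi's formatted instant times: take the max of "now" and
    // latest + 1 so the new instant is strictly greater than C.
    long floor = Long.parseLong(latest) + 1;
    return String.valueOf(Math.max(System.currentTimeMillis(), floor));
  }
}

The first drawback called out above is visible here: createPlanAndRequestedFile 
runs while the lock is held, so an expensive plan computation blocks every other 
writer.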

[jira] [Assigned] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-07-01 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7507:
-

Assignee: sivabalan narayanan

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Krishen Bhan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
> Attachments: Flowchart (1).png, Flowchart.png
>
>
> *Scenarios:*
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requested plan in the same transaction, there can be a scenario where:
>  # Job 1 starts and chooses timestamp (x); Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is an ingestion commit and Job 1 is a compaction/log compaction, 
> then if Job 1 runs before Job 2, it can create a compaction plan for all instant 
> times (up to (x)) that doesn't include instant time (x-1). Later, Job 2 
> will create instant time (x-1), but the timeline will be in a corrupted state 
> since the compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job 2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant by Job 2 
> at (x-1) to complete. This causes Job 2 to be "skipped" by clean.
>  ** If the completed commit files include some sort of "checkpointing", with 
> another "downstream job" performing incremental reads on this dataset (such 
> as Hoodie Streamer/DeltaSync), then there may be incorrect behavior, such as 
> the incremental reader skipping some completed commits (that have a smaller 
> instant timestamp than the latest completed commit but were created after it).
> [Edit] I added a diagram to visualize the issue, specifically the second 
> scenario with clean
> !Flowchart (1).png!
> *Proposed approach:*
> One way this can be resolved is by combining the operations of generating an 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, execute the following steps whenever any instant 
> (commit, table service, etc.) is scheduled:
> Approach A
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp (that is 
> greater than C)
>  # Release table lock
> Unfortunately (A) has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan even 
> if it's an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time for an operation; 
> this restriction would break any public APIs that allow it and would 
> require deprecating those APIs.
>  
> An alternate approach is to have every operation abort creating a .requested 
> file unless it has the latest timestamp. Specifically, for any instant type, 
> whenever an operation is about to create a .requested plan on the timeline, it 
> should take the table lock and assert that there is no other instant on the 
> timeline greater than it that could cause a conflict. If that 
> assertion fails, then throw a retry-able conflict resolution exception.
> The following steps should be followed whenever any instant 
> (commit, table service, etc.) is scheduled:
> Approach B
>  # Acquire table lock. Assume that the desired instant time C and requested 
> file plan metadata have already been created, regardless of whether that was 
> before this step or right after acquiring the table lock.
>  # If there are any instants on the timeline that are greater than C 
> (regardless of their operation type or state) then release the table lock 
> and throw an exception
>  # Create requested plan on timeline (As usual)

[jira] [Assigned] (HUDI-6416) Completion Markers for handling spark retries

2024-07-01 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-6416:
-

Assignee: sivabalan narayanan

> Completion Markers for handling spark retries
> -
>
> Key: HUDI-6416
> URL: https://issues.apache.org/jira/browse/HUDI-6416
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Balajee Nagasubramaniam
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> During spark stage retries, the spark driver may have all the information to 
> reconcile the commit and proceed with the next steps, while a stray executor may 
> still be writing to a data file and complete later (before the JVM exits). 
> Extra files left on the dataset, excluded from the reconcile-commit step, could 
> show up as a data quality issue (duplicate records) for query engines.
> This change brings completion markers, which try to prevent the dataset from 
> experiencing data quality issues in such corner-case scenarios.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6416) Completion Markers for handling spark retries

2024-07-01 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-6416:
--
Sprint: 2024/06/17-30

> Completion Markers for handling spark retries
> -
>
> Key: HUDI-6416
> URL: https://issues.apache.org/jira/browse/HUDI-6416
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Balajee Nagasubramaniam
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> During spark stage retries, the spark driver may have all the information to 
> reconcile the commit and proceed with the next steps, while a stray executor may 
> still be writing to a data file and complete later (before the JVM exits). 
> Extra files left on the dataset, excluded from the reconcile-commit step, could 
> show up as a data quality issue (duplicate records) for query engines.
> This change brings completion markers, which try to prevent the dataset from 
> experiencing data quality issues in such corner-case scenarios.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch master updated: [HUDI-7903] Fix storage partition stats index to skip data (#11472)

2024-06-29 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new eeafa734a6c [HUDI-7903] Fix storage partition stats index to skip data 
(#11472)
eeafa734a6c is described below

commit eeafa734a6cfb431b608cc67f8dcfe66249d7ec3
Author: Sagar Sumit 
AuthorDate: Sat Jun 29 20:08:11 2024 +0530

[HUDI-7903] Fix storage partition stats index to skip data (#11472)

* [HUDI-7903] Fix storage partition stats index to skip data

* Fix SQL and checkstyle

* handle npe in partition stats records

* Remove decimal field due to Spark 2.4 cast issue

* Address test comments

-

Co-authored-by: Y Ethan Guo 
---
 .../hudi/metadata/HoodieMetadataPayload.java   |   5 +-
 .../apache/hudi/PartitionStatsIndexSupport.scala   |   4 +-
 .../TestPartitionStatsIndexWithSql.scala   | 363 +
 3 files changed, 302 insertions(+), 70 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java 
b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java
index 7b9359d5dbb..36ed57c87f5 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java
@@ -68,6 +68,7 @@ import java.util.stream.Stream;
 
 import static org.apache.hudi.avro.HoodieAvroUtils.unwrapAvroValueWrapper;
 import static org.apache.hudi.avro.HoodieAvroUtils.wrapValueIntoAvro;
+import static org.apache.hudi.common.util.StringUtils.nonEmpty;
 import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
 import static org.apache.hudi.common.util.ValidationUtils.checkArgument;
 import static org.apache.hudi.common.util.ValidationUtils.checkState;
@@ -686,10 +687,11 @@ public class HoodieMetadataPayload implements 
HoodieRecordPayload<HoodieMetadataPayload> {
   HoodieKey key = new HoodieKey(getPartitionStatsIndexKey(partitionPath, 
columnRangeMetadata.getColumnName()),
   MetadataPartitionType.PARTITION_STATS.getPartitionPath());
+  String fileName = nonEmpty(columnRangeMetadata.getFilePath()) ? new 
StoragePath(columnRangeMetadata.getFilePath()).getName() : null;
 
   HoodieMetadataPayload payload = new 
HoodieMetadataPayload(key.getRecordKey(),
   HoodieMetadataColumnStats.newBuilder()
-  .setFileName(null)
+  .setFileName(fileName)
   .setColumnName(columnRangeMetadata.getColumnName())
   
.setMinValue(wrapValueIntoAvro(columnRangeMetadata.getMinValue()))
   
.setMaxValue(wrapValueIntoAvro(columnRangeMetadata.getMaxValue()))
@@ -713,7 +715,6 @@ public class HoodieMetadataPayload implements 
HoodieRecordPayload new 
ColumnIndexID(colName).asBase64EncodedString())
+logDebug(s"Loading column stats for columns: ${targetColumns.mkString(", 
")},  Encoded column names: ${encodedTargetColumnNames.mkString(", ")}")
 val metadataRecords: HoodieData[HoodieRecord[HoodieMetadataPayload]] =
   metadataTable.getRecordsByKeyPrefixes(encodedTargetColumnNames.asJava, 
HoodieTableMetadataUtil.PARTITION_NAME_PARTITION_STATS, shouldReadInMemory)
 val columnStatsRecords: HoodieData[HoodieMetadataColumnStats] =
diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndexWithSql.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndexWithSql.scala
index 0e337cc3995..2be4a37c915 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndexWithSql.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndexWithSql.scala
@@ -19,11 +19,20 @@
 
 package org.apache.hudi.functional
 
-import org.apache.hudi.common.model.WriteOperationType
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.model.{FileSlice, WriteOperationType}
 import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
 import org.apache.hudi.common.testutils.HoodieTestUtils
-
+import org.apache.hudi.metadata.HoodieMetadataFileSystemView
+import org.apache.hudi.metadata.MetadataPartitionType.PARTITION_STATS
+import org.apache.hudi.util.JFunction
+import org.apache.hudi.{DataSourceReadOptions, HoodieFileIndex}
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
Expression, GreaterThan, LessThan, Literal}
 import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase
+import org.apache.spark.sql.types.{IntegerType, StringType}
+import org.junit.jupiter.api

(hudi) branch master updated: [HUDI-7908] HoodieFileGroupReader fails if preCombine and partition fields are the same (#11473)

2024-06-28 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 107cffca4af [HUDI-7908] HoodieFileGroupReader fails if preCombine and 
partition fields are the same (#11473)
107cffca4af is described below

commit 107cffca4aff52140dc8a59e120cf9bf695e4979
Author: Vova Kolmakov 
AuthorDate: Sat Jun 29 12:42:57 2024 +0700

[HUDI-7908] HoodieFileGroupReader fails if preCombine and partition fields 
are the same (#11473)

Co-authored-by: Vova Kolmakov 
---
 ...odieFileGroupReaderBasedParquetFileFormat.scala |  4 +-
 .../spark/sql/hudi/dml/TestInsertTable.scala   | 44 ++
 2 files changed, 47 insertions(+), 1 deletion(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala
index 0ad2bda9cf2..f7c09b007be 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala
@@ -107,7 +107,9 @@ class 
HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState,
   hadoopConf: Configuration): 
PartitionedFile => Iterator[InternalRow] = {
 //dataSchema is not always right due to spark bugs
 val partitionColumns = partitionSchema.fieldNames
-val dataSchema = 
StructType(tableSchema.structTypeSchema.fields.filterNot(f => 
partitionColumns.contains(f.name)))
+val preCombineField = 
options.getOrElse(HoodieTableConfig.PRECOMBINE_FIELD.key, "")
+val dataSchema = 
StructType(tableSchema.structTypeSchema.fields.filterNot(f => 
partitionColumns.contains(f.name)
+  && preCombineField.equals(f.name)))
 val outputSchema = StructType(requiredSchema.fields ++ 
partitionSchema.fields)
 val isCount = requiredSchema.isEmpty && !isMOR && !isIncremental
 val augmentedStorageConf = new 
HadoopStorageConfiguration(hadoopConf).getInline
diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/TestInsertTable.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/TestInsertTable.scala
index b46723e97bf..0b1d4ca8999 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/TestInsertTable.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/TestInsertTable.scala
@@ -2587,6 +2587,50 @@ class TestInsertTable extends HoodieSparkSqlTestBase {
 })
   }
 
+  test(s"Test INSERT INTO with upsert operation type") {
+if (HoodieSparkUtils.gteqSpark3_2) {
+  withTempDir { tmp =>
+Seq("mor").foreach { tableType =>
+  val tableName = generateTableName
+  spark.sql(
+s"""
+   |create table $tableName (
+   |  id int,
+   |  name string,
+   |  ts long,
+   |  price int
+   |) using hudi
+   |partitioned by (ts)
+   |tblproperties (
+   |  type = '$tableType',
+   |  primaryKey = 'id',
+   |  preCombineField = 'ts'
+   |)
+   |location '${tmp.getCanonicalPath}/$tableName'
+   |""".stripMargin
+  )
+
+  // Test insert into with upsert operation type
+  spark.sql(
+s"""
+   | insert into $tableName
+   | values (1, 'a1', 1000, 10), (2, 'a2', 2000, 20), (3, 'a3', 
3000, 30), (4, 'a4', 2000, 10), (5, 'a5', 3000, 20), (6, 'a6', 4000, 30)
+   | """.stripMargin
+  )
+  checkAnswer(s"select id, name, price, ts from $tableName where price 
> 3000")(
+Seq(6, "a6", 4000, 30)
+  )
+
+  // Test update
+  spark.sql(s"update $tableName set price = price + 1 where id = 6")
+  checkAnswer(s"select id, name, price, ts from $tableName where price 
> 3000")(
+Seq(6, "a6", 4001, 30)
+  )
+}
+  }
+}
+  }
+
   test("Test Insert Into with extraMetadata") {
 withTempDir { tmp =>
   val tableName = generateTableName



[jira] [Assigned] (HUDI-7503) Concurrent executions of table service plan should not corrupt dataset

2024-06-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7503:
-

Assignee: (was: sivabalan narayanan)

> Concurrent executions of table service plan should not corrupt dataset
> --
>
> Key: HUDI-7503
> URL: https://issues.apache.org/jira/browse/HUDI-7503
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction, table-service
>Reporter: Krishen Bhan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> Some external workflow schedulers can accidentally (or through misbehavior) 
> schedule duplicate executions of the same compaction plan. We need a way to 
> guard against this inside Hudi (vs the user taking a lock externally). In such a 
> world, 2 instances of the job concurrently call 
> `org.apache.hudi.client.BaseHoodieTableServiceClient#compact` on the same 
> compaction instant. 
> This is because one writer might execute the instant and create an inflight, 
> while the other writer sees the inflight and tries to roll it back before 
> re-attempting to execute it (since it will assume said inflight was a 
> previously failed compaction attempt).
> This logic should be updated such that only one writer will actually execute 
> the compaction plan at a time (and the others will fail/abort).
> One approach is to use a transaction (base table lock) in conjunction with 
> heartbeating, to ensure that the writer triggers a heartbeat before executing 
> compaction, and any concurrent writers will use the heartbeat to check whether 
> the compaction is currently being executed by another writer. Specifically, 
> the compact API should execute the following steps:
>  # Get the instant to compact C (as usual)
>  # Start a transaction
>  # Check if C has an active heartbeat; if so, finish the transaction and throw an 
> exception
>  # Start a heartbeat for C (this will implicitly re-start the heartbeat if it 
> has been started before by another job)
>  # Finish transaction
>  # Run the existing compact API logic on C 
>  # If execution succeeds, clean up the heartbeat file. If it fails, do nothing 
> (as the heartbeat will anyway be automatically expired later).
> Note that this approach only holds the table lock temporarily, when 
> checking/starting the heartbeat
> Also, this flow can be applied to execution of clean plans and other table 
> services
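
A minimal sketch of the proposed heartbeat guard, with stand-in types (in a real 
deployment the lock is the configured lock provider and heartbeats are files 
under the table's metadata path; the in-memory map below is only an 
illustration):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Sketch only: ensure a single writer executes a given compaction plan at a time.
class CompactionExecutionGuard {
  private static final long HEARTBEAT_EXPIRY_MS = 120_000L;    // assumed expiry window
  private final ReentrantLock tableLock = new ReentrantLock(); // stands in for the transaction/table lock
  private final Map<String, Long> heartbeats = new ConcurrentHashMap<>(); // instant -> last beat millis

  void executeCompaction(String instant, Runnable existingCompactLogic) {
    tableLock.lock();                                          // 2. start transaction
    try {
      Long lastBeat = heartbeats.get(instant);                 // 3. active heartbeat means another
      if (lastBeat != null                                     //    writer is already executing
          && System.currentTimeMillis() - lastBeat < HEARTBEAT_EXPIRY_MS) {
        throw new IllegalStateException(
            "Compaction " + instant + " is already being executed by another writer");
      }
      heartbeats.put(instant, System.currentTimeMillis());     // 4. start (or restart) heartbeat
    } finally {
      tableLock.unlock();                                      // 5. finish transaction
    }
    existingCompactLogic.run();                                // 6. run the existing compact logic
    heartbeats.remove(instant);                                // 7. clean up only on success
  }
}

As the description notes, the table lock is held only around the heartbeat 
check/start, never across the compaction itself, and the same pattern can guard 
clean and other table service executions.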



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7503) Concurrent executions of table service plan should not corrupt dataset

2024-06-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7503:
-

Assignee: sivabalan narayanan

> Concurrent executions of table service plan should not corrupt dataset
> --
>
> Key: HUDI-7503
> URL: https://issues.apache.org/jira/browse/HUDI-7503
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction, table-service
>Reporter: Krishen Bhan
>    Assignee: sivabalan narayanan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> Some external workflow schedulers can accidentally (or through misbehavior) 
> schedule duplicate executions of the same compaction plan. We need a way to 
> guard against this inside Hudi (vs the user taking a lock externally). In such a 
> world, 2 instances of the job concurrently call 
> `org.apache.hudi.client.BaseHoodieTableServiceClient#compact` on the same 
> compaction instant. 
> This is because one writer might execute the instant and create an inflight, 
> while the other writer sees the inflight and tries to roll it back before 
> re-attempting to execute it (since it will assume said inflight was a 
> previously failed compaction attempt).
> This logic should be updated such that only one writer will actually execute 
> the compaction plan at a time (and the others will fail/abort).
> One approach is to use a transaction (base table lock) in conjunction with 
> heartbeating, to ensure that the writer triggers a heartbeat before executing 
> compaction, and any concurrent writers will use the heartbeat to check whether 
> the compaction is currently being executed by another writer. Specifically, 
> the compact API should execute the following steps:
>  # Get the instant to compact C (as usual)
>  # Start a transaction
>  # Check if C has an active heartbeat; if so, finish the transaction and throw an 
> exception
>  # Start a heartbeat for C (this will implicitly re-start the heartbeat if it 
> has been started before by another job)
>  # Finish transaction
>  # Run the existing compact API logic on C 
>  # If execution succeeds, clean up the heartbeat file. If it fails, do nothing 
> (as the heartbeat will anyway be automatically expired later).
> Note that this approach only holds the table lock temporarily, when 
> checking/starting the heartbeat
> Also, this flow can be applied to execution of clean plans and other table 
> services



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7503) Concurrent executions of table service plan should not corrupt dataset

2024-06-26 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7503:
-

Assignee: sivabalan narayanan

> Concurrent executions of table service plan should not corrupt dataset
> --
>
> Key: HUDI-7503
> URL: https://issues.apache.org/jira/browse/HUDI-7503
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction, table-service
>Reporter: Krishen Bhan
>    Assignee: sivabalan narayanan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> Some external workflow schedulers can accidentally (or through misbehavior) 
> schedule duplicate executions of the same compaction plan. We need a way to 
> guard against this inside Hudi (vs the user taking a lock externally). In such a 
> world, 2 instances of the job concurrently call 
> `org.apache.hudi.client.BaseHoodieTableServiceClient#compact` on the same 
> compaction instant. 
> This is because one writer might execute the instant and create an inflight, 
> while the other writer sees the inflight and tries to roll it back before 
> re-attempting to execute it (since it will assume said inflight was a 
> previously failed compaction attempt).
> This logic should be updated such that only one writer will actually execute 
> the compaction plan at a time (and the others will fail/abort).
> One approach is to use a transaction (base table lock) in conjunction with 
> heartbeating, to ensure that the writer triggers a heartbeat before executing 
> compaction, and any concurrent writers will use the heartbeat to check whether 
> the compaction is currently being executed by another writer. Specifically, 
> the compact API should execute the following steps:
>  # Get the instant to compact C (as usual)
>  # Start a transaction
>  # Check if C has an active heartbeat; if so, finish the transaction and throw an 
> exception
>  # Start a heartbeat for C (this will implicitly re-start the heartbeat if it 
> has been started before by another job)
>  # Finish transaction
>  # Run the existing compact API logic on C 
>  # If execution succeeds, clean up the heartbeat file. If it fails, do nothing 
> (as the heartbeat will anyway be automatically expired later).
> Note that this approach only holds the table lock temporarily, when 
> checking/starting the heartbeat
> Also, this flow can be applied to execution of clean plans and other table 
> services



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch branch-0.x updated: [HUDI-6508] Support compilation on Java 11 (#11513)

2024-06-25 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/branch-0.x by this push:
 new 538e6619ed5 [HUDI-6508] Support compilation on Java 11 (#11513)
538e6619ed5 is described below

commit 538e6619ed50cd64d12652058e4b5c68cfef0f99
Author: Y Ethan Guo 
AuthorDate: Tue Jun 25 18:56:01 2024 -0700

[HUDI-6508] Support compilation on Java 11 (#11513)
---
 .github/workflows/bot.yml  | 167 +++--
 .../hudi/table/TestHoodieMergeOnReadTable.java |   8 +-
 .../commit/TestCopyOnWriteActionExecutor.java  |  15 +-
 .../hudi/metadata/HoodieTableMetadataUtil.java |  21 ++-
 hudi-examples/hudi-examples-common/pom.xml |  14 --
 hudi-examples/hudi-examples-java/pom.xml   |  14 --
 .../org/apache/hudi/common/util/ParquetUtils.java  |  21 +--
 7 files changed, 186 insertions(+), 74 deletions(-)

diff --git a/.github/workflows/bot.yml b/.github/workflows/bot.yml
index 72200c4822d..5d659123f13 100644
--- a/.github/workflows/bot.yml
+++ b/.github/workflows/bot.yml
@@ -245,12 +245,6 @@ jobs:
   - scalaProfile: "scala-2.12"
 sparkProfile: "spark3.4"
 sparkModules: "hudi-spark-datasource/hudi-spark3.4.x"
-  - scalaProfile: "scala-2.12"
-sparkProfile: "spark3.5"
-sparkModules: "hudi-spark-datasource/hudi-spark3.5.x"
-  - scalaProfile: "scala-2.13"
-sparkProfile: "spark3.5"
-sparkModules: "hudi-spark-datasource/hudi-spark3.5.x"
 
 steps:
   - uses: actions/checkout@v3
@@ -285,7 +279,6 @@ jobs:
   SCALA_PROFILE: ${{ matrix.scalaProfile }}
   SPARK_PROFILE: ${{ matrix.sparkProfile }}
   SPARK_MODULES: ${{ matrix.sparkModules }}
-if: ${{ !endsWith(env.SPARK_PROFILE, '3.2') }} # skip test spark 3.2 
as it's covered by Azure CI
 run:
   mvn test -Punit-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-DwildcardSuites=skipScalaTests -DfailIfNoTests=false -pl 
"hudi-common,$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS
   - name: Java FT - Spark
@@ -293,7 +286,6 @@ jobs:
   SCALA_PROFILE: ${{ matrix.scalaProfile }}
   SPARK_PROFILE: ${{ matrix.sparkProfile }}
   SPARK_MODULES: ${{ matrix.sparkModules }}
-if: ${{ !endsWith(env.SPARK_PROFILE, '3.2') }} # skip test spark 3.2 
as it's covered by Azure CI
 run:
   mvn test -Pfunctional-tests -Pjava17 -D"$SCALA_PROFILE" 
-D"$SPARK_PROFILE" -pl "$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS
 
@@ -308,6 +300,49 @@ jobs:
   - scalaProfile: "scala-2.12"
 sparkProfile: "spark3.4"
 sparkModules: "hudi-spark-datasource/hudi-spark3.4.x"
+
+steps:
+  - uses: actions/checkout@v3
+  - name: Set up JDK 8
+uses: actions/setup-java@v3
+with:
+  java-version: '8'
+  distribution: 'temurin'
+  architecture: x64
+  cache: maven
+  - name: Build Project
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+run:
+  mvn clean install -T 2 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-DskipTests=true $MVN_ARGS -am -pl 
"hudi-examples/hudi-examples-spark,hudi-common,$SPARK_COMMON_MODULES,$SPARK_MODULES"
+  - name: Set up JDK 17
+uses: actions/setup-java@v3
+with:
+  java-version: '17'
+  distribution: 'temurin'
+  architecture: x64
+  cache: maven
+  - name: Scala UT - Common & Spark
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+  SPARK_MODULES: ${{ matrix.sparkModules }}
+run:
+  mvn test -Punit-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-Dtest=skipJavaTests -DfailIfNoTests=false -pl 
"hudi-common,$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS
+  - name: Scala FT - Spark
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+  SPARK_MODULES: ${{ matrix.sparkModules }}
+run:
+  mvn test -Pfunctional-tests -Pjava17 -D"$SCALA_PROFILE" 
-D"$SPARK_PROFILE" -Dtest=skipJavaTests -DfailIfNoTests=false -pl 
"$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS
+
+  test-spark-java11-17-java-tests:
+runs-on: ubuntu-latest
+strategy:
+  matrix:
+include:
   - scalaProfile: "scala-2.12"
 sparkProfile: "spark3.5"
 spa

(hudi) branch master updated (3152e47876f -> 4b7e6e41573)

2024-06-25 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 3152e47876f [MINOR] Bump JUnit version to 5.8.2 (#11511)
 add 4b7e6e41573 [HUDI-7922] Add Hudi CLI bundle for Scala 2.13 (#11495)

No new revisions were added by this update.

Summary of changes:
 .github/workflows/bot.yml |  2 +-
 .../apache/hudi/cli/commands/ArchivedCommitsCommand.java  |  8 +---
 .../org/apache/hudi/cli/commands/CompactionCommand.java   | 15 ---
 .../scala/org/apache/hudi/util/JavaScalaConverters.scala  |  8 
 scripts/release/deploy_staging_jars.sh|  3 ++-
 scripts/release/validate_staged_bundles.sh|  2 +-
 6 files changed, 25 insertions(+), 13 deletions(-)



[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0

2024-06-25 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7882:
--
Sprint: 2024/06/17-30

> Umbrella ticket to track all changes required to support reading 1.x tables 
> with 0.16.0 
> 
>
> Key: HUDI-7882
> URL: https://issues.apache.org/jira/browse/HUDI-7882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>    Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.16.0
>
>
> We wanted to support reading 1.x tables in 0.16.0 release. So, creating this 
> umbrella ticket to track all of them.
>  
> Changes required to be ported: 
> 0. Creating 0.16.0 branch
> 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 
>  
> 1. Timeline 
> 1.a Hoodie instant parsing should be able to read 1.x instants. 
> https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 
> 1.b Commit metadata parsing is able to handle both json and avro formats. 
> Scope might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  
> Siva.
> 1.c HoodieDefaultTimeline able to read both timelines based on table version. 
>  https://issues.apache.org/jira/browse/HUDI-7884 Siva.
> 1.d Reading LSM timeline using 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7890 Siva. 
> 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901
>  
> 2. Table property changes 
> 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
> https://issues.apache.org/jira/browse/HUDI-7865 LJ
>  
> 3. MDT table changes
> 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ
> 3.b MDT payload schema changes. 
> https://issues.apache.org/jira/browse/HUDI-7886 LJ
>  
> 4. Log format changes
> 4.a All metadata header types porting 
> https://issues.apache.org/jira/browse/HUDI-7887 Jon
> 4.b Meaningful error for incompatible features from 1.x 
> https://issues.apache.org/jira/browse/HUDI-7888 Jon
>  
> 5. Log file slice or grouping detection compatibility 
>  
> 6. Tests 
> 6.a Tests to validate that 1.x tables can be read w/ 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 
>  
> 7. Doc changes 
> 7.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
> https://issues.apache.org/jira/browse/HUDI-7889 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0

2024-06-25 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7882:
--
Fix Version/s: 0.16.0

> Umbrella ticket to track all changes required to support reading 1.x tables 
> with 0.16.0 
> 
>
> Key: HUDI-7882
> URL: https://issues.apache.org/jira/browse/HUDI-7882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>    Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.16.0
>
>
> We wanted to support reading 1.x tables in 0.16.0 release. So, creating this 
> umbrella ticket to track all of them.
>  
> Changes required to be ported: 
> 0. Creating 0.16.0 branch
> 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 
>  
> 1. Timeline 
> 1.a Hoodie instant parsing should be able to read 1.x instants. 
> https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 
> 1.b Commit metadata parsing is able to handle both json and avro formats. 
> Scope might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  
> Siva.
> 1.c HoodieDefaultTimeline able to read both timelines based on table version. 
>  https://issues.apache.org/jira/browse/HUDI-7884 Siva.
> 1.d Reading LSM timeline using 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7890 Siva. 
> 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901
>  
> 2. Table property changes 
> 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
> https://issues.apache.org/jira/browse/HUDI-7865 LJ
>  
> 3. MDT table changes
> 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ
> 3.b MDT payload schema changes. 
> https://issues.apache.org/jira/browse/HUDI-7886 LJ
>  
> 4. Log format changes
> 4.a All metadata header types porting 
> https://issues.apache.org/jira/browse/HUDI-7887 Jon
> 4.b Meaningful error for incompatible features from 1.x 
> https://issues.apache.org/jira/browse/HUDI-7888 Jon
>  
> 5. Log file slice or grouping detection compatibility 
>  
> 6. Tests 
> 6.a Tests to validate that 1.x tables can be read w/ 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 
>  
> 7. Doc changes 
> 7.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
> https://issues.apache.org/jira/browse/HUDI-7889 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0

2024-06-25 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7882:
-

Assignee: sivabalan narayanan

> Umbrella ticket to track all changes required to support reading 1.x tables 
> with 0.16.0 
> 
>
> Key: HUDI-7882
> URL: https://issues.apache.org/jira/browse/HUDI-7882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>    Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>
> We wanted to support reading 1.x tables in 0.16.0 release. So, creating this 
> umbrella ticket to track all of them.
>  
> Changes required to be ported: 
> 0. Creating 0.16.0 branch
> 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 
>  
> 1. Timeline 
> 1.a Hoodie instant parsing should be able to read 1.x instants. 
> https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 
> 1.b Commit metadata parsing is able to handle both json and avro formats. 
> Scope might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  
> Siva.
> 1.c HoodieDefaultTimeline able to read both timelines based on table version. 
>  https://issues.apache.org/jira/browse/HUDI-7884 Siva.
> 1.d Reading LSM timeline using 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7890 Siva. 
> 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901
>  
> 2. Table property changes 
> 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
> https://issues.apache.org/jira/browse/HUDI-7865 LJ
>  
> 3. MDT table changes
> 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ
> 3.b MDT payload schema changes. 
> https://issues.apache.org/jira/browse/HUDI-7886 LJ
>  
> 4. Log format changes
> 4.a All metadata header types porting 
> https://issues.apache.org/jira/browse/HUDI-7887 Jon
> 4.b Meaningful error for incompatible features from 1.x 
> https://issues.apache.org/jira/browse/HUDI-7888 Jon
>  
> 5. Log file slice or grouping detection compatibility 
>  
> 6. Tests 
> 6.a Tests to validate that 1.x tables can be read w/ 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 
>  
> 7. Doc changes 
> 7.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
> https://issues.apache.org/jira/browse/HUDI-7889 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7921) Chase down memory leaks in Writeclient with MDT enabled

2024-06-23 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7921:
-

 Summary: Chase down memory leaks in Writeclient with MDT enabled
 Key: HUDI-7921
 URL: https://issues.apache.org/jira/browse/HUDI-7921
 Project: Apache Hudi
  Issue Type: Improvement
  Components: metadata
Reporter: sivabalan narayanan


We see OOMs when deltastreamer runs continuously for days on end. We 
suspect some memory leaks when the metadata table is enabled. Let's try to chase 
down all of them and fix them. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7921) Chase down memory leaks in Writeclient with MDT enabled

2024-06-23 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7921:
-

Assignee: sivabalan narayanan

> Chase down memory leaks in Writeclient with MDT enabled
> ---
>
> Key: HUDI-7921
> URL: https://issues.apache.org/jira/browse/HUDI-7921
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>    Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 1.0.0
>
>
> We see OOMs when deltastreamer runs continuously for days on end. We 
> suspect some memory leaks when the metadata table is enabled. Let's try to chase 
> down all of them and fix them. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7921) Chase down memory leaks in Writeclient with MDT enabled

2024-06-23 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7921:
--
Fix Version/s: 1.0.0

> Chase down memory leaks in Writeclient with MDT enabled
> ---
>
> Key: HUDI-7921
> URL: https://issues.apache.org/jira/browse/HUDI-7921
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 1.0.0
>
>
> We see OOMs when deltastreamer runs continuously for days on end. We 
> suspect some memory leaks when the metadata table is enabled. Let's try to chase 
> down all of them and fix them. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch master updated: [MINOR][DNM] Test disabling new HFile reader (#11488)

2024-06-21 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 1ce97bae116 [MINOR][DNM] Test disabling new HFile reader (#11488)
1ce97bae116 is described below

commit 1ce97bae11655c9a33f8665c3dd53116302686ee
Author: Y Ethan Guo 
AuthorDate: Fri Jun 21 18:44:15 2024 -0700

[MINOR][DNM] Test disabling new HFile reader (#11488)
---
 .../src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java
index bb29e090ec3..a7e41098d66 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java
@@ -31,7 +31,7 @@ import javax.annotation.concurrent.Immutable;
 public class HoodieReaderConfig extends HoodieConfig {
  public static final ConfigProperty<Boolean> USE_NATIVE_HFILE_READER = ConfigProperty
   .key("_hoodie.hfile.use.native.reader")
-  .defaultValue(true)
+  .defaultValue(false)
   .markAdvanced()
   .sinceVersion("1.0.0")
   .withDocumentation("When enabled, the native HFile reader is used to 
read HFiles.  This is an internal config.");



[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0

2024-06-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7882:
--
Description: 
We wanted to support reading 1.x tables in 0.16.0 release. So, creating this 
umbrella ticket to track all of them.

 

Changes required to be ported: 
0. Creating 0.16.0 branch

0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 

 

1. Timeline 

1.a Hoodie instant parsing should be able to read 1.x instants. 
https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 

1.b Commit metadata parsing is able to handle both json and avro formats. Scope 
might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  Siva.
1.c HoodieDefaultTimeline able to read both timelines based on table version.  
https://issues.apache.org/jira/browse/HUDI-7884 Siva.

1.d Reading LSM timeline using 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7890 Siva. 

1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901

 

2. Table property changes 

2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
https://issues.apache.org/jira/browse/HUDI-7865 LJ

 

3. MDT table changes

3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ

3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 
LJ

 

4. Log format changes

4.a All metadata header types porting 
https://issues.apache.org/jira/browse/HUDI-7887 Jon

4.b Meaningful error for incompatible features from 1.x 
https://issues.apache.org/jira/browse/HUDI-7888 Jon

 

5. Log file slice or grouping detection compatibility 

 

6. Tests 

6.a Tests to validate that 1.x tables can be read w/ 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 

 

7. Doc changes 

7.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
https://issues.apache.org/jira/browse/HUDI-7889 

  was:
We wanted to support reading 1.x tables in 0.16.0 release. So, creating this 
umbrella ticket to track all of them.

 

Changes required to be ported: 
0. Creating 0.16.0 branch

0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 

 

1. Timeline 

1.a Hoodie instant parsing should be able to read 1.x instants. 
https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 

1.b Commit metadata parsing is able to handle both json and avro formats. Scope 
might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  Siva.
1.c HoodieDefaultTimeline able to read both timelines based on table version.  
https://issues.apache.org/jira/browse/HUDI-7884 Siva.

1.d Reading LSM timeline using 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7890 Siva. 

1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901

 

2. Table property changes 

2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
https://issues.apache.org/jira/browse/HUDI-7865 LJ

 

3. MDT table changes

3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ

3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 
LJ

 

4. Log format changes

4.a All metadata header types porting 
https://issues.apache.org/jira/browse/HUDI-7887 Jon

4.b Meaningful error for incompatible features from 1.x 
https://issues.apache.org/jira/browse/HUDI-7888 Jon

 

5. Tests 

5.a Tests to validate that 1.x tables can be read w/ 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 

 

6 Doc changes 

6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
https://issues.apache.org/jira/browse/HUDI-7889 


> Umbrella ticket to track all changes required to support reading 1.x tables 
> with 0.16.0 
> 
>
> Key: HUDI-7882
> URL: https://issues.apache.org/jira/browse/HUDI-7882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>    Reporter: sivabalan narayanan
>Priority: Major
>
> We wanted to support reading 1.x tables in 0.16.0 release. So, creating this 
> umbrella ticket to track all of them.
>  
> Changes required to be ported: 
> 0. Creating 0.16.0 branch
> 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 
>  
> 1. Timeline 
> 1.a Hoodie instant parsing should be able to read 1.x instants. 
> https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 
> 1.b Commit metadata parsing is able to handle both json and avro formats. 
> Scope might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  
> Siva.
> 1.c HoodieDefaultTimeline able to read both timelines based on table version. 
>  https://issues.apache.org/jira/browse/HUDI-7884 Siva.
> 1.d Reading LSM timeline using 0.16.0 
> https://i

[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0

2024-06-18 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7882:
--
Description: 
We wanted to support reading 1.x tables in 0.16.0 release. So, creating this 
umbrella ticket to track all of them.

 

Changes required to be ported: 
0. Creating 0.16.0 branch

0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 

 

1. Timeline 

1.a Hoodie instant parsing should be able to read 1.x instants. 
https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 

1.b Commit metadata parsing is able to handle both json and avro formats. Scope 
might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  Siva.
1.c HoodieDefaultTimeline able to read both timelines based on table version.  
https://issues.apache.org/jira/browse/HUDI-7884 Siva.

1.d Reading LSM timeline using 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7890 Siva. 

1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901

 

2. Table property changes 

2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
https://issues.apache.org/jira/browse/HUDI-7865 LJ

 

3. MDT table changes

3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ

3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 
LJ

 

4. Log format changes

4.a All metadata header types porting 
https://issues.apache.org/jira/browse/HUDI-7887 Jon

4.b Meaningful error for incompatible features from 1.x 
https://issues.apache.org/jira/browse/HUDI-7888 Jon

 

5. Tests 

5.a Tests to validate that 1.x tables can be read w/ 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 

 

6. Doc changes 

6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
https://issues.apache.org/jira/browse/HUDI-7889 

  was:
We want to support reading 1.x tables with the 0.16.0 release, so this 
umbrella ticket tracks all of the required changes.

 

Changes required to be ported: 
0. Creating 0.16.0 branch

0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 

 

1. Timeline 

1.a Commit instant parsing should be able to read 1.x instants. 
https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 

1.b Commit metadata parsing should be able to handle both JSON and Avro 
formats. The scope might be non-trivial. 
https://issues.apache.org/jira/browse/HUDI-7866 Siva.

1.c HoodieDefaultTimeline should be able to read both timelines based on the 
table version. https://issues.apache.org/jira/browse/HUDI-7884 Siva.

1.d Reading LSM timeline using 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7890 Siva. 

1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901

 

2. Table property changes 

2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
https://issues.apache.org/jira/browse/HUDI-7865 LJ

 

3. MDT table changes

3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ

3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 
LJ

 

4. Log format changes

4.a All metadata header types porting 
https://issues.apache.org/jira/browse/HUDI-7887 Jon

4.b Meaningful error for incompatible features from 1.x 
https://issues.apache.org/jira/browse/HUDI-7888 Jon

 

5. Tests 

5.a Tests to validate that 1.x tables can be read w/ 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 

 

6. Doc changes 

6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
https://issues.apache.org/jira/browse/HUDI-7889 


> Umbrella ticket to track all changes required to support reading 1.x tables 
> with 0.16.0 
> 
>
> Key: HUDI-7882
> URL: https://issues.apache.org/jira/browse/HUDI-7882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>    Reporter: sivabalan narayanan
>Priority: Major
>
> We want to support reading 1.x tables with the 0.16.0 release, so this 
> umbrella ticket tracks all of the required changes.
>  
> Changes required to be ported: 
> 0. Creating 0.16.0 branch
> 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 
>  
> 1. Timeline 
> 1.a Hoodie instant parsing should be able to read 1.x instants. 
> https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 
> 1.b Commit metadata parsing should be able to handle both JSON and Avro 
> formats. The scope might be non-trivial. 
> https://issues.apache.org/jira/browse/HUDI-7866 Siva.
> 1.c HoodieDefaultTimeline should be able to read both timelines based on the 
> table version. https://issues.apache.org/jira/browse/HUDI-7884 Siva.
> 1.d Reading LSM timeline using 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7890 Siva. 
> 1.e En
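
For item 1.b above (dual JSON/Avro commit metadata), a minimal sketch of how a 
reader might distinguish the two serializations before parsing: 0.x wrote 
commit metadata as JSON text, while 1.x moves to Avro bytes. All class and 
method names here are illustrative placeholders, not the actual Hudi API.

final class CommitMetadataFormatSniffer {

  // Heuristic: 0.x wrote commit metadata as JSON text, so after skipping any
  // leading whitespace the payload starts with '{'. Anything else is treated
  // as 1.x-style Avro bytes.
  static boolean looksLikeJson(byte[] bytes) {
    for (byte b : bytes) {
      if (!Character.isWhitespace(b)) {
        return b == '{';
      }
    }
    return false;
  }

  // Hypothetical dispatch point; both parse methods are placeholders.
  static Object parseCommitMetadata(byte[] bytes) {
    return looksLikeJson(bytes) ? parseJson(bytes) : parseAvro(bytes);
  }

  private static Object parseJson(byte[] bytes) { return null; /* placeholder */ }

  private static Object parseAvro(byte[] bytes) { return null; /* placeholder */ }
}

In practice the table version, rather than content sniffing, may be the 
authoritative signal; the sniffing above is only a fallback idea.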

[jira] [Updated] (HUDI-7896) Add tests to validate 1.x tables are readable with 0.16.0 hudi

2024-06-17 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7896:
--
Description: 
We need tests added to 0.16.0 Hudi to ensure it can read 1.x tables. 

 

We should have good coverage for all new log formats, different table types, 
pending compactions, pending clustering, completed clustering, failed 
commits/delta commits, pending rollbacks, pending cleans, etc., as well as 
different payloads, key generators, partitioned/non-partitioned tables, and 
both regular and global indexes.

Cover all different writers:

a. spark ds writer

b. spark sql

c. deltastreamer 

d. spark streaming writer 

 

 

 

Reader side validation:

Let's account for snapshot reads, incremental reads, and time-travel reads. 

 

  was:
We need tests added to 0.16.0 Hudi to ensure it can read 1.x tables. 

 

We should have good coverage for all new log formats, different table types, 
pending compactions, pending clustering, completed clustering, failed 
commits/delta commits, pending rollbacks, pending cleans, etc., as well as 
different payloads, key generators, and partitioned/non-partitioned tables. 

Cover all different writers:

a. spark ds writer

b. spark sql

c. deltastreamer 

d. spark streaming writer 

 

Reader side validation:

Let's account for snapshot reads, incremental reads, and time-travel reads. 

 


> Add tests to validate 1.x tables are readable with 0.16.0 hudi
> --
>
> Key: HUDI-7896
> URL: https://issues.apache.org/jira/browse/HUDI-7896
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>    Assignee: sivabalan narayanan
>Priority: Major
>
> We need tests added to 0.16.0 Hudi to ensure it can read 1.x tables. 
>  
> We should have good coverage for all new log formats, different table types, 
> pending compactions, pending clustering, completed clustering, failed 
> commits/delta commits, pending rollbacks, pending cleans, etc., as well as 
> different payloads, key generators, partitioned/non-partitioned tables, and 
> both regular and global indexes.
> Cover all different writers:
> a. spark ds writer
> b. spark sql
> c. deltastreamer 
> d. spark streaming writer 
>  
>  
>  
> Reader side validation:
> Let's account for snapshot reads, incremental reads, and time-travel reads. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
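
A minimal JUnit 5 sketch of the coverage matrix described in HUDI-7896 above, 
crossing writer types with read paths. The enum-like strings and the 
writeAndRead helper are hypothetical scaffolding, not existing test utilities.

import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;

class Read1xTablesWith016ReaderTest {

  @ParameterizedTest
  @CsvSource({
      "SPARK_DATASOURCE, SNAPSHOT",
      "SPARK_DATASOURCE, INCREMENTAL",
      "SPARK_SQL,        TIME_TRAVEL",
      "DELTASTREAMER,    SNAPSHOT",
      "SPARK_STREAMING,  INCREMENTAL"
  })
  void readerHandles1xTable(String writer, String readMode) {
    // writeAndRead is a hypothetical helper: write a 1.x table with the given
    // writer, then read it back through the 0.16.0 read path and compare rows.
    assertTrue(writeAndRead(writer, readMode));
  }

  private boolean writeAndRead(String writer, String readMode) {
    // Placeholder; a real test would spin up the writer and assert on records.
    return true;
  }
}

The full matrix in the ticket (table types, pending services, failed commits, 
payloads, key generators, index types) would extend the parameter source 
rather than multiply test methods.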


[jira] [Updated] (HUDI-7894) Umbrella ticket for 0.16.0 to 1.0.0 upgrade

2024-06-17 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7894:
--
Fix Version/s: 1.0.0

> Umbrella ticket for 0.16.0 to 1.0.0 upgrade
> ---
>
> Key: HUDI-7894
> URL: https://issues.apache.org/jira/browse/HUDI-7894
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
> Fix For: 1.0.0
>
>
> # Rewrite archived timeline as LSM timeline - HUDI-7858
>  # Test backward compatibility for table properties/timeline/log format 
> changes - create 1.0 tables as test artifacts - HUDI-7898
>  # Per log file marker - HUDI-1517



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7863) Ensure full compaction before downgrade for unsupported features

2024-06-17 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7863:
--
Fix Version/s: 1.0.0

> Ensure full compaction before downgrade for unsupported features
> 
>
> Key: HUDI-7863
> URL: https://issues.apache.org/jira/browse/HUDI-7863
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
> Fix For: 1.0.0
>
>
> While downgrading from 1.0 to 0.16.0, let's check the log headers for the 
> presence of unsupported features (partial updates, deletion vectors) and 
> trigger a full compaction. 
> We also need a custom compaction strategy as part of this. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
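
A hedged sketch of the pre-downgrade check described in HUDI-7863 above, 
assuming a hypothetical per-block feature header; none of these types exist in 
Hudi under these names.

import java.util.Collections;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class DowngradePreCheck {

  // 1.x-only features that the 0.16.0 reader cannot interpret.
  enum Feature { PARTIAL_UPDATE, DELETION_VECTOR }

  private static final Set<Feature> INCOMPATIBLE =
      EnumSet.of(Feature.PARTIAL_UPDATE, Feature.DELETION_VECTOR);

  // Each element stands for the feature flags read from one log block header.
  // If any block carries an incompatible feature, a full compaction must run
  // before the table version is lowered.
  static boolean requiresFullCompaction(List<Set<Feature>> blockHeaders) {
    return blockHeaders.stream()
        .anyMatch(header -> !Collections.disjoint(header, INCOMPATIBLE));
  }
}

The custom compaction strategy the ticket mentions would then select exactly 
the file groups whose log blocks tripped this check.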


[jira] [Updated] (HUDI-7863) Ensure full compaction before downgrade for unsupported features

2024-06-17 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7863:
--
Description: 
While downgrading from 1.0 to 0.16.0, let's check the log headers for the 
presence of unsupported features (partial updates, deletion vectors) and 
trigger a full compaction. 

We also need a custom compaction strategy as part of this. 

> Ensure full compaction before downgrade for unsupported features
> 
>
> Key: HUDI-7863
> URL: https://issues.apache.org/jira/browse/HUDI-7863
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
>
> While downgrading from 1.0 to 0.16.0, let's check the log headers for the 
> presence of unsupported features (partial updates, deletion vectors) and 
> trigger a full compaction. 
> We also need a custom compaction strategy as part of this. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7863) Ensure full compaction before downgrade for unsupported features

2024-06-17 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7863:
--
Summary: Ensure full compaction before downgrade for unsupported features  
(was: Ensure full compaction before downgrade)

> Ensure full compaction before downgrade for unsupported features
> 
>
> Key: HUDI-7863
> URL: https://issues.apache.org/jira/browse/HUDI-7863
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7896) Add tests to validate 1.x tables are readable with 0.16.0 hudi

2024-06-17 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7896:
--
Description: 
We need tests added to 0.16.0 Hudi to ensure it can read 1.x tables. 

 

We should have good coverage for all new log formats, different table types, 
pending compactions, pending clustering, completed clustering, failed 
commits/delta commits, pending rollbacks, pending cleans, etc., as well as 
different payloads, key generators, and partitioned/non-partitioned tables. 

Cover all different writers:

a. spark ds writer

b. spark sql

c. deltastreamer 

d. spark streaming writer 

 

Reader side validation:

Let's account for snapshot reads, incremental reads, and time-travel reads. 

 

  was:
We need tests added to 0.16.0 Hudi to ensure it can read 1.x tables. 

 

We should have good coverage for all new log formats, different table types, 
pending compactions, pending clustering, completed clustering, failed 
commits/delta commits, pending rollbacks, pending cleans, etc. 

 

Reader side validation:

Let's account for snapshot reads, incremental reads, and time-travel reads. 

 


> Add tests to validate 1.x tables are readable with 0.16.0 hudi
> --
>
> Key: HUDI-7896
> URL: https://issues.apache.org/jira/browse/HUDI-7896
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>    Assignee: sivabalan narayanan
>Priority: Major
>
> We need tests added to 0.16.0 Hudi to ensure it can read 1.x tables. 
>  
> We should have good coverage for all new log formats, different table types, 
> pending compactions, pending clustering, completed clustering, failed 
> commits/delta commits, pending rollbacks, pending cleans, etc., as well as 
> different payloads, key generators, and partitioned/non-partitioned tables. 
> Cover all different writers:
> a. spark ds writer
> b. spark sql
> c. deltastreamer 
> d. spark streaming writer 
>  
> Reader side validation:
> Let's account for snapshot reads, incremental reads, and time-travel reads. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7898) Add tests to validate 0.x tables are readable with 1.x hudi

2024-06-17 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7898:
--
Description: 
We need tests added to 1.x Hudi to ensure it can read 0.16.0 tables. 

 

We should have good coverage for all new log formats, different table types, 
pending compactions, pending clustering, completed clustering, failed 
commits/delta commits, pending rollbacks, pending cleans, etc., as well as 
different payloads, key generators, and partitioned/non-partitioned tables. 

Cover all different writers:

a. spark ds writer

b. spark sql

c. deltastreamer 

d. spark streaming writer 

 

Reader side validation:

Let's account for snapshot reads, incremental reads, and time-travel reads. 

 

  was:
We need tests added to 1.x Hudi to ensure it can read 0.16.0 tables. 

 

We should have good coverage for all new log formats, different table types, 
pending compactions, pending clustering, completed clustering, failed 
commits/delta commits, pending rollbacks, pending cleans, etc., as well as 
different payloads. 

 

Reader side validation:

Let's account for snapshot reads, incremental reads, and time-travel reads. 

 


> Add tests to validate 0.x tables are readable with 1.x hudi
> ---
>
> Key: HUDI-7898
> URL: https://issues.apache.org/jira/browse/HUDI-7898
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>    Assignee: sivabalan narayanan
>Priority: Major
>
> We need tests added to 1.x Hudi to ensure it can read 0.16.0 tables. 
>  
> We should have good coverage for all new log formats, different table types, 
> pending compactions, pending clustering, completed clustering, failed 
> commits/delta commits, pending rollbacks, pending cleans, etc., as well as 
> different payloads, key generators, and partitioned/non-partitioned tables. 
> Cover all different writers:
> a. spark ds writer
> b. spark sql
> c. deltastreamer 
> d. spark streaming writer 
>  
> Reader side validation:
> Let's account for snapshot reads, incremental reads, and time-travel reads. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7898) Add tests to validate 0.x tables are readable with 1.x hudi

2024-06-17 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7898:
--
Description: 
We need tests added to 1.x Hudi to ensure it can read 0.16.0 tables. 

 

We should have good coverage for all new log formats, different table types, 
pending compactions, pending clustering, completed clustering, failed 
commits/delta commits, pending rollbacks, pending cleans, etc. 

 

Reader side validation:

Let's account for snapshot reads, incremental reads, and time-travel reads. 

 

  was:
We need tests added to 0.16.0 Hudi to ensure it can read 1.x tables. 

 

We should have good coverage for all new log formats, different table types, 
pending compactions, pending clustering, completed clustering, failed 
commits/delta commits, pending rollbacks, pending cleans, etc. 

 

Reader side validation:

Let's account for snapshot reads, incremental reads, and time-travel reads. 

 


> Add tests to validate 0.x tables are readable with 1.x hudi
> ---
>
> Key: HUDI-7898
> URL: https://issues.apache.org/jira/browse/HUDI-7898
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>    Assignee: sivabalan narayanan
>Priority: Major
>
> We need tests added to 1.x Hudi to ensure it can read 0.16.0 tables. 
>  
> We should have good coverage for all new log formats, different table types, 
> pending compactions, pending clustering, completed clustering, failed 
> commits/delta commits, pending rollbacks, pending cleans, etc. 
>  
> Reader side validation:
> Let's account for snapshot reads, incremental reads, and time-travel reads. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7898) Add tests to validate 0.x tables are readable with 1.x hudi

2024-06-17 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7898:
-

 Summary: Add tests to validate 0.x tables are readable with 1.x 
hudi
 Key: HUDI-7898
 URL: https://issues.apache.org/jira/browse/HUDI-7898
 Project: Apache Hudi
  Issue Type: Improvement
  Components: tests-ci
Reporter: sivabalan narayanan
Assignee: sivabalan narayanan


We need tests added to 0.16.0 Hudi to ensure it can read 1.x tables. 

 

We should have good coverage for all new log formats, different table types, 
pending compactions, pending clustering, completed clustering, failed 
commits/delta commits, pending rollbacks, pending cleans, etc. 

 

Reader side validation:

Let's account for snapshot reads, incremental reads, and time-travel reads. 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7898) Add tests to validate 0.x tables are readable with 1.x hudi

2024-06-17 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7898:
--
Description: 
We need tests added to 1.x Hudi to ensure it can read 0.16.0 tables. 

 

We should have good coverage for all new log formats, different table types, 
pending compactions, pending clustering, completed clustering, failed 
commits/delta commits, pending rollbacks, pending cleans, etc., as well as 
different payloads. 

 

Reader side validation:

Let's account for snapshot reads, incremental reads, and time-travel reads. 

 

  was:
We need tests added to 1.x Hudi to ensure it can read 0.16.0 tables. 

 

We should have good coverage for all new log formats, different table types, 
pending compactions, pending clustering, completed clustering, failed 
commits/delta commits, pending rollbacks, pending cleans, etc. 

 

Reader side validation:

Let's account for snapshot reads, incremental reads, and time-travel reads. 

 


> Add tests to validate 0.x tables are readable with 1.x hudi
> ---
>
> Key: HUDI-7898
> URL: https://issues.apache.org/jira/browse/HUDI-7898
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>    Assignee: sivabalan narayanan
>Priority: Major
>
> We need tests added to 1.x Hudi to ensure it can read 0.16.0 tables. 
>  
> We should have good coverage for all new log formats, different table types, 
> pending compactions, pending clustering, completed clustering, failed 
> commits/delta commits, pending rollbacks, pending cleans, etc., as well as 
> different payloads. 
>  
> Reader side validation:
> Let's account for snapshot reads, incremental reads, and time-travel reads. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0

2024-06-17 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7882:
--
Description: 
We want to support reading 1.x tables with the 0.16.0 release, so this 
umbrella ticket tracks all of the required changes.

 

Changes required to be ported: 
0. Creating 0.16.0 branch

0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 

 

1. Timeline 

1.a Commit instant parsing should be able to read 1.x instants. 
https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 

1.b Commit metadata parsing should be able to handle both JSON and Avro 
formats. The scope might be non-trivial. 
https://issues.apache.org/jira/browse/HUDI-7866 Siva.

1.c HoodieDefaultTimeline should be able to read both timelines based on the 
table version. https://issues.apache.org/jira/browse/HUDI-7884 Siva.

1.d Reading LSM timeline using 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7890 Siva. 

 

2. Table property changes 

2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
https://issues.apache.org/jira/browse/HUDI-7865 LJ

 

3. MDT table changes

3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ

3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 
LJ

 

4. Log format changes

4.a All metadata header types porting 
https://issues.apache.org/jira/browse/HUDI-7887 Jon

4.b Meaningful error for incompatible features from 1.x 
https://issues.apache.org/jira/browse/HUDI-7888 Jon

 

5. Tests 

5.a Tests to validate that 1.x tables can be read w/ 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 

 

6. Doc changes 

6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
https://issues.apache.org/jira/browse/HUDI-7889 

  was:
We want to support reading 1.x tables with the 0.16.0 release, so this 
umbrella ticket tracks all of the required changes.

 

Changes required to be ported: 
0. Creating 0.16.0 branch

0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 

 

1. Timeline 

1.a Commit instant parsing should be able to read 1.x instants. 
https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 

1.b Commit metadata parsing should be able to handle both JSON and Avro 
formats. The scope might be non-trivial. 
https://issues.apache.org/jira/browse/HUDI-7866 Siva.

1.c HoodieDefaultTimeline should be able to read both timelines based on the 
table version. https://issues.apache.org/jira/browse/HUDI-7884 Siva.

1.d Reading LSM timeline using 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7890 Siva. 

 

2. Table property changes 

2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
https://issues.apache.org/jira/browse/HUDI-7865 LJ

 

3. MDT table changes

3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ

3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 
LJ

 

4. Log format changes

4.a All metadata header types porting 
https://issues.apache.org/jira/browse/HUDI-7887 Jon

4.b Meaningful error for incompatible features from 1.x 
https://issues.apache.org/jira/browse/HUDI-7888 Jon

 

5. Tests 

5.a Tests to validate that 1.x tables can be read w/ 0.16.0 
https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 

 

6. Doc changes 

5.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
https://issues.apache.org/jira/browse/HUDI-7889 


> Umbrella ticket to track all changes required to support reading 1.x tables 
> with 0.16.0 
> 
>
> Key: HUDI-7882
> URL: https://issues.apache.org/jira/browse/HUDI-7882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>    Reporter: sivabalan narayanan
>Priority: Major
>
> We want to support reading 1.x tables with the 0.16.0 release, so this 
> umbrella ticket tracks all of the required changes.
>  
> Changes required to be ported: 
> 0. Creating 0.16.0 branch
> 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 
>  
> 1. Timeline 
> 1.a Commit instant parsing should be able to read 1.x instants. 
> https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 
> 1.b Commit metadata parsing should be able to handle both JSON and Avro 
> formats. The scope might be non-trivial. 
> https://issues.apache.org/jira/browse/HUDI-7866 Siva.
> 1.c HoodieDefaultTimeline should be able to read both timelines based on the 
> table version. https://issues.apache.org/jira/browse/HUDI-7884 Siva.
> 1.d Reading LSM timeline using 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7890 Siva. 
>  
> 2. Table property changes 
> 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
> https://

[jira] [Updated] (HUDI-7884) Design HoodieDefaultTimeline to be able to read timeline both from 0.16.0 and from 1.0

2024-06-17 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7884:
--
Description: 
We want to support reading 1.x tables with the 0.16.0 reader. 

The entire timeline will be either in 0.x or in 1.x format. So, let's design 
HoodieDefaultTimeline to parse either timeline based on the table version. We 
could define Hoodie0.xDefaultTimeline and Hoodie1.xDefaultTimeline and, 
internally (within HoodieDefaultTimeline), delegate to either of them. 

 

Related PR links:

[HUDI-6776] Replace JSON with Avro bytes for commit metadata 
[https://github.com/apache/hudi/pull/9579] 

  was:
We want to support reading 1.x tables with the 0.16.0 reader. 

The entire timeline will be either in 0.x or in 1.x format. So, let's design 
HoodieDefaultTimeline to parse either timeline based on the table version. We 
could define Hoodie0.xDefaultTimeline and Hoodie1.xDefaultTimeline and, 
internally (within HoodieDefaultTimeline), delegate to either of them. 

 


> Design HoodieDefaultTimeline to be able to read timeline both from 0.16.0 and 
> from 1.0
> --
>
> Key: HUDI-7884
> URL: https://issues.apache.org/jira/browse/HUDI-7884
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>    Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>
> We want to support reading 1.x tables with the 0.16.0 reader. 
> The entire timeline will be either in 0.x or in 1.x format. So, let's design 
> HoodieDefaultTimeline to parse either timeline based on the table version. 
> We could define Hoodie0.xDefaultTimeline and Hoodie1.xDefaultTimeline and, 
> internally (within HoodieDefaultTimeline), delegate to either of them. 
>  
> Related PR links:
> [HUDI-6776] Replace JSON with Avro bytes for commit metadata 
> [https://github.com/apache/hudi/pull/9579] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
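
A small sketch of the delegation idea in HUDI-7884 above: a facade that picks 
a version-specific parser from the table version. The interface, both 
implementations, and the version cutoff are invented for illustration; the 
real Hudi classes differ.

interface VersionedTimelineParser {
  void loadInstants();
}

final class ZeroXTimelineParser implements VersionedTimelineParser {
  public void loadInstants() { /* parse 0.x instant files (JSON commit metadata) */ }
}

final class OneXTimelineParser implements VersionedTimelineParser {
  public void loadInstants() { /* parse 1.x instant files (Avro metadata, LSM archive) */ }
}

final class DefaultTimelineFacade {
  private final VersionedTimelineParser delegate;

  DefaultTimelineFacade(int tableVersion) {
    // Assumed cutoff: tables at or above the 1.x table version get the new
    // parser; everything older falls back to the 0.x parser.
    this.delegate = tableVersion >= 8 ? new OneXTimelineParser() : new ZeroXTimelineParser();
  }

  void loadInstants() {
    delegate.loadInstants();
  }
}

Keeping the version check in one constructor means every caller of the facade 
stays agnostic of which timeline format the table actually uses.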


[jira] [Assigned] (HUDI-7890) Ensure 0.16.0 reader can read LSM timeline from 1.x

2024-06-17 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7890:
-

Assignee: sivabalan narayanan

> Ensure 0.16.0 reader can read LSM timeline from 1.x
> ---
>
> Key: HUDI-7890
> URL: https://issues.apache.org/jira/browse/HUDI-7890
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: sivabalan narayanan
>    Assignee: sivabalan narayanan
>Priority: Major
>
> We are looking to support reading 1.x tables from 0.16.0 readers. 
>  
> So, let's ensure the LSM timeline reader is supported in 0.16.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
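
At a high level, reading the 1.x LSM timeline (HUDI-7890 above) means 
resolving the newest manifest and then scanning the timeline files it lists. A 
very loose sketch under that assumption; the record type and layout are 
placeholders, not the real 1.x format.

import java.util.Comparator;
import java.util.List;

final class LsmTimelineReaderSketch {

  // A manifest version plus the timeline files it references; purely
  // illustrative, not the actual on-storage layout.
  record ManifestFile(long version, List<String> dataFiles) {}

  // The manifest with the highest version describes the current snapshot of
  // the archived timeline; its files would then be scanned newest-first.
  static ManifestFile latestManifest(List<ManifestFile> manifests) {
    return manifests.stream()
        .max(Comparator.comparingLong(ManifestFile::version))
        .orElseThrow(() -> new IllegalStateException("no LSM manifest found"));
  }
}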


[jira] [Updated] (HUDI-7888) Throw meaningful error when reading partial update or DV written in 1.x from 0.16.0 reader

2024-06-17 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7888:
--
Description: 
We want to support reading 1.x tables with the 0.16.0 reader. 

 

If a 1.x table does not have any backwards-incompatible features enabled, we 
are good. If someone has enabled the partial update feature or deletion vector 
support, however, we should detect that and throw a meaningful error from the 
0.16.0 reader. Let's also comb for any other additional 1.x features and throw 
meaningful errors for them. 

 

 

  was:
We want to support reading 1.x tables with the 0.16.0 reader. 

 

If a 1.x table does not have any backwards-incompatible features enabled, we 
are good. If someone has enabled the partial update feature or deletion vector 
support, however, we should detect that and throw a meaningful error from the 
0.16.0 reader. 

 


> Throw meaningful error when reading partial update or DV written in 1.x from 
> 0.16.0 reader
> --
>
> Key: HUDI-7888
> URL: https://issues.apache.org/jira/browse/HUDI-7888
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>    Reporter: sivabalan narayanan
>Assignee: Jonathan Vexler
>Priority: Major
>
> We want to support reading 1.x tables with the 0.16.0 reader. 
>  
> If a 1.x table does not have any backwards-incompatible features enabled, we 
> are good. If someone has enabled the partial update feature or deletion 
> vector support, however, we should detect that and throw a meaningful error 
> from the 0.16.0 reader. Let's also comb for any other additional 1.x 
> features and throw meaningful errors for them. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
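
A sketch of the guard described in HUDI-7888 above: once the reader detects a 
1.x-only feature it cannot honor, it should fail fast with an actionable 
message rather than return wrong results. The boolean flags are stand-ins for 
whatever detection the log-header scan provides; the exception type is 
generic, not a specific Hudi class.

final class IncompatibleFeatureGuard {

  // Fail fast with a message that names the feature and the way out, instead
  // of silently mis-reading partially-updated records or ignoring deletes.
  static void failIfUnsupported(boolean hasPartialUpdates, boolean hasDeletionVectors) {
    if (hasPartialUpdates || hasDeletionVectors) {
      throw new IllegalStateException(
          "Table uses 1.x features unsupported by the 0.16.0 reader ("
              + (hasPartialUpdates ? "partial updates " : "")
              + (hasDeletionVectors ? "deletion vectors" : "")
              + "). Run a full compaction with a 1.x writer, or use a 1.x reader.");
    }
  }
}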

