(hudi) branch master updated: [HUDI-7979] Adjusting defaults with spillable map memory (#11611)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 9ee96655094 [HUDI-7979] Adjusting defaults with spillable map memory (#11611) 9ee96655094 is described below commit 9ee9665509451c58736b37daa686510fbd1ad536 Author: Sivabalan Narayanan AuthorDate: Thu Jul 11 22:11:19 2024 -0700 [HUDI-7979] Adjusting defaults with spillable map memory (#11611) --- .../apache/hudi/config/TestHoodieWriteConfig.java | 21 +++ .../common/config/SerializableConfiguration.java | 69 ++ .../table/view/FileSystemViewStorageConfig.java| 11 ++-- .../org/apache/hudi/exception/HoodieException.java | 48 +++ .../apache/hudi/exception/HoodieIOException.java | 44 ++ .../HoodieIncompatibleSchemaException.java | 33 +++ 6 files changed, 221 insertions(+), 5 deletions(-) diff --git a/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/config/TestHoodieWriteConfig.java b/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/config/TestHoodieWriteConfig.java index 00c9bfbd163..b39ac241407 100644 --- a/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/config/TestHoodieWriteConfig.java +++ b/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/config/TestHoodieWriteConfig.java @@ -30,6 +30,7 @@ import org.apache.hudi.common.model.HoodieTableType; import org.apache.hudi.common.model.WriteConcurrencyMode; import org.apache.hudi.common.table.HoodieTableConfig; import org.apache.hudi.common.table.marker.MarkerType; +import org.apache.hudi.common.table.view.FileSystemViewStorageConfig; import org.apache.hudi.config.HoodieWriteConfig.Builder; import org.apache.hudi.index.HoodieIndex; import org.apache.hudi.keygen.constant.KeyGeneratorOptions; @@ -553,6 +554,26 @@ public class TestHoodieWriteConfig { "Non-blocking concurrency control requires the MOR table with simple bucket index"); } + 
+  @Test
+  public void testFileSystemViewStorageConfigDefaults() {
+    HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder().withPath("/tmp").build();
+    assertEquals(FileSystemViewStorageConfig.SPILLABLE_MEMORY.defaultValue() * FileSystemViewStorageConfig.BOOTSTRAP_BASE_FILE_MEM_FRACTION.defaultValue(),
+        writeConfig.getViewStorageConfig().getMaxMemoryForBootstrapBaseFile());
+    assertEquals(FileSystemViewStorageConfig.SPILLABLE_MEMORY.defaultValue() * FileSystemViewStorageConfig.SPILLABLE_COMPACTION_MEM_FRACTION.defaultValue(),
+        writeConfig.getViewStorageConfig().getMaxMemoryForPendingCompaction());
+    assertEquals(FileSystemViewStorageConfig.SPILLABLE_MEMORY.defaultValue() * FileSystemViewStorageConfig.SPILLABLE_LOG_COMPACTION_MEM_FRACTION.defaultValue(),
+        writeConfig.getViewStorageConfig().getMaxMemoryForPendingLogCompaction());
+    assertEquals(FileSystemViewStorageConfig.SPILLABLE_MEMORY.defaultValue() * FileSystemViewStorageConfig.SPILLABLE_CLUSTERING_MEM_FRACTION.defaultValue(),
+        writeConfig.getViewStorageConfig().getMaxMemoryForPendingClusteringFileGroups());
+    assertEquals(FileSystemViewStorageConfig.SPILLABLE_MEMORY.defaultValue() * FileSystemViewStorageConfig.SPILLABLE_REPLACED_MEM_FRACTION.defaultValue(),
+        writeConfig.getViewStorageConfig().getMaxMemoryForReplacedFileGroups());
+    assertEquals(FileSystemViewStorageConfig.SPILLABLE_MEMORY.defaultValue() - writeConfig.getViewStorageConfig().getMaxMemoryForBootstrapBaseFile()
+        - writeConfig.getViewStorageConfig().getMaxMemoryForPendingCompaction() - writeConfig.getViewStorageConfig().getMaxMemoryForPendingLogCompaction()
+        - writeConfig.getViewStorageConfig().getMaxMemoryForPendingClusteringFileGroups()
+        - writeConfig.getViewStorageConfig().getMaxMemoryForReplacedFileGroups(),
+        writeConfig.getViewStorageConfig().getMaxMemoryForFileGroupMap());
+  }
+
   private HoodieWriteConfig createWriteConfig(Map configs) {
     final Properties properties = new Properties();
     configs.forEach(properties::setProperty);
diff --git
a/hudi-common/src/main/java/org/apache/hudi/common/config/SerializableConfiguration.java b/hudi-common/src/main/java/org/apache/hudi/common/config/SerializableConfiguration.java new file mode 100644 index 000..23a22e01822 --- /dev/null +++ b/hudi-common/src/main/java/org/apache/hudi/common/config/SerializableConfiguration.java @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance
[jira] [Assigned] (HUDI-7972) Add fallback for deletion vector in 0.16.x reader while reading 1.x tables
[ https://issues.apache.org/jira/browse/HUDI-7972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-7972: - Assignee: sivabalan narayanan > Add fallback for deletion vector in 0.16.x reader while reading 1.x tables > -- > > Key: HUDI-7972 > URL: https://issues.apache.org/jira/browse/HUDI-7972 > Project: Apache Hudi > Issue Type: Sub-task > Components: reader-core >Reporter: sivabalan narayanan > Assignee: sivabalan narayanan >Priority: Major > Labels: 1.0-migration > Fix For: 1.0.0 > > > If the 0.16.x reader is used to read a 1.x table with a deletion vector, we should > fall back to using key-based merges instead of position-based merges. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
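The fallback described in the ticket can be sketched as a simple strategy switch. This is a hypothetical illustration only: the real decision lives inside Hudi's file-group readers, and every name below (`MergeStrategy`, `MergeStrategySelector`, the parameters) is made up for the sketch.

```java
// Hypothetical sketch of the fallback: a 0.16.x reader that encounters a
// 1.x table containing deletion vectors cannot trust position-based merges,
// so it falls back to key-based merging. Names are illustrative, not Hudi API.
enum MergeStrategy { POSITION_BASED, KEY_BASED }

class MergeStrategySelector {
    // tableVersionMajor: major table version of the table being read.
    // hasDeletionVector: whether the file slice carries a deletion vector.
    static MergeStrategy select(int tableVersionMajor, boolean hasDeletionVector) {
        if (tableVersionMajor >= 1 && hasDeletionVector) {
            // Positions recorded by a 1.x writer cannot be reconciled by the
            // 0.16.x reader, so merge by record key instead.
            return MergeStrategy.KEY_BASED;
        }
        return MergeStrategy.POSITION_BASED;
    }
}
```

The key-based path is slower (it must hash record keys rather than line up positions) but remains correct regardless of how the 1.x writer laid out the deletion vector.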
[jira] [Updated] (HUDI-7971) Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader
[ https://issues.apache.org/jira/browse/HUDI-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7971: -- Description: Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x tables Readers : 1.x # Spark SQL # Spark Datasource # Trino/Presto # Hive # Flink Writer: 0.16 Table State: * COW ** few write commits ** Pending clustering ** Completed Clustering ** Failed writes with no rollbacks ** Insert overwrite table/partition ** Savepoint for Time-travel query * MOR ** Same as COW ** Pending and completed async compaction (with log-files and no base file) ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload) ** Log block formats - DELETE, rollback block Other knobs: # Metadata enabled/disabled (all combinations) # Column Stats enabled/disabled and data-skipping enabled/disabled # RLI enabled with eq/IN queries # Non-Partitioned dataset (all combinations) # CDC Reads # Incremental Reads # Time-travel query What to test ? 
# Query Results Correctness # Performance : See the benefit of # Partition Pruning # Metadata table - col stats, RLI, Corner Case Testing: # Schema Evolution with different file-groups having different generation of schema # Dynamic Partition Pruning # Does Column Projection work correctly for log files reading was: Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x tables Readers : 1.x # Spark SQL # Spark Datasource # Trino/Presto # Hive # Flink Writer: 0.16 Table State: * COW ** few write commits ** Pending clustering ** Completed Clustering ** Failed writes with no rollbacks ** Insert overwrite table/partition ** Savepoint for Time-travel query * MOR ** Same as COW ** Pending and completed async compaction (with log-files and no base file) ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload) ** Log block formats - DELETE, rollback block Other knobs: # Metadata enabled/disabled # Column Stats enabled/disabled and data-skipping enabled/disabled # RLI enabled with eq/IN queries # Non-Partitioned dataset # CDC Reads # Incremental Reads # Time-travel query What to test ? 
# Query Results Correctness # Performance : See the benefit of # Partition Pruning # Metadata table - col stats, RLI, Corner Case Testing: # Schema Evolution with different file-groups having different generation of schema # Dynamic Partition Pruning # Does Column Projection work correctly for log files reading > Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader > - > > Key: HUDI-7971 > URL: https://issues.apache.org/jira/browse/HUDI-7971 > Project: Apache Hudi > Issue Type: Sub-task > Reporter: sivabalan narayanan >Priority: Major > Fix For: 1.0.0 > > > Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x > tables > > Readers : 1.x > # Spark SQL > # Spark Datasource > # Trino/Presto > # Hive > # Flink > Writer: 0.16 > Table State: > * COW > ** few write commits > ** Pending clustering > ** Completed Clustering > ** Failed writes with no rollbacks > ** Insert overwrite table/partition > ** Savepoint for Time-travel query > * MOR > ** Same as COW > ** Pending and completed async compaction (with log-files and no base file) > ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload) > ** Log block formats - DELETE, rollback block > Other knobs: > # Metadata enabled/disabled (all combinations) > # Column Stats enabled/disabled and data-skipping enabled/disabled > # RLI enabled with eq/IN queries > # Non-Partitioned dataset (all combinations) > # CDC Reads > # Incremental Reads > # Time-travel query > > What to test ? > # Query Results Correctness > # Performance : See the benefit of > # Partition Pruning > # Metadata table - col stats, RLI, > > Corner Case Testing: > > # Schema Evolution with different file-groups having different generation of > schema > # Dynamic Partition Pruning > # Does Column Projection work correctly for log files reading -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7971) Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader
[ https://issues.apache.org/jira/browse/HUDI-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7971: -- Description: Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x tables Readers : 1.x # Spark SQL # Spark Datasource # Trino/Presto # Hive # Flink Writer: 0.16 Table State: * COW ** few write commits ** Pending clustering ** Completed Clustering ** Failed writes with no rollbacks ** Insert overwrite table/partition ** Savepoint for Time-travel query * MOR ** Same as COW ** Pending and completed async compaction (with log-files and no base file) ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload) ** Log block formats - DELETE, rollback block Other knobs: # Metadata enabled/disabled # Column Stats enabled/disabled and data-skipping enabled/disabled # RLI enabled with eq/IN queries # Non-Partitioned dataset # CDC Reads # Incremental Reads # Time-travel query What to test ? # Query Results Correctness # Performance : See the benefit of # Partition Pruning # Metadata table - col stats, RLI, Corner Case Testing: # Schema Evolution with different file-groups having different generation of schema # Dynamic Partition Pruning # Does Column Projection work correctly for log files reading was: Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x tables Readers : 1.x # Spark SQL # Spark Datasource # Trino/Presto # Hive # Flink Writer: 0.16 Table State: * COW ** few write commits ** Pending clustering ** Completed Clustering ** Failed writes with no rollbacks ** Insert overwrite table/partition ** Savepoint for Time-travel query * MOR ** Same as COW ** Pending and completed async compaction (with log-files and no base file) ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload) ** Rollback formats - DELETE, rollback block Other knobs: # Metadata enabled/disabled # Column Stats enabled/disabled and data-skipping enabled/disabled # 
RLI enabled with eq/IN queries # Non-Partitioned dataset # CDC Reads # Incremental Reads # Time-travel query What to test ? # Query Results Correctness # Performance : See the benefit of # Partition Pruning # Metadata table - col stats, RLI, Corner Case Testing: # Schema Evolution with different file-groups having different generation of schema # Dynamic Partition Pruning # Does Column Projection work correctly for log files reading > Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader > - > > Key: HUDI-7971 > URL: https://issues.apache.org/jira/browse/HUDI-7971 > Project: Apache Hudi > Issue Type: Sub-task > Reporter: sivabalan narayanan >Priority: Major > Fix For: 1.0.0 > > > Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x > tables > > Readers : 1.x > # Spark SQL > # Spark Datasource > # Trino/Presto > # Hive > # Flink > Writer: 0.16 > Table State: > * COW > ** few write commits > ** Pending clustering > ** Completed Clustering > ** Failed writes with no rollbacks > ** Insert overwrite table/partition > ** Savepoint for Time-travel query > * MOR > ** Same as COW > ** Pending and completed async compaction (with log-files and no base file) > ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload) > ** Log block formats - DELETE, rollback block > Other knobs: > # Metadata enabled/disabled > # Column Stats enabled/disabled and data-skipping enabled/disabled > # RLI enabled with eq/IN queries > # Non-Partitioned dataset > # CDC Reads > # Incremental Reads > # Time-travel query > > What to test ? > # Query Results Correctness > # Performance : See the benefit of > # Partition Pruning > # Metadata table - col stats, RLI, > > Corner Case Testing: > > # Schema Evolution with different file-groups having different generation of > schema > # Dynamic Partition Pruning > # Does Column Projection work correctly for log files reading -- This message was sent by Atlassian Jira (v8.20.10#820010)
(hudi) branch asf-site updated: [HUDI-7978][DOCS] Add a note on field ordering in partitioned by clause of create sql (#11610)
sivabalan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 488438fbb6a [HUDI-7978][DOCS] Add a note on field ordering in partitioned by clause of create sql (#11610) 488438fbb6a is described below commit 488438fbb6ae2f8dfcc9257016c66a38c0352171 Author: Sagar Sumit AuthorDate: Thu Jul 11 08:38:42 2024 +0530 [HUDI-7978][DOCS] Add a note on field ordering in partitioned by clause of create sql (#11610) --- website/docs/sql_ddl.md| 5 - website/versioned_docs/version-0.11.0/quick-start-guide.md | 7 +++ website/versioned_docs/version-0.11.1/quick-start-guide.md | 7 +++ website/versioned_docs/version-0.12.0/quick-start-guide.md | 7 +++ website/versioned_docs/version-0.12.1/quick-start-guide.md | 7 +++ website/versioned_docs/version-0.12.2/quick-start-guide.md | 7 +++ website/versioned_docs/version-0.12.3/quick-start-guide.md | 7 +++ website/versioned_docs/version-0.13.0/quick-start-guide.md | 7 +++ website/versioned_docs/version-0.13.1/quick-start-guide.md | 7 +++ website/versioned_docs/version-0.14.0/sql_ddl.md | 5 - website/versioned_docs/version-0.14.1/sql_ddl.md | 5 - website/versioned_docs/version-0.15.0/sql_ddl.md | 5 - 12 files changed, 72 insertions(+), 4 deletions(-) diff --git a/website/docs/sql_ddl.md b/website/docs/sql_ddl.md index 61e7d33cd7f..a85d8a7bb04 100644 --- a/website/docs/sql_ddl.md +++ b/website/docs/sql_ddl.md @@ -67,7 +67,10 @@ PARTITIONED BY (dt); ``` :::note -You can also create a table partitioned by multiple fields by supplying comma-separated field names. For, e.g., "partitioned by dt, hh" +You can also create a table partitioned by multiple fields by supplying comma-separated field names.
+When creating a table partitioned by multiple fields, ensure that you specify the columns in the `PARTITIONED BY` clause +in the same order as they appear in the `CREATE TABLE` schema. For example, for the above table, the partition fields +should be specified as `PARTITIONED BY (dt, hh)`. ::: ### Create table with record keys and ordering fields diff --git a/website/versioned_docs/version-0.11.0/quick-start-guide.md b/website/versioned_docs/version-0.11.0/quick-start-guide.md index 9f670871f6a..35336d3f4d4 100644 --- a/website/versioned_docs/version-0.11.0/quick-start-guide.md +++ b/website/versioned_docs/version-0.11.0/quick-start-guide.md @@ -298,6 +298,13 @@ partitioned by (dt, hh) location '/tmp/hudi/hudi_cow_pt_tbl'; ``` +:::note +You can also create a table partitioned by multiple fields by supplying comma-separated field names. +When creating a table partitioned by multiple fields, ensure that you specify the columns in the `PARTITIONED BY` clause +in the same order as they appear in the `CREATE TABLE` schema. For example, for the above table, the partition fields +should be specified as `PARTITIONED BY (dt, hh)`. +::: + **Create Table for an existing Hudi Table** We can create a table on an existing hudi table(created with spark-shell or deltastreamer). This is useful to diff --git a/website/versioned_docs/version-0.11.1/quick-start-guide.md b/website/versioned_docs/version-0.11.1/quick-start-guide.md index d45b535ef42..d0c32790d5a 100644 --- a/website/versioned_docs/version-0.11.1/quick-start-guide.md +++ b/website/versioned_docs/version-0.11.1/quick-start-guide.md @@ -296,6 +296,13 @@ partitioned by (dt, hh) location '/tmp/hudi/hudi_cow_pt_tbl'; ``` +:::note +You can also create a table partitioned by multiple fields by supplying comma-separated field names. +When creating a table partitioned by multiple fields, ensure that you specify the columns in the `PARTITIONED BY` clause +in the same order as they appear in the `CREATE TABLE` schema. 
For example, for the above table, the partition fields +should be specified as `PARTITIONED BY (dt, hh)`. +::: + **Create Table for an existing Hudi Table** We can create a table on an existing hudi table(created with spark-shell or deltastreamer). This is useful to diff --git a/website/versioned_docs/version-0.12.0/quick-start-guide.md b/website/versioned_docs/version-0.12.0/quick-start-guide.md index aac9a9bd048..9fc3a0414f5 100644 --- a/website/versioned_docs/version-0.12.0/quick-start-guide.md +++ b/website/versioned_docs/version-0.12.0/quick-start-guide.md @@ -322,6 +322,13 @@ partitioned by (dt, hh) location '/tmp/hudi/hudi_cow_pt_tbl'; ``` +:::note +You can also create a table partitioned by multiple fields by supplying comma-separated field names. +When creating a table partitioned by multiple fields, ensure that you specify the columns in the `PARTITIONED BY` clause +in the same order as they appear in the `CREATE TABLE` schema. For example, for the a
(hudi) branch master updated: [HUDI-7921] Fixing file system view closures in MDT (#11496)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 3789840be3d [HUDI-7921] Fixing file system view closures in MDT (#11496) 3789840be3d is described below commit 3789840be3d041cbcfc6b24786740210e4e6d6ac Author: Sivabalan Narayanan AuthorDate: Wed Jul 10 19:25:41 2024 -0700 [HUDI-7921] Fixing file system view closures in MDT (#11496) --- .../metadata/HoodieBackedTableMetadataWriter.java | 55 ++-- .../common/testutils/HoodieMetadataTestTable.java | 6 + .../java/org/apache/hudi/table/TestCleaner.java| 326 +++-- .../table/functional/TestCleanPlanExecutor.java| 325 ++-- .../hudi/testutils/HoodieCleanerTestBase.java | 31 +- .../hudi/metadata/HoodieBackedTableMetadata.java | 4 + .../hudi/metadata/HoodieTableMetadataUtil.java | 48 +-- .../hudi/common/testutils/HoodieTestTable.java | 8 +- 8 files changed, 440 insertions(+), 363 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java index 89d21e79b22..c38a68e37cf 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java @@ -1081,9 +1081,8 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableM private HoodieData getFunctionalIndexUpdates(HoodieCommitMetadata commitMetadata, String indexPartition, String instantTime) throws Exception { HoodieIndexDefinition indexDefinition = getFunctionalIndexDefinition(indexPartition); List> partitionFileSlicePairs = new ArrayList<>(); -HoodieTableFileSystemView fsView = 
HoodieTableMetadataUtil.getFileSystemView(dataMetaClient); commitMetadata.getPartitionToWriteStats().forEach((dataPartition, value) -> { - List fileSlices = getPartitionLatestFileSlicesIncludingInflight(dataMetaClient, Option.ofNullable(fsView), dataPartition); + List fileSlices = getPartitionLatestFileSlicesIncludingInflight(dataMetaClient, Option.empty(), dataPartition); fileSlices.forEach(fileSlice -> { // Filter log files for the instant time and add to this partition fileSlice pairs List logFilesForInstant = fileSlice.getLogFiles() @@ -1411,35 +1410,35 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableM HoodieData> partitionRecordsMap) { // The result set HoodieData allPartitionRecords = engineContext.emptyHoodieData(); +try (HoodieTableFileSystemView fsView = HoodieTableMetadataUtil.getFileSystemView(metadataMetaClient)) { + for (Map.Entry> entry : partitionRecordsMap.entrySet()) { +final String partitionName = HoodieIndexUtils.getPartitionNameFromPartitionType(entry.getKey(), dataMetaClient, dataWriteConfig.getIndexingConfig().getIndexName()); +HoodieData records = entry.getValue(); + +List fileSlices = + HoodieTableMetadataUtil.getPartitionLatestFileSlices(metadataMetaClient, Option.ofNullable(fsView), partitionName); +if (fileSlices.isEmpty()) { + // scheduling of INDEX only initializes the file group and not add commit + // so if there are no committed file slices, look for inflight slices + fileSlices = getPartitionLatestFileSlicesIncludingInflight(metadataMetaClient, Option.ofNullable(fsView), partitionName); +} +final int fileGroupCount = fileSlices.size(); +ValidationUtils.checkArgument(fileGroupCount > 0, String.format("FileGroup count for MDT partition %s should be >0", partitionName)); + +List finalFileSlices = fileSlices; +HoodieData rddSinglePartitionRecords = records.map(r -> { + FileSlice slice = finalFileSlices.get(HoodieTableMetadataUtil.mapRecordKeyToFileGroupIndex(r.getRecordKey(), + fileGroupCount)); + 
r.unseal(); + r.setCurrentLocation(new HoodieRecordLocation(slice.getBaseInstantTime(), slice.getFileId())); + r.seal(); + return r; +}); -HoodieTableFileSystemView fsView = HoodieTableMetadataUtil.getFileSystemView(metadataMetaClient); -for (Map.Entry> entry : partitionRecordsMap.entrySet()) { - final String partitionName = HoodieIndexUtils.getPartitionNameFromPartitionType(entry.getKey(), dataMetaClient, dataWriteConfig.getIndexingConfig().getIndexName()); - HoodieData records = entry.getValue(); - - List fileSlices = - HoodieTableMetadataUtil.getPartitionLatestFileSlices(me
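The core of the fix in this commit is moving the `HoodieTableFileSystemView` construction into a try-with-resources block so the view is closed on every code path, instead of being created and leaked. In isolation the pattern looks like this (with a stand-in `AutoCloseable`, since the real Hudi classes are not reproduced here):

```java
// Stand-in illustrating the try-with-resources pattern the fix applies:
// the view is closed even if the work inside the block throws.
// FileSystemView here is a hypothetical AutoCloseable, not Hudi's class.
class FileSystemView implements AutoCloseable {
    static boolean closed = false;

    void load() {
        // pretend to read partitions and file slices
    }

    @Override
    public void close() {
        closed = true; // record that the resource was released
    }
}

class ViewUsage {
    static void useView() {
        // Before the fix, the view was created outside any try block and
        // never closed; try-with-resources guarantees close() runs.
        try (FileSystemView fsView = new FileSystemView()) {
            fsView.load();
        }
    }
}
```

Because `close()` is invoked by the compiler-generated finally block, callers of `useView()` never have to remember to release the view themselves.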
[jira] [Created] (HUDI-7979) Fix out of the box defaults with spillable memory configs
sivabalan narayanan created HUDI-7979: - Summary: Fix out of the box defaults with spillable memory configs Key: HUDI-7979 URL: https://issues.apache.org/jira/browse/HUDI-7979 Project: Apache Hudi Issue Type: Improvement Components: reader-core, writer-core Reporter: sivabalan narayanan Looks like we are very conservative with the memory configs used for the spillable map based FSV. For example, we are only allocating 15 MB out of the box to file groups when using the spillable map based FSV. public long getMaxMemoryForFileGroupMap() { long totalMemory = getLong(SPILLABLE_MEMORY); return totalMemory - getMaxMemoryForPendingCompaction() - getMaxMemoryForBootstrapBaseFile(); } SPILLABLE_MEMORY = default is 100 MB. getMaxMemoryForPendingCompaction = 80% of 100 MB. getMaxMemoryForBootstrapBaseFile = 5% of 100 MB. So, overall, out of the box we are allocating only 15 MB for getMaxMemoryForFileGroupMap. ref: [https://github.com/apache/hudi/blob/bb0621edee97507cf2460e8cb57b5307510b917e/hudi-[…]/apache/hudi/common/table/view/FileSystemViewStorageConfig.java|https://github.com/apache/hudi/blob/bb0621edee97507cf2460e8cb57b5307510b917e/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java#L224] Wondering if we even need 80% for the pending compaction tracker in our FSV. I am thinking of making it 15%, so that we can give more memory to the actual file groups.
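The arithmetic in the description can be sanity-checked with a short sketch. Constants mirror the defaults quoted in the ticket (100 MB spillable memory, 80% pending compaction, 5% bootstrap base file); the class and method names are illustrative, not Hudi's actual `FileSystemViewStorageConfig` API.

```java
// Illustrative sketch of the out-of-the-box spillable-memory split
// described in the Jira; constants mirror the quoted defaults, not
// the actual FileSystemViewStorageConfig fields.
class SpillableMemoryDefaults {
    static final long SPILLABLE_MEMORY = 100L * 1024 * 1024;     // 100 MB default
    static final double COMPACTION_MEM_FRACTION = 0.8;           // 80%
    static final double BOOTSTRAP_BASE_FILE_MEM_FRACTION = 0.05; // 5%

    static long maxMemoryForPendingCompaction() {
        return (long) (SPILLABLE_MEMORY * COMPACTION_MEM_FRACTION);
    }

    static long maxMemoryForBootstrapBaseFile() {
        return (long) (SPILLABLE_MEMORY * BOOTSTRAP_BASE_FILE_MEM_FRACTION);
    }

    // Whatever is left over is all the file-group map gets.
    static long maxMemoryForFileGroupMap() {
        return SPILLABLE_MEMORY - maxMemoryForPendingCompaction() - maxMemoryForBootstrapBaseFile();
    }

    public static void main(String[] args) {
        // 100 MB - 80 MB - 5 MB leaves only 15 MB for the file-group map
        System.out.println(maxMemoryForFileGroupMap() / (1024 * 1024));
    }
}
```

Lowering the compaction fraction to 15%, as the ticket proposes, would leave 80 MB of the default budget for the file-group map instead of 15 MB.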
[jira] [Updated] (HUDI-7971) Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader
[ https://issues.apache.org/jira/browse/HUDI-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7971: -- Description: Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x tables Readers : 1.x # Spark SQL # Spark Datasource # Trino/Presto # Hive # Flink Writer: 0.16 Table State: * COW ** few write commits ** Pending clustering ** Completed Clustering ** Failed writes with no rollbacks ** Insert overwrite table/partition ** Savepoint for Time-travel query * MOR ** Same as COW ** Pending and completed async compaction (with log-files and no base file) ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload) ** Rollback formats - DELETE, rollback block Other knobs: # Metadata enabled/disabled # Column Stats enabled/disabled and data-skipping enabled/disabled # RLI enabled with eq/IN queries # Non-Partitioned dataset # CDC Reads # Incremental Reads # Time-travel query What to test ? # Query Results Correctness # Performance : See the benefit of # Partition Pruning # Metadata table - col stats, RLI, Corner Case Testing: # Schema Evolution with different file-groups having different generation of schema # Dynamic Partition Pruning # Does Column Projection work correctly for log files reading was: Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x tables Readers : 1.x # Spark SQL # Spark Datasource # Trino/Presto # Hive # Flink Writer: 0.16 Table State: * COW * Pending clustering * Completed Clustering * Failed writes with no rollbacks * Insert overwrite table/partition * Savepoint for Time-travel query * MOR * Same as COW * Pending and completed async compaction (with log-files and no base file) * Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload) * Rollback formats - DELETE, rollback block Other knobs: # Metadata enabled/disabled # Column Stats enabled/disabled and data-skipping enabled/disabled # RLI enabled with eq/IN queries 
# Non-Partitioned dataset # CDC Reads # Incremental Reads # Time-travel query What to test ? # Query Results Correctness # Performance : See the benefit of # Partition Pruning # Metadata table - col stats, RLI, Corner Case Testing: # Schema Evolution with different file-groups having different generation of schema # Dynamic Partition Pruning # Does Column Projection work correctly for log files reading > Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader > - > > Key: HUDI-7971 > URL: https://issues.apache.org/jira/browse/HUDI-7971 > Project: Apache Hudi > Issue Type: Sub-task > Reporter: sivabalan narayanan >Priority: Major > Fix For: 1.0.0 > > > Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x > tables > > Readers : 1.x > # Spark SQL > # Spark Datasource > # Trino/Presto > # Hive > # Flink > Writer: 0.16 > Table State: > * COW > ** few write commits > ** Pending clustering > ** Completed Clustering > ** Failed writes with no rollbacks > ** Insert overwrite table/partition > ** Savepoint for Time-travel query > * MOR > ** Same as COW > ** Pending and completed async compaction (with log-files and no base file) > ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload) > ** Rollback formats - DELETE, rollback block > Other knobs: > # Metadata enabled/disabled > # Column Stats enabled/disabled and data-skipping enabled/disabled > # RLI enabled with eq/IN queries > # Non-Partitioned dataset > # CDC Reads > # Incremental Reads > # Time-travel query > > What to test ? > # Query Results Correctness > # Performance : See the benefit of > # Partition Pruning > # Metadata table - col stats, RLI, > > Corner Case Testing: > > # Schema Evolution with different file-groups having different generation of > schema > # Dynamic Partition Pruning > # Does Column Projection work correctly for log files reading -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7866) Pull commit metadata changes in bridge release.
[ https://issues.apache.org/jira/browse/HUDI-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7866: -- Epic Link: (was: HUDI-7856) > Pull commit metadata changes in bridge release. > --- > > Key: HUDI-7866 > URL: https://issues.apache.org/jira/browse/HUDI-7866 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Sagar Sumit >Assignee: sivabalan narayanan >Priority: Major > Fix For: 0.16.0, 1.0.0 > > > In 1.0.0, we changed some commit metadata to be written in Avro. The scope of > this task is to ensure that the bridge release is able to read commit > metadata written by 1.0.0. > > The scope could be a lot more. > We could be parsing commit metadata at a lot of ad hoc places like compaction > planning, clean execution, etc. So, we need to ensure we account for both > formats (JSON and Avro) with the 0.16.0 reader since we do not know if commit > metadata is from 0.16.0 or from 1.0. >
[jira] [Assigned] (HUDI-7866) Pull commit metadata changes in bridge release.
[ https://issues.apache.org/jira/browse/HUDI-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-7866: - Assignee: Balaji Varadarajan (was: sivabalan narayanan) > Pull commit metadata changes in bridge release. > --- > > Key: HUDI-7866 > URL: https://issues.apache.org/jira/browse/HUDI-7866 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Sagar Sumit >Assignee: Balaji Varadarajan >Priority: Major > Fix For: 0.16.0, 1.0.0 > > > In 1.0.0, we changed some commit metadata to be written in Avro. The scope of > this task is to ensure that the bridge release is able to read commit > metadata written by 1.0.0. > > The scope could be a lot more. > We could be parsing commit metadata at a lot of ad hoc places like compaction > planning, clean execution, etc. So, we need to ensure we account for both > formats (JSON and Avro) with the 0.16.0 reader since we do not know if commit > metadata is from 0.16.0 or from 1.0. >
[jira] [Updated] (HUDI-7866) Pull commit metadata changes in bridge release.
[ https://issues.apache.org/jira/browse/HUDI-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7866: -- Parent: HUDI-7882 Issue Type: Sub-task (was: Task) > Pull commit metadata changes in bridge release. > --- > > Key: HUDI-7866 > URL: https://issues.apache.org/jira/browse/HUDI-7866 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Sagar Sumit >Assignee: sivabalan narayanan >Priority: Major > Fix For: 0.16.0, 1.0.0 > > > In 1.0.0, we changed some commit metadata to be written in Avro. The scope of > this task is to ensure that the bridge release is able to read commit > metadata written by 1.0.0. > > The scope could be a lot more. > We could be parsing commit metadata at a lot of ad hoc places like compaction > planning, clean execution, etc. So, we need to ensure we account for both > formats (JSON and Avro) with the 0.16.0 reader since we do not know if commit > metadata is from 0.16.0 or from 1.0. >
[jira] [Updated] (HUDI-7973) Add table property to track list of columns being indexed in col stats
[ https://issues.apache.org/jira/browse/HUDI-7973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7973: -- Fix Version/s: 1.0.0 > Add table property to track list of columns being indexed in col stats > --- > > Key: HUDI-7973 > URL: https://issues.apache.org/jira/browse/HUDI-7973 > Project: Apache Hudi > Issue Type: Sub-task > Components: metadata > Reporter: sivabalan narayanan > Priority: Major > Fix For: 1.0.0 > > > We need to add a new table property to track which columns are being indexed in col stats. > If not a table property, this could live in an aux folder or elsewhere, but we need to store this state somewhere. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7973) Add table property to track list of columns being indexed in col stats
sivabalan narayanan created HUDI-7973: - Summary: Add table property to track list of columns being indexed in col stats Key: HUDI-7973 URL: https://issues.apache.org/jira/browse/HUDI-7973 Project: Apache Hudi Issue Type: Improvement Components: metadata Reporter: sivabalan narayanan We need to add a new table property to track which columns are being indexed in col stats. If not a table property, this could live in an aux folder or elsewhere, but we need to store this state somewhere. -- This message was sent by Atlassian Jira (v8.20.10#820010)
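HUDI-7973 above boils down to persisting a list of indexed column names as a single table-property value and reading it back. A rough sketch of such a round trip; the property key and helper names are invented for illustration, not the names the ticket will actually choose:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

// Sketch: store the set of col-stats-indexed columns as one table property.
// The key below is hypothetical; the real name would be decided in HUDI-7973.
public class IndexedColumnsProperty {

    static final String KEY = "hoodie.table.metadata.col.stats.columns";

    // Serialize the column list to a comma-separated property value.
    static String serialize(List<String> cols) {
        return String.join(",", cols);
    }

    // Parse the property value back into a column list.
    static List<String> deserialize(String value) {
        if (value == null || value.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(value.split(","));
    }

    public static void main(String[] args) {
        Properties tableProps = new Properties();
        tableProps.setProperty(KEY, serialize(Arrays.asList("rider", "driver", "fare")));
        System.out.println(deserialize(tableProps.getProperty(KEY)));
    }
}
```

Keeping this state in `hoodie.properties` (rather than an aux folder) has the advantage that readers already load that file, so no extra listing call is needed to learn which columns are indexed.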
[jira] [Updated] (HUDI-7973) Add table property to track list of columns being indexed in col stats
[ https://issues.apache.org/jira/browse/HUDI-7973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7973: -- Epic Link: (was: HUDI-7856) > Add table property to track list of columns being indexed in col stats > --- > > Key: HUDI-7973 > URL: https://issues.apache.org/jira/browse/HUDI-7973 > Project: Apache Hudi > Issue Type: Sub-task > Components: metadata > Reporter: sivabalan narayanan > Priority: Major > > We need to add a new table property to track which columns are being indexed in col stats. > If not a table property, this could live in an aux folder or elsewhere, but we need to store this state somewhere. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7973) Add table property to track list of columns being indexed in col stats
[ https://issues.apache.org/jira/browse/HUDI-7973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7973: -- Epic Link: HUDI-7856 > Add table property to track list of columns being indexed in col stats > --- > > Key: HUDI-7973 > URL: https://issues.apache.org/jira/browse/HUDI-7973 > Project: Apache Hudi > Issue Type: Improvement > Components: metadata > Reporter: sivabalan narayanan > Priority: Major > > We need to add a new table property to track which columns are being indexed in col stats. > If not a table property, this could live in an aux folder or elsewhere, but we need to store this state somewhere. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7973) Add table property to track list of columns being indexed in col stats
[ https://issues.apache.org/jira/browse/HUDI-7973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7973: -- Parent: HUDI-7882 Issue Type: Sub-task (was: Improvement) > Add table property to track list of columns being indexed in col stats > --- > > Key: HUDI-7973 > URL: https://issues.apache.org/jira/browse/HUDI-7973 > Project: Apache Hudi > Issue Type: Sub-task > Components: metadata > Reporter: sivabalan narayanan > Priority: Major > > We need to add a new table property to track which columns are being indexed in col stats. > If not a table property, this could live in an aux folder or elsewhere, but we need to store this state somewhere. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7888) Throw meaningful error when reading partial update or DV written in 1.x from 0.16.0 reader
[ https://issues.apache.org/jira/browse/HUDI-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7888: -- Fix Version/s: 1.0.0 > Throw meaningful error when reading partial update or DV written in 1.x from > 0.16.0 reader > -- > > Key: HUDI-7888 > URL: https://issues.apache.org/jira/browse/HUDI-7888 > Project: Apache Hudi > Issue Type: Sub-task > Components: reader-core > Reporter: sivabalan narayanan > Assignee: Jonathan Vexler > Priority: Major > Fix For: 1.0.0 > > > If the 0.16.x reader is used to read a 1.x table with partial updates/merges enabled, we need to throw a meaningful error to the end user. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7888) Throw meaningful error when reading partial update or DV written in 1.x from 0.16.0 reader
[ https://issues.apache.org/jira/browse/HUDI-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7888: -- Description: If the 0.16.x reader is used to read a 1.x table with partial updates/merges enabled, we need to throw a meaningful error to the end user. was: We wanted to support reading 1.x tables with the 0.16.0 reader. If a 1.x table does not have any backwards-incompatible new features enabled, we are good. If someone has enabled the partial update feature or deletion vector support, we should detect this and throw a meaningful error from the 0.16.0 reader. Let's also comb for any other additional 1.x features and throw meaningful errors for those. > Throw meaningful error when reading partial update or DV written in 1.x from > 0.16.0 reader > -- > > Key: HUDI-7888 > URL: https://issues.apache.org/jira/browse/HUDI-7888 > Project: Apache Hudi > Issue Type: Sub-task > Components: reader-core > Reporter: sivabalan narayanan > Assignee: Jonathan Vexler > Priority: Major > > > If the 0.16.x reader is used to read a 1.x table with partial updates/merges enabled, we need to throw a meaningful error to the end user. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
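The "meaningful error" work in HUDI-7888 amounts to checking the table's configuration for features the 0.16.x reader cannot honor and failing fast with a clear message instead of returning wrong results. A minimal sketch, assuming hypothetical property names for the 1.x feature flags (the real keys may differ):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

// Sketch: fail fast when a 1.x table uses features a 0.16.x reader cannot
// support. The property keys below are illustrative placeholders.
public class IncompatibleFeatureCheck {

    static final Set<String> UNSUPPORTED_IN_016 = new HashSet<>(Arrays.asList(
        "hoodie.table.partial.update.enabled",
        "hoodie.table.deletion.vector.enabled"));

    static void validateReadable(Properties tableProps) {
        for (String key : UNSUPPORTED_IN_016) {
            if (Boolean.parseBoolean(tableProps.getProperty(key, "false"))) {
                // Name the feature and the remedy, rather than failing deep
                // inside the merge path with an opaque exception.
                throw new IllegalStateException(
                    "Table enables '" + key + "', a Hudi 1.x feature that the "
                    + "0.16.x reader does not support; use a 1.x reader for this table.");
            }
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("hoodie.table.partial.update.enabled", "true");
        try {
            validateReadable(props);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Running the check once, when the table config is first loaded, keeps the error at the table boundary rather than surfacing mid-query.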
[jira] [Updated] (HUDI-7972) Add fallback for deletion vector in 0.16.x reader while reading 1.x tables
[ https://issues.apache.org/jira/browse/HUDI-7972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7972: -- Parent: HUDI-7882 Issue Type: Sub-task (was: Improvement) > Add fallback for deletion vector in 0.16.x reader while reading 1.x tables > -- > > Key: HUDI-7972 > URL: https://issues.apache.org/jira/browse/HUDI-7972 > Project: Apache Hudi > Issue Type: Sub-task > Components: reader-core > Reporter: sivabalan narayanan > Priority: Major > Labels: 1.0-migration > Fix For: 1.0.0 > > > If the 0.16.x reader is used to read a 1.x table with deletion vectors, we should fall back to using key-based merges instead of position-based merges. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7972) Add fallback for deletion vector in 0.16.x reader while reading 1.x tables
sivabalan narayanan created HUDI-7972: - Summary: Add fallback for deletion vector in 0.16.x reader while reading 1.x tables Key: HUDI-7972 URL: https://issues.apache.org/jira/browse/HUDI-7972 Project: Apache Hudi Issue Type: Improvement Components: reader-core Reporter: sivabalan narayanan If the 0.16.x reader is used to read a 1.x table with deletion vectors, we should fall back to using key-based merges instead of position-based merges. -- This message was sent by Atlassian Jira (v8.20.10#820010)
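The fallback described in HUDI-7972 is a small piece of decision logic: a reader that cannot interpret 1.x position-based deletion vectors should degrade to key-based merging rather than fail. A sketch of that choice, with illustrative names rather than Hudi's actual reader API:

```java
// Sketch: pick a merge strategy based on reader capability. A 0.16.x reader
// (major version 0) cannot honor 1.x deletion vectors positionally, so it
// falls back to key-based merging. Names are illustrative, not Hudi's API.
public class MergeStrategyChooser {

    public enum MergeStrategy { POSITION_BASED, KEY_BASED }

    public static MergeStrategy choose(boolean deletionVectorPresent, int readerMajorVersion) {
        if (deletionVectorPresent && readerMajorVersion < 1) {
            // Bridge-release fallback: correctness over speed.
            return MergeStrategy.KEY_BASED;
        }
        // 1.x readers can exploit record positions when DVs are present;
        // without DVs, key-based merging is the baseline either way.
        return deletionVectorPresent ? MergeStrategy.POSITION_BASED : MergeStrategy.KEY_BASED;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, 0));  // 0.16.x reader, DV present
        System.out.println(choose(true, 1));  // 1.x reader, DV present
    }
}
```

Key-based merging is slower than position-based merging (it needs the record key for every row), but it produces the same result, which is what makes it a safe fallback for the bridge release.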
[jira] [Updated] (HUDI-7865) Pull table properties changes in bridge release
[ https://issues.apache.org/jira/browse/HUDI-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7865: -- Parent: HUDI-7882 Issue Type: Sub-task (was: Task) > Pull table properties changes in bridge release > --- > > Key: HUDI-7865 > URL: https://issues.apache.org/jira/browse/HUDI-7865 > Project: Apache Hudi > Issue Type: Sub-task > Reporter: Sagar Sumit > Assignee: Balaji Varadarajan > Priority: Major > Fix For: 0.16.0, 1.0.0 > > > In 1.0.0, we changed some table properties to use enums as values instead of class names and then added infer functions. The scope of this task is to ensure that the bridge release is able to read hoodie.properties written by 1.0.0. > a. Payload enum change reference - > [https://github.com/apache/hudi/pull/9590/files] > b. hoodie.record.merge.mode : ref links : #9894, #11439. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7888) Throw meaningful error when reading partial update or DV written in 1.x from 0.16.0 reader
[ https://issues.apache.org/jira/browse/HUDI-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7888: -- Epic Link: (was: HUDI-7856) > Throw meaningful error when reading partial update or DV written in 1.x from > 0.16.0 reader > -- > > Key: HUDI-7888 > URL: https://issues.apache.org/jira/browse/HUDI-7888 > Project: Apache Hudi > Issue Type: Sub-task > Components: reader-core > Reporter: sivabalan narayanan > Assignee: Jonathan Vexler > Priority: Major > > > We wanted to support reading 1.x tables with the 0.16.0 reader. > > If a 1.x table does not have any backwards-incompatible new features enabled, we are good. If someone has enabled the partial update feature or deletion vector support, we should detect this and throw a meaningful error from the 0.16.0 reader. Let's also comb for any other additional 1.x features and throw meaningful errors for those. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7865) Pull table properties changes in bridge release
[ https://issues.apache.org/jira/browse/HUDI-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7865: -- Epic Link: (was: HUDI-7856) > Pull table properties changes in bridge release > --- > > Key: HUDI-7865 > URL: https://issues.apache.org/jira/browse/HUDI-7865 > Project: Apache Hudi > Issue Type: Sub-task > Reporter: Sagar Sumit > Assignee: Balaji Varadarajan > Priority: Major > Fix For: 0.16.0, 1.0.0 > > > In 1.0.0, we changed some table properties to use enums as values instead of class names and then added infer functions. The scope of this task is to ensure that the bridge release is able to read hoodie.properties written by 1.0.0. > a. Payload enum change reference - > [https://github.com/apache/hudi/pull/9590/files] > b. hoodie.record.merge.mode : ref links : #9894, #11439. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7888) Throw meaningful error when reading partial update or DV written in 1.x from 0.16.0 reader
[ https://issues.apache.org/jira/browse/HUDI-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7888: -- Parent: HUDI-7882 Issue Type: Sub-task (was: Improvement) > Throw meaningful error when reading partial update or DV written in 1.x from > 0.16.0 reader > -- > > Key: HUDI-7888 > URL: https://issues.apache.org/jira/browse/HUDI-7888 > Project: Apache Hudi > Issue Type: Sub-task > Components: reader-core > Reporter: sivabalan narayanan > Assignee: Jonathan Vexler > Priority: Major > > > We wanted to support reading 1.x tables with the 0.16.0 reader. > > If a 1.x table does not have any backwards-incompatible new features enabled, we are good. If someone has enabled the partial update feature or deletion vector support, we should detect this and throw a meaningful error from the 0.16.0 reader. Let's also comb for any other additional 1.x features and throw meaningful errors for those. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7972) Add fallback for deletion vector in 0.16.x reader while reading 1.x tables
[ https://issues.apache.org/jira/browse/HUDI-7972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7972: -- Fix Version/s: 1.0.0 > Add fallback for deletion vector in 0.16.x reader while reading 1.x tables > -- > > Key: HUDI-7972 > URL: https://issues.apache.org/jira/browse/HUDI-7972 > Project: Apache Hudi > Issue Type: Improvement > Components: reader-core > Reporter: sivabalan narayanan > Priority: Major > Labels: 1.0-migration > Fix For: 1.0.0 > > > If the 0.16.x reader is used to read a 1.x table with deletion vectors, we should fall back to using key-based merges instead of position-based merges. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7972) Add fallback for deletion vector in 0.16.x reader while reading 1.x tables
[ https://issues.apache.org/jira/browse/HUDI-7972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7972: -- Labels: 1.0-migration (was: ) > Add fallback for deletion vector in 0.16.x reader while reading 1.x tables > -- > > Key: HUDI-7972 > URL: https://issues.apache.org/jira/browse/HUDI-7972 > Project: Apache Hudi > Issue Type: Improvement > Components: reader-core > Reporter: sivabalan narayanan > Priority: Major > Labels: 1.0-migration > > If the 0.16.x reader is used to read a 1.x table with deletion vectors, we should fall back to using key-based merges instead of position-based merges. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7887) Any log format header types changes need to be ported to 0.16.0 from 1.x
[ https://issues.apache.org/jira/browse/HUDI-7887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7887: -- Epic Link: (was: HUDI-7856) > Any log format header types changes need to be ported to 0.16.0 from 1.x > > > Key: HUDI-7887 > URL: https://issues.apache.org/jira/browse/HUDI-7887 > Project: Apache Hudi > Issue Type: Sub-task > Components: reader-core > Reporter: sivabalan narayanan > Assignee: Jonathan Vexler > Priority: Major > > We wanted to support reading 1.x tables with the 0.16.0 reader. > > Port any new log header metadata types introduced in 1.x to 0.16.0. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7887) Any log format header types changes need to be ported to 0.16.0 from 1.x
[ https://issues.apache.org/jira/browse/HUDI-7887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7887: -- Parent: HUDI-7882 Issue Type: Sub-task (was: Improvement) > Any log format header types changes need to be ported to 0.16.0 from 1.x > > > Key: HUDI-7887 > URL: https://issues.apache.org/jira/browse/HUDI-7887 > Project: Apache Hudi > Issue Type: Sub-task > Components: reader-core > Reporter: sivabalan narayanan > Assignee: Jonathan Vexler > Priority: Major > > We wanted to support reading 1.x tables with the 0.16.0 reader. > > Port any new log header metadata types introduced in 1.x to 0.16.0. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7886) Make metadata payload from 1.x readable in 0.16.0
[ https://issues.apache.org/jira/browse/HUDI-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7886: -- Parent: HUDI-7882 Issue Type: Sub-task (was: Improvement) > Make metadata payload from 1.x readable in 0.16.0 > - > > Key: HUDI-7886 > URL: https://issues.apache.org/jira/browse/HUDI-7886 > Project: Apache Hudi > Issue Type: Sub-task > Components: metadata > Reporter: sivabalan narayanan > Assignee: Lokesh Jain > Priority: Major > > We wanted to support reading 1.x tables with the 0.16.0 reader. > > So, let's port over all metadata payload schema changes to 0.16.0. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7886) Make metadata payload from 1.x readable in 0.16.0
[ https://issues.apache.org/jira/browse/HUDI-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7886: -- Epic Link: (was: HUDI-7856) > Make metadata payload from 1.x readable in 0.16.0 > - > > Key: HUDI-7886 > URL: https://issues.apache.org/jira/browse/HUDI-7886 > Project: Apache Hudi > Issue Type: Sub-task > Components: metadata > Reporter: sivabalan narayanan > Assignee: Lokesh Jain > Priority: Major > > We wanted to support reading 1.x tables with the 0.16.0 reader. > > So, let's port over all metadata payload schema changes to 0.16.0. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7971) Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader
[ https://issues.apache.org/jira/browse/HUDI-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7971: -- Epic Link: (was: HUDI-7856) > Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader > - > > Key: HUDI-7971 > URL: https://issues.apache.org/jira/browse/HUDI-7971 > Project: Apache Hudi > Issue Type: Sub-task > Reporter: sivabalan narayanan > Priority: Major > Fix For: 1.0.0 > > > Let's ensure the 1.x reader is fully compatible with reading any of the 0.14.x to 0.16.x tables > > Readers: 1.x > # Spark SQL > # Spark Datasource > # Trino/Presto > # Hive > # Flink > Writer: 0.16 > Table State: > * COW > * Pending clustering > * Completed clustering > * Failed writes with no rollbacks > * Insert overwrite table/partition > * Savepoint for time-travel query > * MOR > * Same as COW > * Pending and completed async compaction (with log files and no base file) > * Custom payloads (for MOR snapshot queries) (e.g. SQL Expression Payload) > * Rollback formats - DELETE, rollback block > Other knobs: > # Metadata enabled/disabled > # Column stats enabled/disabled and data-skipping enabled/disabled > # RLI enabled with eq/IN queries > # Non-partitioned dataset > # CDC reads > # Incremental reads > # Time-travel query > > What to test? > # Query results correctness > # Performance: see the benefit of > # Partition pruning > # Metadata table - col stats, RLI > > Corner case testing: > # Schema evolution with different file groups having different generations of the schema > # Dynamic partition pruning > # Does column projection work correctly for log file reading -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7971) Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader
[ https://issues.apache.org/jira/browse/HUDI-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7971: -- Parent: HUDI-7882 Issue Type: Sub-task (was: Test) > Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader > - > > Key: HUDI-7971 > URL: https://issues.apache.org/jira/browse/HUDI-7971 > Project: Apache Hudi > Issue Type: Sub-task > Reporter: sivabalan narayanan > Priority: Major > Fix For: 1.0.0 > > > Let's ensure the 1.x reader is fully compatible with reading any of the 0.14.x to 0.16.x tables > > Readers: 1.x > # Spark SQL > # Spark Datasource > # Trino/Presto > # Hive > # Flink > Writer: 0.16 > Table State: > * COW > * Pending clustering > * Completed clustering > * Failed writes with no rollbacks > * Insert overwrite table/partition > * Savepoint for time-travel query > * MOR > * Same as COW > * Pending and completed async compaction (with log files and no base file) > * Custom payloads (for MOR snapshot queries) (e.g. SQL Expression Payload) > * Rollback formats - DELETE, rollback block > Other knobs: > # Metadata enabled/disabled > # Column stats enabled/disabled and data-skipping enabled/disabled > # RLI enabled with eq/IN queries > # Non-partitioned dataset > # CDC reads > # Incremental reads > # Time-travel query > > What to test? > # Query results correctness > # Performance: see the benefit of > # Partition pruning > # Metadata table - col stats, RLI > > Corner case testing: > # Schema evolution with different file groups having different generations of the schema > # Dynamic partition pruning > # Does column projection work correctly for log file reading -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7971) Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader
sivabalan narayanan created HUDI-7971: - Summary: Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader Key: HUDI-7971 URL: https://issues.apache.org/jira/browse/HUDI-7971 Project: Apache Hudi Issue Type: Test Reporter: sivabalan narayanan Let's ensure the 1.x reader is fully compatible with reading any of the 0.14.x to 0.16.x tables. Readers: 1.x # Spark SQL # Spark Datasource # Trino/Presto # Hive # Flink Writer: 0.16 Table State: * COW * Pending clustering * Completed clustering * Failed writes with no rollbacks * Insert overwrite table/partition * Savepoint for time-travel query * MOR * Same as COW * Pending and completed async compaction (with log files and no base file) * Custom payloads (for MOR snapshot queries) (e.g. SQL Expression Payload) * Rollback formats - DELETE, rollback block Other knobs: # Metadata enabled/disabled # Column stats enabled/disabled and data-skipping enabled/disabled # RLI enabled with eq/IN queries # Non-partitioned dataset # CDC reads # Incremental reads # Time-travel query What to test? # Query results correctness # Performance: see the benefit of # Partition pruning # Metadata table - col stats, RLI Corner case testing: # Schema evolution with different file groups having different generations of the schema # Dynamic partition pruning # Does column projection work correctly for log file reading -- This message was sent by Atlassian Jira (v8.20.10#820010)
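The certification plan above is a cross product of readers, table states, and knobs, and it is easy to silently skip combinations when writing such a suite by hand. A small sketch of enumerating the matrix so each combination becomes one named test case; the dimension values below are a subset of the ticket's list and the helper is illustrative, not Hudi test code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: enumerate the compatibility-test matrix (readers x table types x
// knobs) so every combination gets a named test case. Illustrative only.
public class CompatibilityMatrix {

    public static List<String> enumerate(List<String> readers,
                                         List<String> tableTypes,
                                         List<String> knobs) {
        List<String> cases = new ArrayList<>();
        for (String reader : readers) {
            for (String tableType : tableTypes) {
                for (String knob : knobs) {
                    cases.add(reader + "/" + tableType + "/" + knob);
                }
            }
        }
        return cases;
    }

    public static void main(String[] args) {
        List<String> cases = enumerate(
            Arrays.asList("SparkSQL", "Trino/Presto", "Hive", "Flink"),
            Arrays.asList("COW", "MOR"),
            Arrays.asList("metadata=on", "metadata=off"));
        System.out.println(cases.size()); // 4 readers x 2 types x 2 knobs = 16
    }
}
```

Feeding such a list into a parameterized test runner makes the coverage of the matrix auditable: the case count must equal the product of the dimension sizes.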
[jira] [Updated] (HUDI-7971) Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader
[ https://issues.apache.org/jira/browse/HUDI-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7971: -- Fix Version/s: 1.0.0 > Test and Certify 0.14.x to 0.16.x tables are readable in 1.x Hudi reader > - > > Key: HUDI-7971 > URL: https://issues.apache.org/jira/browse/HUDI-7971 > Project: Apache Hudi > Issue Type: Test > Reporter: sivabalan narayanan > Priority: Major > Fix For: 1.0.0 > > > Let's ensure the 1.x reader is fully compatible with reading any of the 0.14.x to 0.16.x tables > > Readers: 1.x > # Spark SQL > # Spark Datasource > # Trino/Presto > # Hive > # Flink > Writer: 0.16 > Table State: > * COW > * Pending clustering > * Completed clustering > * Failed writes with no rollbacks > * Insert overwrite table/partition > * Savepoint for time-travel query > * MOR > * Same as COW > * Pending and completed async compaction (with log files and no base file) > * Custom payloads (for MOR snapshot queries) (e.g. SQL Expression Payload) > * Rollback formats - DELETE, rollback block > Other knobs: > # Metadata enabled/disabled > # Column stats enabled/disabled and data-skipping enabled/disabled > # RLI enabled with eq/IN queries > # Non-partitioned dataset > # CDC reads > # Incremental reads > # Time-travel query > > What to test? > # Query results correctness > # Performance: see the benefit of > # Partition pruning > # Metadata table - col stats, RLI > > Corner case testing: > # Schema evolution with different file groups having different generations of the schema > # Dynamic partition pruning > # Does column projection work correctly for log file reading -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7882) Umbrella ticket for 1.x tables and 0.16.x compatibility
[ https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7882: -- Description: We have 4 major goals with this umbrella ticket: a. The 1.x reader should be capable of reading any of the 0.14.x to 0.16.x tables for all query types. b. 0.16.x should be capable of reading 1.x tables for most features. c. Upgrade 0.16.x to 1.x. d. Downgrade 1.x to 0.16.0. We wanted to support reading 1.x tables in the 0.16.0 release, so this umbrella ticket tracks all of the required work. RFC in progress: [https://github.com/apache/hudi/pull/11514] Changes required to be ported: 0. Creating 0.16.0 branch 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 1. Timeline 1.a Hoodie instant parsing should be able to read 1.x instants. https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 1.b Commit metadata parsing is able to handle both JSON and Avro formats. Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 Siva. 1.c HoodieDefaultTimeline able to read both timelines based on table version. https://issues.apache.org/jira/browse/HUDI-7884 Siva. 1.d Reading LSM timeline using 0.16.0 https://issues.apache.org/jira/browse/HUDI-7890 Siva. 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901 2. Table property changes 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885 https://issues.apache.org/jira/browse/HUDI-7865 LJ 3. MDT table changes 3.a Record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ 3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 LJ 4. Log format changes 4.a All metadata header types porting https://issues.apache.org/jira/browse/HUDI-7887 Jon 4.b Meaningful error for incompatible features from 1.x https://issues.apache.org/jira/browse/HUDI-7888 Jon 5. Log file slice or grouping detection compatibility 6. Tests 6.a Tests to validate that 1.x tables can be read with 0.16.0 https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 7. Doc changes 7.a Call out unsupported features in the 0.16.0 reader when reading 1.x tables. https://issues.apache.org/jira/browse/HUDI-7889 was: We wanted to support reading 1.x tables in the 0.16.0 release, so this umbrella ticket tracks all of the required work. RFC in progress: [https://github.com/apache/hudi/pull/11514] Changes required to be ported: 0. Creating 0.16.0 branch 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 1. Timeline 1.a Hoodie instant parsing should be able to read 1.x instants. https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 1.b Commit metadata parsing is able to handle both JSON and Avro formats. Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 Siva. 1.c HoodieDefaultTimeline able to read both timelines based on table version. https://issues.apache.org/jira/browse/HUDI-7884 Siva. 1.d Reading LSM timeline using 0.16.0 https://issues.apache.org/jira/browse/HUDI-7890 Siva. 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901 2. Table property changes 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885 https://issues.apache.org/jira/browse/HUDI-7865 LJ 3. MDT table changes 3.a Record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ 3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 LJ 4. Log format changes 4.a All metadata header types porting https://issues.apache.org/jira/browse/HUDI-7887 Jon 4.b Meaningful error for incompatible features from 1.x https://issues.apache.org/jira/browse/HUDI-7888 Jon 5. Log file slice or grouping detection compatibility 6. Tests 6.a Tests to validate that 1.x tables can be read with 0.16.0 https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 7. Doc changes 7.a Call out unsupported features in the 0.16.0 reader when reading 1.x tables. https://issues.apache.org/jira/browse/HUDI-7889 > Umbrella ticket for 1.x tables and 0.16.x compatibility > --- > > Key: HUDI-7882 > URL: https://issues.apache.org/jira/browse/HUDI-7882 > Project: Apache Hudi > Issue Type: Improvement > Components: reader-core > Reporter: sivabalan narayanan > Assignee: sivabalan narayanan > Priority: Major > Labels: pull-request-available > Fix For: 0.16.0, 1.0.0 > > > We have 4 major goals with this umbrella ticket: > a. The 1.x reader should be capable of reading any of the 0.14.x to 0.16.x tables for all query types. > b. 0.16.x should be capable of reading 1.x tables for most features. > c. Upgrade 0.16.x to 1.x > d. Downgrade 1.x to 0.16.0.
[jira] [Updated] (HUDI-7882) Umbrella ticket for 1.x tables and 0.16.x compatibility
[ https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7882: -- Summary: Umbrella ticket for 1.x tables and 0.16.x compatibility (was: Umbrella ticket 1.x tables and 0.16.x compatibility) > Umbrella ticket for 1.x tables and 0.16.x compatibility > --- > > Key: HUDI-7882 > URL: https://issues.apache.org/jira/browse/HUDI-7882 > Project: Apache Hudi > Issue Type: Improvement > Components: reader-core > Reporter: sivabalan narayanan > Assignee: sivabalan narayanan > Priority: Major > Labels: pull-request-available > Fix For: 0.16.0, 1.0.0 > > > We wanted to support reading 1.x tables in the 0.16.0 release, so this umbrella ticket tracks all of the required work. > > RFC in progress: [https://github.com/apache/hudi/pull/11514] > > Changes required to be ported: > 0. Creating 0.16.0 branch > 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. > > 1. Timeline > 1.a Hoodie instant parsing should be able to read 1.x instants. > https://issues.apache.org/jira/browse/HUDI-7883 Sagar. > 1.b Commit metadata parsing is able to handle both JSON and Avro formats. > Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 > Siva. > 1.c HoodieDefaultTimeline able to read both timelines based on table version. > https://issues.apache.org/jira/browse/HUDI-7884 Siva. > 1.d Reading LSM timeline using 0.16.0 > https://issues.apache.org/jira/browse/HUDI-7890 Siva. > 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901 > > 2. Table property changes > 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885 > https://issues.apache.org/jira/browse/HUDI-7865 LJ > > 3. MDT table changes > 3.a Record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ > 3.b MDT payload schema changes. > https://issues.apache.org/jira/browse/HUDI-7886 LJ > > 4. Log format changes > 4.a All metadata header types porting > https://issues.apache.org/jira/browse/HUDI-7887 Jon > 4.b Meaningful error for incompatible features from 1.x > https://issues.apache.org/jira/browse/HUDI-7888 Jon > > 5. Log file slice or grouping detection compatibility > > 6. Tests > 6.a Tests to validate that 1.x tables can be read with 0.16.0 > https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. > > 7. Doc changes > 7.a Call out unsupported features in the 0.16.0 reader when reading 1.x tables. > https://issues.apache.org/jira/browse/HUDI-7889 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7882) Umbrella ticket 1.x tables and 0.16.x compatibility
[ https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7882: -- Summary: Umbrella ticket 1.x tables and 0.16.x compatibility (was: Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0) > Umbrella ticket 1.x tables and 0.16.x compatibility > --- > > Key: HUDI-7882 > URL: https://issues.apache.org/jira/browse/HUDI-7882 > Project: Apache Hudi > Issue Type: Improvement > Components: reader-core > Reporter: sivabalan narayanan > Assignee: sivabalan narayanan > Priority: Major > Labels: pull-request-available > Fix For: 0.16.0, 1.0.0 > > > We wanted to support reading 1.x tables in the 0.16.0 release, so this umbrella ticket tracks all of the required work. > > RFC in progress: [https://github.com/apache/hudi/pull/11514] > > Changes required to be ported: > 0. Creating 0.16.0 branch > 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. > > 1. Timeline > 1.a Hoodie instant parsing should be able to read 1.x instants. > https://issues.apache.org/jira/browse/HUDI-7883 Sagar. > 1.b Commit metadata parsing is able to handle both JSON and Avro formats. > Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 > Siva. > 1.c HoodieDefaultTimeline able to read both timelines based on table version. > https://issues.apache.org/jira/browse/HUDI-7884 Siva. > 1.d Reading LSM timeline using 0.16.0 > https://issues.apache.org/jira/browse/HUDI-7890 Siva. > 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901 > > 2. Table property changes > 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885 > https://issues.apache.org/jira/browse/HUDI-7865 LJ > > 3. MDT table changes > 3.a Record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ > 3.b MDT payload schema changes. > https://issues.apache.org/jira/browse/HUDI-7886 LJ > > 4. Log format changes > 4.a All metadata header types porting > https://issues.apache.org/jira/browse/HUDI-7887 Jon > 4.b Meaningful error for incompatible features from 1.x > https://issues.apache.org/jira/browse/HUDI-7888 Jon > > 5. Log file slice or grouping detection compatibility > > 6. Tests > 6.a Tests to validate that 1.x tables can be read with 0.16.0 > https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. > > 7. Doc changes > 7.a Call out unsupported features in the 0.16.0 reader when reading 1.x tables. > https://issues.apache.org/jira/browse/HUDI-7889 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7865) Pull table properties changes in bridge release
[ https://issues.apache.org/jira/browse/HUDI-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-7865: - Assignee: Balaji Varadarajan (was: Lokesh Jain) > Pull table properties changes in bridge release > --- > > Key: HUDI-7865 > URL: https://issues.apache.org/jira/browse/HUDI-7865 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Balaji Varadarajan >Priority: Major > Fix For: 0.16.0, 1.0.0 > > > In 1.0.0, we changed some table properties to have enums as values instead of > classnames and then added infer functions. The scope of this task is to > ensure that the bridge release can read hoodie.properties written > by 1.0.0. > a. Payload enum change reference - > [https://github.com/apache/hudi/pull/9590/files] > b. hoodie.record.merge.mode : ref links : #9894, #11439. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7865) Pull table properties changes in bridge release
[ https://issues.apache.org/jira/browse/HUDI-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7865: -- Description: In 1.0.0, we changed some table properties to have enums as values instead of classnames and then added infer functions. The scope of this task is to ensure that the bridge release can read hoodie.properties written by 1.0.0. a. Changes made to table properties in 1.0.0 - [https://github.com/apache/hudi/pull/9590/files] b. hoodie.record.merge.mode : ref links : #9894, #11439. Let's also check for any new table property changes in 1.x compared to 0.16.x, and discuss w/ the team on how to go about it. was: In 1.0.0, we changed some table properties to have enums as values instead of classnames and then added infer functions. The scope of this task is to ensure that the bridge release can read hoodie.properties written by 1.0.0. Changes made to table properties in 1.0.0 - [https://github.com/apache/hudi/pull/9590/files] Let's also check for any new table property changes in 1.x compared to 0.16.x, and discuss w/ the team on how to go about it. > Pull table properties changes in bridge release > --- > > Key: HUDI-7865 > URL: https://issues.apache.org/jira/browse/HUDI-7865 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Lokesh Jain >Priority: Major > Fix For: 0.16.0, 1.0.0 > > > In 1.0.0, we changed some table properties to have enums as values instead of > classnames and then added infer functions. The scope of this task is to > ensure that the bridge release can read hoodie.properties written > by 1.0.0. > a. Changes made to table properties in 1.0.0 - > [https://github.com/apache/hudi/pull/9590/files] > b. hoodie.record.merge.mode : ref links : #9894, #11439. > > Let's also check for any new table property changes in 1.x compared to 0.16.x, > and discuss w/ the team on how to go about it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7865) Pull table properties changes in bridge release
[ https://issues.apache.org/jira/browse/HUDI-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7865: -- Description: In 1.0.0, we changed some table properties to have enums as values instead of classnames and then added infer functions. The scope of this task is to ensure that the bridge release can read hoodie.properties written by 1.0.0. a. Payload enum change reference - [https://github.com/apache/hudi/pull/9590/files] b. hoodie.record.merge.mode : ref links : #9894, #11439. was: In 1.0.0, we changed some table properties to have enums as values instead of classnames and then added infer functions. The scope of this task is to ensure that the bridge release can read hoodie.properties written by 1.0.0. a. Changes made to table properties in 1.0.0 - [https://github.com/apache/hudi/pull/9590/files] b. hoodie.record.merge.mode : ref links : #9894, #11439. Let's also check for any new table property changes in 1.x compared to 0.16.x, and discuss w/ the team on how to go about it. > Pull table properties changes in bridge release > --- > > Key: HUDI-7865 > URL: https://issues.apache.org/jira/browse/HUDI-7865 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Lokesh Jain >Priority: Major > Fix For: 0.16.0, 1.0.0 > > > In 1.0.0, we changed some table properties to have enums as values instead of > classnames and then added infer functions. The scope of this task is to > ensure that the bridge release can read hoodie.properties written > by 1.0.0. > a. Payload enum change reference - > [https://github.com/apache/hudi/pull/9590/files] > b. hoodie.record.merge.mode : ref links : #9894, #11439. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7968) RFC for robust handling of spark task failures and retries
[ https://issues.apache.org/jira/browse/HUDI-7968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7968: -- Status: In Progress (was: Open) > RFC for robust handling of spark task failures and retries > -- > > Key: HUDI-7968 > URL: https://issues.apache.org/jira/browse/HUDI-7968 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core >Reporter: sivabalan narayanan > Assignee: sivabalan narayanan >Priority: Major > Labels: RobustWrites, pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services
[ https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7507: -- Status: Patch Available (was: In Progress) > ongoing concurrent writers with smaller timestamp can cause issues with > table services > --- > > Key: HUDI-7507 > URL: https://issues.apache.org/jira/browse/HUDI-7507 > Project: Apache Hudi > Issue Type: Improvement > Components: table-service >Reporter: Krishen Bhan >Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Attachments: Flowchart (1).png, Flowchart.png > > > *Scenarios:* > Although HUDI operations hold a table lock when creating a .requested > instant, because HUDI writers do not generate a timestamp and create a > .requested plan in the same transaction, there can be a scenario where: > # Job 1 starts and chooses timestamp (x); Job 2 starts and chooses timestamp > (x - 1) > # Job 1 schedules and creates a requested file with instant timestamp (x) > # Job 2 schedules and creates a requested file with instant timestamp (x-1) > # Both jobs continue running > If one job is writing a commit and the other is a table service, this can > cause issues: > * > ** If Job 2 is an ingestion commit and Job 1 is a compaction/log compaction, then > when Job 1 runs before Job 2, it can create a compaction plan for all instant > times (up to (x)) that doesn't include instant time (x-1). Later, Job 2 > will create instant time (x-1), but the timeline will be in a corrupted state > since the compaction plan was supposed to include (x-1) > ** There is a similar issue with clean. If Job 2 is a long-running commit > (that was stuck/delayed for a while before creating its .requested plan) and > Job 1 is a clean, then Job 1 can perform a clean that updates the > earliest-commit-to-retain without waiting for the inflight instant by Job 2 > at (x-1) to complete. This causes Job 2 to be "skipped" by clean. 
> ** If the completed commit files include some sort of "checkpointing" with > another "downstream job" performing incremental reads on this dataset (such > as Hoodie Streamer/DeltaSync), then there may be incorrect behavior, such as > the incremental reader skipping some completed commits (that have a smaller > instant timestamp than the latest completed commit but were created after). > [Edit] I added a diagram to visualize the issue, specifically the second > scenario with clean > !Flowchart (1).png! > *Proposed approach:* > One way this can be resolved is by combining the operations of generating an > instant time and creating a requested file in the same HUDI table > transaction. Specifically, executing the following steps whenever any instant > (commit, table service, etc.) is scheduled > Approach A > # Acquire the table lock > # Look at the latest instant C on the active timeline (completed or not). > Generate a timestamp after C > # Create the plan and requested file using this new timestamp (that is > greater than C) > # Release the table lock > Unfortunately, (A) has the following drawbacks: > * Every operation must now hold the table lock when computing its plan, even > if it's an expensive operation and will take a while > * Users of HUDI cannot easily set their own instant time for an operation, > and this restriction would break any public APIs that allow this and would > require deprecating those APIs. > > An alternate approach is to have every operation abort creating a .requested > file unless it has the latest timestamp. Specifically, for any instant type, > whenever an operation is about to create a .requested plan on the timeline, it > should take the table lock and assert that there are no other instants on the > timeline that are greater than it that could cause a conflict. If that > assertion fails, then throw a retry-able conflict resolution exception. 
> Specifically, the following steps should be followed whenever any instant > (commit, table service, etc.) is scheduled > Approach B > # Acquire the table lock. Assume that the desired instant time C and the requested > file plan metadata have already been created, regardless of whether it was > before this step or right after acquiring the table lock. > # If there are any instants on the timeline that are greater than C > (regardless of their operation type or state), then release the table lock > and throw an exception > # Create the requested plan on the timeline
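The scheduling check in Approach B can be sketched as follows. This is a minimal, hypothetical model of the idea (the timeline is reduced to a sorted set of instant timestamps, and the table lock to `synchronized`); `TimelineGuard` and `tryScheduleRequested` are illustrative names, not actual Hudi APIs.

```java
import java.util.TreeSet;

// Sketch of Approach B: while holding the table lock, abort scheduling
// unless the chosen instant time is greater than or equal to every instant
// already on the timeline (completed or not). Names are hypothetical.
class TimelineGuard {
  // All instant timestamps on the timeline, kept in sorted order.
  private final TreeSet<String> instants = new TreeSet<>();

  // Called inside the table-lock transaction: create the .requested plan
  // only if no instant with a greater timestamp already exists; otherwise
  // the caller should throw a retryable conflict-resolution exception.
  synchronized boolean tryScheduleRequested(String instantTime) {
    if (!instants.isEmpty() && instants.last().compareTo(instantTime) > 0) {
      // A later instant exists; this plan could be skipped by compaction/clean.
      return false;
    }
    instants.add(instantTime); // models writing the .requested file
    return true;
  }

  public static void main(String[] args) {
    TimelineGuard guard = new TimelineGuard();
    System.out.println(guard.tryScheduleRequested("005")); // true
    System.out.println(guard.tryScheduleRequested("004")); // false: 005 already exists
    System.out.println(guard.tryScheduleRequested("006")); // true
  }
}
```

This matches the drawback analysis above: expensive plan computation can still happen outside the lock, and only the final check-and-create runs inside it.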
[jira] [Assigned] (HUDI-7967) Robust handling of spark task failures and retries
[ https://issues.apache.org/jira/browse/HUDI-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-7967: - Assignee: sivabalan narayanan > Robust handling of spark task failures and retries > --- > > Key: HUDI-7967 > URL: https://issues.apache.org/jira/browse/HUDI-7967 > Project: Apache Hudi > Issue Type: Epic > Components: reader-core, writer-core >Reporter: sivabalan narayanan > Assignee: sivabalan narayanan >Priority: Major > Labels: RobustWrites, pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7503) Concurrent executions of table service plan should not corrupt dataset
[ https://issues.apache.org/jira/browse/HUDI-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7503: -- Status: Patch Available (was: In Progress) > Concurrent executions of table service plan should not corrupt dataset > -- > > Key: HUDI-7503 > URL: https://issues.apache.org/jira/browse/HUDI-7503 > Project: Apache Hudi > Issue Type: Improvement > Components: compaction, table-service >Reporter: Krishen Bhan > Assignee: sivabalan narayanan >Priority: Minor > Labels: pull-request-available > Fix For: 0.16.0, 1.0.0 > > > Some external workflow schedulers can accidentally misbehave and > schedule duplicate executions of the same compaction plan. We need a way to > guard against this inside Hudi (vs. the user taking a lock externally). In such a > world, 2 instances of the job concurrently call > `org.apache.hudi.client.BaseHoodieTableServiceClient#compact` on the same > compaction instant. > This is because one writer might execute the instant and create an inflight, > while the other writer sees the inflight and tries to roll it back before > re-attempting to execute it (since it will assume said inflight was a > previously failed compaction attempt). > This logic should be updated such that only one writer will actually execute > the compaction plan at a time (and the others will fail/abort). > One approach is to use a transaction (base table lock) in conjunction with > heartbeating, to ensure that the writer triggers a heartbeat before executing > compaction, and any concurrent writers will use the heartbeat to check whether > the compaction is currently being executed by another writer. 
Specifically, > the compact API should execute the following steps: > # Get the instant to compact, C (as usual) > # Start a transaction > # Check if C has an active heartbeat; if so, finish the transaction and throw an > exception > # Start a heartbeat for C (this will implicitly re-start the heartbeat if it > has been started before by another job) > # Finish the transaction > # Run the existing compact API logic on C > # If execution succeeds, clean up the heartbeat file. If it fails, do nothing > (as the heartbeat will anyway be automatically expired later). > Note that this approach only holds the table lock temporarily, when > checking/starting the heartbeat. > Also, this flow can be applied to the execution of clean plans and other table > services. -- This message was sent by Atlassian Jira (v8.20.10#820010)
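The heartbeat-guarded steps above can be sketched as a small model. This is a hypothetical illustration only: the table-lock transaction is reduced to `synchronized`, heartbeat files to an in-memory map, and `CompactionHeartbeatGuard`, `tryStartExecution`, and `finishExecution` are invented names, not actual Hudi APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of heartbeat-guarded compaction execution: only one writer may
// execute a given compaction instant at a time. Names are hypothetical.
class CompactionHeartbeatGuard {
  private final Map<String, Long> heartbeats = new HashMap<>(); // instant -> last beat time (ms)
  private final long expiryMs;

  CompactionHeartbeatGuard(long expiryMs) {
    this.expiryMs = expiryMs;
  }

  // Steps 2-5, performed inside the table-lock transaction: abort if another
  // writer holds an unexpired heartbeat for this instant, else start ours.
  synchronized boolean tryStartExecution(String instant, long nowMs) {
    Long lastBeat = heartbeats.get(instant);
    if (lastBeat != null && nowMs - lastBeat < expiryMs) {
      return false; // another writer is actively executing this plan
    }
    heartbeats.put(instant, nowMs); // start (or implicitly restart) the heartbeat
    return true;
  }

  // On success, clean up the heartbeat; on failure do nothing and let it expire.
  synchronized void finishExecution(String instant) {
    heartbeats.remove(instant);
  }

  public static void main(String[] args) {
    CompactionHeartbeatGuard guard = new CompactionHeartbeatGuard(5_000);
    System.out.println(guard.tryStartExecution("0010", 0));     // true: first writer wins
    System.out.println(guard.tryStartExecution("0010", 1_000)); // false: heartbeat still live
    System.out.println(guard.tryStartExecution("0010", 9_000)); // true: heartbeat expired
  }
}
```

The expiry check is what makes the description's failure handling work: a crashed writer never calls `finishExecution`, yet its instant becomes executable again once the heartbeat goes stale.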
[jira] [Updated] (HUDI-7968) RFC for robust handling of spark task failures and retries
[ https://issues.apache.org/jira/browse/HUDI-7968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7968: -- Sprint: 2024/06/17-30 > RFC for robust handling of spark task failures and retries > -- > > Key: HUDI-7968 > URL: https://issues.apache.org/jira/browse/HUDI-7968 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core >Reporter: sivabalan narayanan > Assignee: sivabalan narayanan >Priority: Major > Labels: RobustWrites, pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7961) Optimize UpsertPartitioner for prepped write operations
[ https://issues.apache.org/jira/browse/HUDI-7961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7961: -- Status: In Progress (was: Open) > Optimize UpsertPartitioner for prepped write operations > --- > > Key: HUDI-7961 > URL: https://issues.apache.org/jira/browse/HUDI-7961 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core >Reporter: sivabalan narayanan > Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > We have avg record size calculation, etc., in UpsertPartitioner, which does not > make sense for prepped write operations. Also, w/ MDT, we can optimize > these. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7644) Add record key info with RLI validation in MDT Validator
[ https://issues.apache.org/jira/browse/HUDI-7644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan closed HUDI-7644. - Resolution: Fixed > Add record key info with RLI validation in MDT Validator > > > Key: HUDI-7644 > URL: https://issues.apache.org/jira/browse/HUDI-7644 > Project: Apache Hudi > Issue Type: Improvement > Components: metadata, tests-ci >Reporter: sivabalan narayanan > Assignee: sivabalan narayanan >Priority: Major > Fix For: 0.16.0, 1.0.0 > > > Add record key info with RLI validation in MDT Validator -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7961) Optimize UpsertPartitioner for prepped write operations
[ https://issues.apache.org/jira/browse/HUDI-7961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7961: -- Status: Patch Available (was: In Progress) > Optimize UpsertPartitioner for prepped write operations > --- > > Key: HUDI-7961 > URL: https://issues.apache.org/jira/browse/HUDI-7961 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core >Reporter: sivabalan narayanan > Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > We have avg record size calculation, etc., in UpsertPartitioner, which does not > make sense for prepped write operations. Also, w/ MDT, we can optimize > these. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7961) Optimize UpsertPartitioner for prepped write operations
[ https://issues.apache.org/jira/browse/HUDI-7961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7961: -- Fix Version/s: 1.0.0 > Optimize UpsertPartitioner for prepped write operations > --- > > Key: HUDI-7961 > URL: https://issues.apache.org/jira/browse/HUDI-7961 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core >Reporter: sivabalan narayanan > Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > We have avg record size calculation, etc., in UpsertPartitioner, which does not > make sense for prepped write operations. Also, w/ MDT, we can optimize > these. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7961) Optimize UpsertPartitioner for prepped write operations
[ https://issues.apache.org/jira/browse/HUDI-7961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-7961: - Assignee: sivabalan narayanan > Optimize UpsertPartitioner for prepped write operations > --- > > Key: HUDI-7961 > URL: https://issues.apache.org/jira/browse/HUDI-7961 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core >Reporter: sivabalan narayanan > Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > > We have avg record size calculation, etc., in UpsertPartitioner, which does not > make sense for prepped write operations. Also, w/ MDT, we can optimize > these. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7968) RFC for robust handling of spark task failures and retries
[ https://issues.apache.org/jira/browse/HUDI-7968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-7968: - Assignee: sivabalan narayanan > RFC for robust handling of spark task failures and retries > -- > > Key: HUDI-7968 > URL: https://issues.apache.org/jira/browse/HUDI-7968 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core >Reporter: sivabalan narayanan > Assignee: sivabalan narayanan >Priority: Major > Labels: RobustWrites, pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7967) Robust handling of spark task failures and retries
[ https://issues.apache.org/jira/browse/HUDI-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7967: -- Labels: RobustWrites (was: ) > Robust handling of spark task failures and retries > --- > > Key: HUDI-7967 > URL: https://issues.apache.org/jira/browse/HUDI-7967 > Project: Apache Hudi > Issue Type: Epic > Components: reader-core, writer-core >Reporter: sivabalan narayanan >Priority: Major > Labels: RobustWrites > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7968) RFC for robust handling of spark task failures and retries
[ https://issues.apache.org/jira/browse/HUDI-7968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7968: -- Labels: RobustWrites (was: ) > RFC for robust handling of spark task failures and retries > -- > > Key: HUDI-7968 > URL: https://issues.apache.org/jira/browse/HUDI-7968 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core >Reporter: sivabalan narayanan >Priority: Major > Labels: RobustWrites > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7968) RFC for robust handling of spark task failures and retries
sivabalan narayanan created HUDI-7968: - Summary: RFC for robust handling of spark task failures and retries Key: HUDI-7968 URL: https://issues.apache.org/jira/browse/HUDI-7968 Project: Apache Hudi Issue Type: Improvement Components: writer-core Reporter: sivabalan narayanan -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6416) Completion Markers for handling spark retries
[ https://issues.apache.org/jira/browse/HUDI-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-6416: -- Epic Link: HUDI-7967 > Completion Markers for handling spark retries > - > > Key: HUDI-6416 > URL: https://issues.apache.org/jira/browse/HUDI-6416 > Project: Apache Hudi > Issue Type: Bug >Reporter: Balajee Nagasubramaniam >Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > > During Spark stage retries, the Spark driver may have all the information to > reconcile the commit and proceed with the next steps, while a stray executor may > still be writing to a data file and complete later (before the JVM exits). > Extra files left on the dataset, excluded from the reconcile-commit step, could > show up as a data quality issue (duplicate records) for query engines. > This change brings completion markers, which try to prevent the dataset from > experiencing data quality issues in such corner-case scenarios. -- This message was sent by Atlassian Jira (v8.20.10#820010)
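The completion-marker idea in HUDI-6416 can be modeled in a few lines: at reconciliation, only data files with a matching completion marker are kept, so a file finished later by a stray executor (which never recorded a marker before reconciliation) is excluded. This is a hypothetical sketch; `CompletionMarkerReconciler` and `reconcile` are invented names, not the actual Hudi implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: a data file is part of the commit only if its completion marker
// existed at reconciliation time; stray writes have no marker. Hypothetical.
class CompletionMarkerReconciler {
  static List<String> reconcile(List<String> dataFiles, Set<String> completionMarkers) {
    List<String> committed = new ArrayList<>();
    for (String file : dataFiles) {
      if (completionMarkers.contains(file)) {
        committed.add(file); // marker present: task finished before reconciliation
      }
    }
    return committed;
  }

  public static void main(String[] args) {
    List<String> files = Arrays.asList("f1.parquet", "f2.parquet", "f3_stray.parquet");
    Set<String> markers = new HashSet<>(Arrays.asList("f1.parquet", "f2.parquet"));
    // f3_stray.parquet was completed by a stray executor after reconciliation,
    // so it carries no marker and is dropped.
    System.out.println(CompletionMarkerReconciler.reconcile(files, markers));
  }
}
```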
[jira] [Updated] (HUDI-7967) Robust handling of spark task failures and retries
[ https://issues.apache.org/jira/browse/HUDI-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7967: -- Summary: Robust handling of spark task failures and retries (was: RFC for Robust handling of spark task failures and retries ) > Robust handling of spark task failures and retries > --- > > Key: HUDI-7967 > URL: https://issues.apache.org/jira/browse/HUDI-7967 > Project: Apache Hudi > Issue Type: Epic > Components: reader-core, writer-core >Reporter: sivabalan narayanan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7967) RFC for Robust handling of spark task failures and retries
[ https://issues.apache.org/jira/browse/HUDI-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7967: -- Issue Type: Epic (was: Improvement) > RFC for Robust handling of spark task failures and retries > --- > > Key: HUDI-7967 > URL: https://issues.apache.org/jira/browse/HUDI-7967 > Project: Apache Hudi > Issue Type: Epic > Components: reader-core, writer-core >Reporter: sivabalan narayanan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7967) RFC for Robust handling of spark task failures and retries
sivabalan narayanan created HUDI-7967: - Summary: RFC for Robust handling of spark task failures and retries Key: HUDI-7967 URL: https://issues.apache.org/jira/browse/HUDI-7967 Project: Apache Hudi Issue Type: Improvement Components: reader-core, writer-core Reporter: sivabalan narayanan -- This message was sent by Atlassian Jira (v8.20.10#820010)
(hudi) branch master updated: Revert "[HUDI-7709] ClassCastException while reading the data using `TimestampBasedKeyGenerator` (#11501)" (#11586)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 26ac119ee25 Revert "[HUDI-7709] ClassCastException while reading the data using `TimestampBasedKeyGenerator` (#11501)" (#11586) 26ac119ee25 is described below commit 26ac119ee25f03ff079bb396b5f397ee1264c406 Author: Sagar Sumit AuthorDate: Mon Jul 8 10:40:34 2024 +0530 Revert "[HUDI-7709] ClassCastException while reading the data using `TimestampBasedKeyGenerator` (#11501)" (#11586) This reverts commit ae1ee05ab8c2bd732e57bee11c8748926b05ec4b. --- .../org/apache/hudi/BaseHoodieTableFileIndex.java | 24 +--- .../hudi/common/table/HoodieTableConfig.java | 2 - .../main/scala/org/apache/hudi/DefaultSource.scala | 3 + .../TestSparkSqlWithTimestampKeyGenerator.scala| 148 - 4 files changed, 9 insertions(+), 168 deletions(-) diff --git a/hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java b/hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java index 9cdf1adf971..5a0fd79fcc4 100644 --- a/hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java +++ b/hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java @@ -19,7 +19,6 @@ package org.apache.hudi; import org.apache.hudi.common.config.HoodieMetadataConfig; -import org.apache.hudi.common.config.TimestampKeyGeneratorConfig; import org.apache.hudi.common.config.TypedProperties; import org.apache.hudi.common.engine.HoodieEngineContext; import org.apache.hudi.common.fs.FSUtils; @@ -27,7 +26,6 @@ import org.apache.hudi.common.model.BaseFile; import org.apache.hudi.common.model.FileSlice; import org.apache.hudi.common.model.HoodieLogFile; import org.apache.hudi.common.model.HoodieTableQueryType; -import org.apache.hudi.common.table.HoodieTableConfig; import org.apache.hudi.common.table.HoodieTableMetaClient; import 
org.apache.hudi.common.table.timeline.HoodieInstant; import org.apache.hudi.common.table.timeline.HoodieTimeline; @@ -42,7 +40,6 @@ import org.apache.hudi.exception.HoodieException; import org.apache.hudi.exception.HoodieIOException; import org.apache.hudi.expression.Expression; import org.apache.hudi.internal.schema.Types; -import org.apache.hudi.keygen.constant.KeyGeneratorType; import org.apache.hudi.metadata.HoodieTableMetadata; import org.apache.hudi.metadata.HoodieTableMetadataUtil; import org.apache.hudi.storage.HoodieStorage; @@ -356,22 +353,13 @@ public abstract class BaseHoodieTableFileIndex implements AutoCloseable { } private Object[] parsePartitionColumnValues(String[] partitionColumns, String partitionPath) { -HoodieTableConfig tableConfig = metaClient.getTableConfig(); -Object[] partitionColumnValues; -if (null != tableConfig.getKeyGeneratorClassName() -&& tableConfig.getKeyGeneratorClassName().equals(KeyGeneratorType.TIMESTAMP.getClassName()) -&& tableConfig.propsMap().get(TimestampKeyGeneratorConfig.TIMESTAMP_TYPE_FIELD.key()).matches("SCALAR|UNIX_TIMESTAMP|EPOCHMILLISECONDS")) { - // For TIMESTAMP key generator when TYPE is SCALAR, UNIX_TIMESTAMP or EPOCHMILLISECONDS, - // we couldn't reconstruct initial partition column values from partition paths due to lost data after formatting in most cases - partitionColumnValues = new Object[partitionColumns.length]; -} else { - partitionColumnValues = doParsePartitionColumnValues(partitionColumns, partitionPath); - if (shouldListLazily && partitionColumnValues.length != partitionColumns.length) { -throw new HoodieException("Failed to parse partition column values from the partition-path:" -+ " likely non-encoded slashes being used in partition column's values. 
You can try to" -+ " work this around by switching listing mode to eager"); - } +Object[] partitionColumnValues = doParsePartitionColumnValues(partitionColumns, partitionPath); +if (shouldListLazily && partitionColumnValues.length != partitionColumns.length) { + throw new HoodieException("Failed to parse partition column values from the partition-path:" + + " likely non-encoded slashes being used in partition column's values. You can try to" + + " work this around by switching listing mode to eager"); } + return partitionColumnValues; } diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java b/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java index 6053278d831..117b64ba29d 100644 --- a/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig
(hudi) branch master updated: [HUDI-7966] Handle NPE from AvroSchemaUtils.createNewSchemaFromFieldsWithReference (#11585)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 6d01bcfa6a2 [HUDI-7966] Handle NPE from AvroSchemaUtils.createNewSchemaFromFieldsWithReference (#11585) 6d01bcfa6a2 is described below commit 6d01bcfa6a26fb890d64802f6e4a16debb0bb11f Author: Sagar Sumit AuthorDate: Mon Jul 8 10:38:23 2024 +0530 [HUDI-7966] Handle NPE from AvroSchemaUtils.createNewSchemaFromFieldsWithReference (#11585) --- .../java/org/apache/hudi/avro/AvroSchemaUtils.java | 15 +++- .../read/HoodieFileGroupReaderSchemaHandler.java | 8 ++--- .../org/apache/hudi/avro/TestAvroSchemaUtils.java | 40 ++ 3 files changed, 56 insertions(+), 7 deletions(-) diff --git a/hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java b/hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java index 642624e0ed3..59917f6c652 100644 --- a/hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java +++ b/hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java @@ -27,6 +27,8 @@ import org.apache.hudi.exception.SchemaCompatibilityException; import org.apache.avro.Schema; import org.apache.avro.SchemaCompatibility; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; import java.util.ArrayDeque; import java.util.ArrayList; @@ -46,6 +48,8 @@ import static org.apache.hudi.common.util.ValidationUtils.checkState; */ public class AvroSchemaUtils { + private static final Logger LOG = LoggerFactory.getLogger(AvroSchemaUtils.class); + private AvroSchemaUtils() {} /** @@ -302,8 +306,17 @@ public class AvroSchemaUtils { * @return schema with fields from fields, and metadata from schema */ public static Schema createNewSchemaFromFieldsWithReference(Schema schema, List fields) { +if (schema == null) { + throw new IllegalArgumentException("Schema must not be null"); +} Schema newSchema = 
Schema.createRecord(schema.getName(), schema.getDoc(), schema.getNamespace(), schema.isError()); -for (Map.Entry prop : schema.getObjectProps().entrySet()) { +Map schemaProps = Collections.emptyMap(); +try { + schemaProps = schema.getObjectProps(); +} catch (Exception e) { + LOG.warn("Error while getting object properties from schema: {}", schema, e); +} +for (Map.Entry prop : schemaProps.entrySet()) { newSchema.addProp(prop.getKey(), prop.getValue()); } newSchema.setFields(fields); diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReaderSchemaHandler.java b/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReaderSchemaHandler.java index 85d0cef2893..8a962fdc2a7 100644 --- a/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReaderSchemaHandler.java +++ b/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReaderSchemaHandler.java @@ -40,6 +40,7 @@ import java.util.stream.Collectors; import java.util.stream.Stream; import static org.apache.hudi.avro.AvroSchemaUtils.appendFieldsToSchemaDedupNested; +import static org.apache.hudi.avro.AvroSchemaUtils.createNewSchemaFromFieldsWithReference; import static org.apache.hudi.avro.AvroSchemaUtils.findNestedField; /** @@ -178,11 +179,6 @@ public class HoodieFileGroupReaderSchemaHandler { Schema.Field curr = fields.get(i); fields.set(i, new Schema.Field(curr.name(), curr.schema(), curr.doc(), curr.defaultVal())); } -Schema newSchema = Schema.createRecord(dataSchema.getName(), dataSchema.getDoc(), dataSchema.getNamespace(), dataSchema.isError()); -for (Map.Entry prop : dataSchema.getObjectProps().entrySet()) { - newSchema.addProp(prop.getKey(), prop.getValue()); -} -newSchema.setFields(fields); -return newSchema; +return createNewSchemaFromFieldsWithReference(dataSchema, fields); } } diff --git a/hudi-common/src/test/java/org/apache/hudi/avro/TestAvroSchemaUtils.java 
b/hudi-common/src/test/java/org/apache/hudi/avro/TestAvroSchemaUtils.java index 37cea0d93a9..a11338d4017 100644 --- a/hudi-common/src/test/java/org/apache/hudi/avro/TestAvroSchemaUtils.java +++ b/hudi-common/src/test/java/org/apache/hudi/avro/TestAvroSchemaUtils.java @@ -115,6 +115,46 @@ public class TestAvroSchemaUtils { + " ]\n" + "}\n"; + @Test + public void testCreateNewSchemaFromFieldsWithReference_NullSchema() { +// This test should throw an IllegalArgumentException +assertThrows(IllegalArgumentException.class, () -> AvroSchemaUtils.createNewSchemaFromFieldsWithReference(null, Collections.emptyList())); + } + + @Test + public void testCreateNewSchemaFromFieldsWithReference_NullObjectProps() { +// Create a schema without any object p
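The defensive pattern in this patch (fail fast on a null schema, fall back to an empty property map if reading object props throws) can be sketched without an Avro dependency. `SchemaLike` below is a hypothetical stand-in for `org.apache.avro.Schema`, so this is an illustration of the pattern, not Hudi code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SchemaPropsCopy {

  // Hypothetical stand-in for the part of org.apache.avro.Schema used here.
  public interface SchemaLike {
    Map<String, Object> getObjectProps();
  }

  // Null input is rejected up front; any failure (or null) while reading the
  // source props degrades to an empty map instead of an NPE, mirroring the
  // guard added to createNewSchemaFromFieldsWithReference.
  public static Map<String, Object> copyProps(SchemaLike schema) {
    if (schema == null) {
      throw new IllegalArgumentException("Schema must not be null");
    }
    Map<String, Object> props = Collections.emptyMap();
    try {
      Map<String, Object> fromSchema = schema.getObjectProps();
      if (fromSchema != null) {
        props = fromSchema;
      }
    } catch (RuntimeException e) {
      // In the real patch this is logged and the copy proceeds without props.
    }
    return new HashMap<>(props);
  }
}
```

The real method then adds each surviving property to the freshly created record schema before calling `setFields`.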
[jira] [Created] (HUDI-7963) Avoid generating RLI records when disabled w/ MDT
sivabalan narayanan created HUDI-7963: - Summary: Avoid generating RLI records when disabled w/ MDT Key: HUDI-7963 URL: https://issues.apache.org/jira/browse/HUDI-7963 Project: Apache Hudi Issue Type: Improvement Components: metadata Reporter: sivabalan narayanan -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7961) Optimize UpsertPartitioner for prepped write operations
sivabalan narayanan created HUDI-7961: - Summary: Optimize UpsertPartitioner for prepped write operations Key: HUDI-7961 URL: https://issues.apache.org/jira/browse/HUDI-7961 Project: Apache Hudi Issue Type: Improvement Components: writer-core Reporter: sivabalan narayanan We have avg record size calculation etc. in UpsertPartitioner, which does not make sense for prepped write operations. Also, w/ MDT, we can optimize these. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0
[ https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7882: -- Status: Patch Available (was: In Progress) > Umbrella ticket to track all changes required to support reading 1.x tables > with 0.16.0 > > > Key: HUDI-7882 > URL: https://issues.apache.org/jira/browse/HUDI-7882 > Project: Apache Hudi > Issue Type: Improvement > Components: reader-core > Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0, 1.0.0 > > > We wanted to support reading 1.x tables in 0.16.0 release. So, creating this > umbrella ticket to track all of them. > > RFC in progress: [https://github.com/apache/hudi/pull/11514] > > Changes required to be ported: > 0. Creating 0.16.0 branch > 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. > > 1. Timeline > 1.a Hoodie instant parsing should be able to read 1.x instants. > https://issues.apache.org/jira/browse/HUDI-7883 Sagar. > 1.b Commit metadata parsing is able to handle both json and avro formats. > Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 > Siva. > 1.c HoodieDefaultTimeline able to read both timelines based on table version. > https://issues.apache.org/jira/browse/HUDI-7884 Siva. > 1.d Reading LSM timeline using 0.16.0 > https://issues.apache.org/jira/browse/HUDI-7890 Siva. > 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901 > > 2. Table property changes > 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885 > https://issues.apache.org/jira/browse/HUDI-7865 LJ > > 3. MDT table changes > 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ > 3.b MDT payload schema changes. > https://issues.apache.org/jira/browse/HUDI-7886 LJ > > 4. 
Log format changes > 4.a All metadata header types porting > https://issues.apache.org/jira/browse/HUDI-7887 Jon > 4.b Meaningful error for incompatible features from 1.x > https://issues.apache.org/jira/browse/HUDI-7888 Jon > > 5. Log file slice or grouping detection compatibility > > 5. Tests > 5.a Tests to validate that 1.x tables can be read w/ 0.16.0 > https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. > > 6 Doc changes > 6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. > https://issues.apache.org/jira/browse/HUDI-7889 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services
[ https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7507: -- Status: In Progress (was: Open) > ongoing concurrent writers with smaller timestamp can cause issues with > table services > --- > > Key: HUDI-7507 > URL: https://issues.apache.org/jira/browse/HUDI-7507 > Project: Apache Hudi > Issue Type: Improvement > Components: table-service >Reporter: Krishen Bhan >Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Attachments: Flowchart (1).png, Flowchart.png > > > *Scenarios:* > Although HUDI operations hold a table lock when creating a .requested > instant, because HUDI writers do not generate a timestamp and create a > .requested plan in the same transaction, there can be a scenario where > # Job 1 starts, chooses timestamp (x), Job 2 starts and chooses timestamp > (x - 1) > # Job 1 schedules and creates requested file with instant timestamp (x) > # Job 2 schedules and creates requested file with instant timestamp (x-1) > # Both jobs continue running > If one job is writing a commit and the other is a table service, this can > cause issues: > * > ** If Job 2 is an ingestion commit and Job 1 is compaction/log compaction, then > Job 1 can run before Job 2 and create a compaction plan for all instant > times (up to (x)) that doesn't include instant time (x-1). Later Job 2 > will create instant time (x-1), but the timeline will be in a corrupted state > since the compaction plan was supposed to include (x-1) > ** There is a similar issue with clean. If Job 2 is a long-running commit > (that was stuck/delayed for a while before creating its .requested plan) and > Job 1 is a clean, then Job 1 can perform a clean that updates the > earliest-commit-to-retain without waiting for the inflight instant by Job 2 > at (x-1) to complete. This causes Job 2 to be "skipped" by clean. 
> ** If the completed commit files include some sort of "checkpointing" with > another "downstream job" performing incremental reads on this dataset (such > as Hoodie Streamer/DeltaSync) then there may be incorrect behavior, such as > the incremental reader skipping some completed commits (that have a smaller > instant timestamp than the latest completed commit but were created after). > [Edit] I added a diagram to visualize the issue, specifically the second > scenario with clean > !Flowchart (1).png! > *Proposed approach:* > One way this can be resolved is by combining the operations of generating an > instant time and creating a requested file in the same HUDI table > transaction. Specifically, executing the following steps whenever any instant > (commit, table service, etc.) is scheduled > Approach A > # Acquire table lock > # Look at the latest instant C on the active timeline (completed or not). > Generate a timestamp after C > # Create the plan and requested file using this new timestamp (that is > greater than C) > # Release table lock > Unfortunately (A) has the following drawbacks > * Every operation must now hold the table lock when computing its plan even > if it's an expensive operation and will take a while > * Users of HUDI cannot easily set their own instant time of an operation, > and this restriction would break any public APIs that allow this and would > require deprecating those APIs. > > An alternate approach is to have every operation abort creating a .requested > file unless it has the latest timestamp. Specifically, for any instant type, > whenever an operation is about to create a .requested plan on the timeline, it > should take the table lock and assert that there are no other instants on the > timeline that are greater than it that could cause a conflict. If that > assertion fails, then throw a retry-able conflict resolution exception. 
> Specifically, the following steps should be followed whenever any instant > (commit, table service, etc.) is scheduled > Approach B > # Acquire table lock. Assume that the desired instant time C and requested > file plan metadata have already been created, regardless of whether it was > before this step or right after acquiring the table lock. > # If there are any instants on the timeline that are greater than C > (regardless of their operation type or state status) then release the table lock > and throw an exception > # Create requested plan on timeline (as usual)
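The steps of Approach B can be sketched as a small lock-guarded check, assuming an in-memory timeline of instant timestamps; `ConflictException` and the `TreeSet` timeline representation are illustrative, not actual Hudi APIs.

```java
import java.util.TreeSet;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch of Approach B: before publishing a .requested plan at
// instant time c, hold the table lock and abort if any instant greater than c
// already exists on the timeline.
public class ApproachB {

  static class ConflictException extends RuntimeException {
    ConflictException(String msg) { super(msg); }
  }

  private final ReentrantLock tableLock = new ReentrantLock();
  private final TreeSet<Long> timeline = new TreeSet<>();

  // Returns true once the requested plan for instant c is on the timeline.
  public boolean createRequested(long c) {
    tableLock.lock();
    try {
      // Step 2: any instant > c (regardless of type or state) is a conflict;
      // the caller is expected to retry with a fresh timestamp.
      if (timeline.higher(c) != null) {
        throw new ConflictException("Instant newer than " + c + " already on timeline; retry");
      }
      // Step 3: create the requested plan on the timeline as usual.
      timeline.add(c);
      return true;
    } finally {
      tableLock.unlock();
    }
  }
}
```

This keeps the critical section short (one membership check plus the plan publication), avoiding Approach A's drawback of holding the lock while the plan is computed.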
[jira] [Updated] (HUDI-6416) Completion Markers for handling spark retries
[ https://issues.apache.org/jira/browse/HUDI-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-6416: -- Status: In Progress (was: Open) > Completion Markers for handling spark retries > - > > Key: HUDI-6416 > URL: https://issues.apache.org/jira/browse/HUDI-6416 > Project: Apache Hudi > Issue Type: Bug >Reporter: Balajee Nagasubramaniam >Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > > During Spark stage retries, the Spark driver may have all the information to > reconcile the commit and proceed with next steps, while a stray executor may > still be writing to a data file and complete later (before the JVM exit). > Extra files left on the dataset, excluded from the reconcile-commit step, could > show up as a data quality issue for query engines, with duplicate records. > This change brings completion markers, which try to prevent the dataset from > experiencing data quality issues in such corner-case scenarios. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0
[ https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7882: -- Description: We wanted to support reading 1.x tables in 0.16.0 release. So, creating this umbrella ticket to track all of them. RFC in progress: [https://github.com/apache/hudi/pull/11514] Changes required to be ported: 0. Creating 0.16.0 branch 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 1. Timeline 1.a Hoodie instant parsing should be able to read 1.x instants. https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 1.b Commit metadata parsing is able to handle both json and avro formats. Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 Siva. 1.c HoodieDefaultTimeline able to read both timelines based on table version. https://issues.apache.org/jira/browse/HUDI-7884 Siva. 1.d Reading LSM timeline using 0.16.0 https://issues.apache.org/jira/browse/HUDI-7890 Siva. 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901 2. Table property changes 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885 https://issues.apache.org/jira/browse/HUDI-7865 LJ 3. MDT table changes 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ 3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 LJ 4. Log format changes 4.a All metadata header types porting https://issues.apache.org/jira/browse/HUDI-7887 Jon 4.b Meaningful error for incompatible features from 1.x https://issues.apache.org/jira/browse/HUDI-7888 Jon 5. Log file slice or grouping detection compatibility 5. Tests 5.a Tests to validate that 1.x tables can be read w/ 0.16.0 https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 6 Doc changes 6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. https://issues.apache.org/jira/browse/HUDI-7889 was: We wanted to support reading 1.x tables in 0.16.0 release. 
So, creating this umbrella ticket to track all of them. Changes required to be ported: 0. Creating 0.16.0 branch 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 1. Timeline 1.a Hoodie instant parsing should be able to read 1.x instants. https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 1.b Commit metadata parsing is able to handle both json and avro formats. Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 Siva. 1.c HoodieDefaultTimeline able to read both timelines based on table version. https://issues.apache.org/jira/browse/HUDI-7884 Siva. 1.d Reading LSM timeline using 0.16.0 https://issues.apache.org/jira/browse/HUDI-7890 Siva. 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901 2. Table property changes 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885 https://issues.apache.org/jira/browse/HUDI-7865 LJ 3. MDT table changes 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ 3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 LJ 4. Log format changes 4.a All metadata header types porting https://issues.apache.org/jira/browse/HUDI-7887 Jon 4.b Meaningful error for incompatible features from 1.x https://issues.apache.org/jira/browse/HUDI-7888 Jon 5. Log file slice or grouping detection compatibility 5. Tests 5.a Tests to validate that 1.x tables can be read w/ 0.16.0 https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 6 Doc changes 6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. 
https://issues.apache.org/jira/browse/HUDI-7889
[jira] [Updated] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services
[ https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7507: -- Sprint: 2024/06/17-30
[jira] [Assigned] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services
[ https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-7507: - Assignee: sivabalan narayanan
[jira] [Assigned] (HUDI-6416) Completion Markers for handling spark retries
[ https://issues.apache.org/jira/browse/HUDI-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-6416: - Assignee: sivabalan narayanan
[jira] [Updated] (HUDI-6416) Completion Markers for handling spark retries
[ https://issues.apache.org/jira/browse/HUDI-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-6416: -- Sprint: 2024/06/17-30
(hudi) branch master updated: [HUDI-7903] Fix storage partition stats index to skip data (#11472)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new eeafa734a6c [HUDI-7903] Fix storage partition stats index to skip data (#11472) eeafa734a6c is described below commit eeafa734a6cfb431b608cc67f8dcfe66249d7ec3 Author: Sagar Sumit AuthorDate: Sat Jun 29 20:08:11 2024 +0530 [HUDI-7903] Fix storage partition stats index to skip data (#11472) * [HUDI-7903] Fix storage partition stats index to skip data * Fix SQL and checkstyle * handle npe in partition stats records * Remove decimal field due to Spark 2.4 cast issue * Address test comments - Co-authored-by: Y Ethan Guo --- .../hudi/metadata/HoodieMetadataPayload.java | 5 +- .../apache/hudi/PartitionStatsIndexSupport.scala | 4 +- .../TestPartitionStatsIndexWithSql.scala | 363 + 3 files changed, 302 insertions(+), 70 deletions(-) diff --git a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java index 7b9359d5dbb..36ed57c87f5 100644 --- a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java +++ b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java @@ -68,6 +68,7 @@ import java.util.stream.Stream; import static org.apache.hudi.avro.HoodieAvroUtils.unwrapAvroValueWrapper; import static org.apache.hudi.avro.HoodieAvroUtils.wrapValueIntoAvro; +import static org.apache.hudi.common.util.StringUtils.nonEmpty; import static org.apache.hudi.common.util.TypeUtils.unsafeCast; import static org.apache.hudi.common.util.ValidationUtils.checkArgument; import static org.apache.hudi.common.util.ValidationUtils.checkState; @@ -686,10 +687,11 @@ public class HoodieMetadataPayload implements HoodieRecordPayload { HoodieKey key = new HoodieKey(getPartitionStatsIndexKey(partitionPath, 
columnRangeMetadata.getColumnName()), MetadataPartitionType.PARTITION_STATS.getPartitionPath()); + String fileName = nonEmpty(columnRangeMetadata.getFilePath()) ? new StoragePath(columnRangeMetadata.getFilePath()).getName() : null; HoodieMetadataPayload payload = new HoodieMetadataPayload(key.getRecordKey(), HoodieMetadataColumnStats.newBuilder() - .setFileName(null) + .setFileName(fileName) .setColumnName(columnRangeMetadata.getColumnName()) .setMinValue(wrapValueIntoAvro(columnRangeMetadata.getMinValue())) .setMaxValue(wrapValueIntoAvro(columnRangeMetadata.getMaxValue())) @@ -713,7 +715,6 @@ public class HoodieMetadataPayload implements HoodieRecordPayload new ColumnIndexID(colName).asBase64EncodedString()) +logDebug(s"Loading column stats for columns: ${targetColumns.mkString(", ")}, Encoded column names: ${encodedTargetColumnNames.mkString(", ")}") val metadataRecords: HoodieData[HoodieRecord[HoodieMetadataPayload]] = metadataTable.getRecordsByKeyPrefixes(encodedTargetColumnNames.asJava, HoodieTableMetadataUtil.PARTITION_NAME_PARTITION_STATS, shouldReadInMemory) val columnStatsRecords: HoodieData[HoodieMetadataColumnStats] = diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndexWithSql.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndexWithSql.scala index 0e337cc3995..2be4a37c915 100644 --- a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndexWithSql.scala +++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndexWithSql.scala @@ -19,11 +19,20 @@ package org.apache.hudi.functional -import org.apache.hudi.common.model.WriteOperationType +import org.apache.hudi.client.common.HoodieSparkEngineContext +import org.apache.hudi.common.config.HoodieMetadataConfig +import org.apache.hudi.common.model.{FileSlice, WriteOperationType} import 
org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient} import org.apache.hudi.common.testutils.HoodieTestUtils - +import org.apache.hudi.metadata.HoodieMetadataFileSystemView +import org.apache.hudi.metadata.MetadataPartitionType.PARTITION_STATS +import org.apache.hudi.util.JFunction +import org.apache.hudi.{DataSourceReadOptions, HoodieFileIndex} +import org.apache.spark.api.java.JavaSparkContext +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression, GreaterThan, LessThan, Literal} import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase +import org.apache.spark.sql.types.{IntegerType, StringType} +import org.junit.jupiter.api
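The null-safe file name derivation in the HUDI-7903 fix (only set a file name on the partition-stats payload when the column-range metadata carries a non-empty file path) can be sketched as follows; the helper below mimics `new StoragePath(filePath).getName()` with plain string handling and is an illustration, not the Hudi implementation.

```java
// Illustrative sketch: derive the file name stored in the partition stats
// record from the column-range metadata's file path, keeping null when the
// path is absent (which previously could surface as an NPE downstream).
public class FileNameFromPath {

  public static String fileName(String filePath) {
    // nonEmpty(...) guard from the patch: null or "" yields no file name.
    if (filePath == null || filePath.isEmpty()) {
      return null;
    }
    // Equivalent of StoragePath#getName: the last path segment.
    int slash = filePath.lastIndexOf('/');
    return slash >= 0 ? filePath.substring(slash + 1) : filePath;
  }
}
```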
(hudi) branch master updated: [HUDI-7908] HoodieFileGroupReader fails if preCombine and partition fields are the same (#11473)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 107cffca4af [HUDI-7908] HoodieFileGroupReader fails if preCombine and partition fields are the same (#11473) 107cffca4af is described below commit 107cffca4aff52140dc8a59e120cf9bf695e4979 Author: Vova Kolmakov AuthorDate: Sat Jun 29 12:42:57 2024 +0700 [HUDI-7908] HoodieFileGroupReader fails if preCombine and partition fields are the same (#11473) Co-authored-by: Vova Kolmakov --- ...odieFileGroupReaderBasedParquetFileFormat.scala | 4 +- .../spark/sql/hudi/dml/TestInsertTable.scala | 44 ++ 2 files changed, 47 insertions(+), 1 deletion(-) diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala index 0ad2bda9cf2..f7c09b007be 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala +++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala @@ -107,7 +107,9 @@ class HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState, hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] = { //dataSchema is not always right due to spark bugs val partitionColumns = partitionSchema.fieldNames -val dataSchema = StructType(tableSchema.structTypeSchema.fields.filterNot(f => partitionColumns.contains(f.name))) +val preCombineField = options.getOrElse(HoodieTableConfig.PRECOMBINE_FIELD.key, "") +val dataSchema = 
StructType(tableSchema.structTypeSchema.fields.filterNot(f => partitionColumns.contains(f.name) + && preCombineField.equals(f.name))) val outputSchema = StructType(requiredSchema.fields ++ partitionSchema.fields) val isCount = requiredSchema.isEmpty && !isMOR && !isIncremental val augmentedStorageConf = new HadoopStorageConfiguration(hadoopConf).getInline diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/TestInsertTable.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/TestInsertTable.scala index b46723e97bf..0b1d4ca8999 100644 --- a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/TestInsertTable.scala +++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/TestInsertTable.scala @@ -2587,6 +2587,50 @@ class TestInsertTable extends HoodieSparkSqlTestBase { }) } + test(s"Test INSERT INTO with upsert operation type") { +if (HoodieSparkUtils.gteqSpark3_2) { + withTempDir { tmp => +Seq("mor").foreach { tableType => + val tableName = generateTableName + spark.sql( +s""" + |create table $tableName ( + | id int, + | name string, + | ts long, + | price int + |) using hudi + |partitioned by (ts) + |tblproperties ( + | type = '$tableType', + | primaryKey = 'id', + | preCombineField = 'ts' + |) + |location '${tmp.getCanonicalPath}/$tableName' + |""".stripMargin + ) + + // Test insert into with upsert operation type + spark.sql( +s""" + | insert into $tableName + | values (1, 'a1', 1000, 10), (2, 'a2', 2000, 20), (3, 'a3', 3000, 30), (4, 'a4', 2000, 10), (5, 'a5', 3000, 20), (6, 'a6', 4000, 30) + | """.stripMargin + ) + checkAnswer(s"select id, name, price, ts from $tableName where price > 3000")( +Seq(6, "a6", 4000, 30) + ) + + // Test update + spark.sql(s"update $tableName set price = price + 1 where id = 6") + checkAnswer(s"select id, name, price, ts from $tableName where price > 3000")( +Seq(6, "a6", 4001, 30) + ) +} + } +} + } + test("Test 
Insert Into with extraMetadata") { withTempDir { tmp => val tableName = generateTableName
[jira] [Assigned] (HUDI-7503) Concurrent executions of table service plan should not corrupt dataset
[ https://issues.apache.org/jira/browse/HUDI-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-7503: - Assignee: (was: sivabalan narayanan) > Concurrent executions of table service plan should not corrupt dataset > -- > > Key: HUDI-7503 > URL: https://issues.apache.org/jira/browse/HUDI-7503 > Project: Apache Hudi > Issue Type: Improvement > Components: compaction, table-service > Reporter: Krishen Bhan > Priority: Minor > Labels: pull-request-available > Fix For: 0.16.0, 1.0.0 > > > Some external workflow schedulers can, accidentally or through misbehavior, > schedule duplicate executions of the same compaction plan. We need a way to > guard against this inside Hudi (vs. the user taking a lock externally). In such a > world, 2 instances of the job concurrently call > `org.apache.hudi.client.BaseHoodieTableServiceClient#compact` on the same > compaction instant. > This is because one writer might execute the instant and create an inflight, > while the other writer sees the inflight and tries to roll it back before > re-attempting to execute it (since it will assume said inflight was a > previously failed compaction attempt). > This logic should be updated such that only one writer will actually execute > the compaction plan at a time (and the others will fail/abort). > One approach is to use a transaction (base table lock) in conjunction with > heartbeating, to ensure that the writer triggers a heartbeat before executing > compaction, and any concurrent writers will use the heartbeat to check whether > the compaction is currently being executed by another writer.
Specifically, > the compact API should execute the following steps: > # Get the instant to compact, C (as usual) > # Start a transaction > # Check if C has an active heartbeat; if so, finish the transaction and throw an > exception > # Start a heartbeat for C (this will implicitly re-start the heartbeat if it > has been started before by another job) > # Finish the transaction > # Run the existing compact API logic on C > # If execution succeeds, clean up the heartbeat file. If it fails, do nothing > (as the heartbeat will be automatically expired later anyway). > Note that this approach only holds the table lock temporarily, when > checking/starting the heartbeat. > Also, this flow can be applied to the execution of clean plans and other table > services. -- This message was sent by Atlassian Jira (v8.20.10#820010)
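Editor's note: the heartbeat-guarded flow proposed in the ticket can be sketched as follows. This is a hypothetical illustration: the ReentrantLock stands in for Hudi's table-level transaction, the in-memory map for heartbeat files, and none of the names below are real Hudi APIs.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the HUDI-7503 proposal: hold the lock only while checking and
// starting the heartbeat, then run compaction outside the lock.
public class GuardedCompaction {
    private static final long HEARTBEAT_EXPIRY_MS = 60_000;
    private final ReentrantLock tableLock = new ReentrantLock();
    private final Map<String, Long> heartbeats = new ConcurrentHashMap<>();

    // Simulates another writer having already started a heartbeat for an instant.
    void startHeartbeat(String instant) {
        heartbeats.put(instant, System.currentTimeMillis());
    }

    boolean tryCompact(String instant, Runnable compactLogic) {
        tableLock.lock();                                   // step 2: start a transaction
        try {
            Long beat = heartbeats.get(instant);
            if (beat != null && System.currentTimeMillis() - beat < HEARTBEAT_EXPIRY_MS) {
                return false;                               // step 3: active heartbeat, abort
            }
            startHeartbeat(instant);                        // step 4: start/renew our heartbeat
        } finally {
            tableLock.unlock();                             // step 5: finish the transaction
        }
        compactLogic.run();                                 // step 6: compact outside the lock
        heartbeats.remove(instant);                         // step 7: clean up on success
        return true;
    }

    public static void main(String[] args) {
        GuardedCompaction gc = new GuardedCompaction();
        System.out.println(gc.tryCompact("c1", () -> {}));  // first writer proceeds: true
        gc.startHeartbeat("c2");
        System.out.println(gc.tryCompact("c2", () -> {}));  // concurrent writer aborts: false
    }
}
```

Note the failure path leaves the heartbeat in place on purpose: expiry, not cleanup, is what lets a later retry proceed after a crash.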
[jira] [Assigned] (HUDI-7503) Concurrent executions of table service plan should not corrupt dataset
[ https://issues.apache.org/jira/browse/HUDI-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-7503: - Assignee: sivabalan narayanan > Concurrent executions of table service plan should not corrupt dataset > -- > > Key: HUDI-7503 > URL: https://issues.apache.org/jira/browse/HUDI-7503 > Project: Apache Hudi > Issue Type: Improvement > Components: compaction, table-service > Reporter: Krishen Bhan > Assignee: sivabalan narayanan > Priority: Minor > Labels: pull-request-available > Fix For: 0.16.0, 1.0.0 > > > Some external workflow schedulers can, accidentally or through misbehavior, > schedule duplicate executions of the same compaction plan. We need a way to > guard against this inside Hudi (vs. the user taking a lock externally). In such a > world, 2 instances of the job concurrently call > `org.apache.hudi.client.BaseHoodieTableServiceClient#compact` on the same > compaction instant. > This is because one writer might execute the instant and create an inflight, > while the other writer sees the inflight and tries to roll it back before > re-attempting to execute it (since it will assume said inflight was a > previously failed compaction attempt). > This logic should be updated such that only one writer will actually execute > the compaction plan at a time (and the others will fail/abort). > One approach is to use a transaction (base table lock) in conjunction with > heartbeating, to ensure that the writer triggers a heartbeat before executing > compaction, and any concurrent writers will use the heartbeat to check whether > the compaction is currently being executed by another writer.
Specifically, > the compact API should execute the following steps: > # Get the instant to compact, C (as usual) > # Start a transaction > # Check if C has an active heartbeat; if so, finish the transaction and throw an > exception > # Start a heartbeat for C (this will implicitly re-start the heartbeat if it > has been started before by another job) > # Finish the transaction > # Run the existing compact API logic on C > # If execution succeeds, clean up the heartbeat file. If it fails, do nothing > (as the heartbeat will be automatically expired later anyway). > Note that this approach only holds the table lock temporarily, when > checking/starting the heartbeat. > Also, this flow can be applied to the execution of clean plans and other table > services. -- This message was sent by Atlassian Jira (v8.20.10#820010)
(hudi) branch branch-0.x updated: [HUDI-6508] Support compilation on Java 11 (#11513)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch branch-0.x in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/branch-0.x by this push: new 538e6619ed5 [HUDI-6508] Support compilation on Java 11 (#11513) 538e6619ed5 is described below commit 538e6619ed50cd64d12652058e4b5c68cfef0f99 Author: Y Ethan Guo AuthorDate: Tue Jun 25 18:56:01 2024 -0700 [HUDI-6508] Support compilation on Java 11 (#11513) --- .github/workflows/bot.yml | 167 +++-- .../hudi/table/TestHoodieMergeOnReadTable.java | 8 +- .../commit/TestCopyOnWriteActionExecutor.java | 15 +- .../hudi/metadata/HoodieTableMetadataUtil.java | 21 ++- hudi-examples/hudi-examples-common/pom.xml | 14 -- hudi-examples/hudi-examples-java/pom.xml | 14 -- .../org/apache/hudi/common/util/ParquetUtils.java | 21 +-- 7 files changed, 186 insertions(+), 74 deletions(-) diff --git a/.github/workflows/bot.yml b/.github/workflows/bot.yml index 72200c4822d..5d659123f13 100644 --- a/.github/workflows/bot.yml +++ b/.github/workflows/bot.yml @@ -245,12 +245,6 @@ jobs: - scalaProfile: "scala-2.12" sparkProfile: "spark3.4" sparkModules: "hudi-spark-datasource/hudi-spark3.4.x" - - scalaProfile: "scala-2.12" -sparkProfile: "spark3.5" -sparkModules: "hudi-spark-datasource/hudi-spark3.5.x" - - scalaProfile: "scala-2.13" -sparkProfile: "spark3.5" -sparkModules: "hudi-spark-datasource/hudi-spark3.5.x" steps: - uses: actions/checkout@v3 @@ -285,7 +279,6 @@ jobs: SCALA_PROFILE: ${{ matrix.scalaProfile }} SPARK_PROFILE: ${{ matrix.sparkProfile }} SPARK_MODULES: ${{ matrix.sparkModules }} -if: ${{ !endsWith(env.SPARK_PROFILE, '3.2') }} # skip test spark 3.2 as it's covered by Azure CI run: mvn test -Punit-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -DwildcardSuites=skipScalaTests -DfailIfNoTests=false -pl "hudi-common,$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS - name: Java FT - Spark @@ -293,7 +286,6 @@ jobs: SCALA_PROFILE: 
${{ matrix.scalaProfile }} SPARK_PROFILE: ${{ matrix.sparkProfile }} SPARK_MODULES: ${{ matrix.sparkModules }} -if: ${{ !endsWith(env.SPARK_PROFILE, '3.2') }} # skip test spark 3.2 as it's covered by Azure CI run: mvn test -Pfunctional-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -pl "$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS @@ -308,6 +300,49 @@ jobs: - scalaProfile: "scala-2.12" sparkProfile: "spark3.4" sparkModules: "hudi-spark-datasource/hudi-spark3.4.x" + +steps: + - uses: actions/checkout@v3 + - name: Set up JDK 8 +uses: actions/setup-java@v3 +with: + java-version: '8' + distribution: 'temurin' + architecture: x64 + cache: maven + - name: Build Project +env: + SCALA_PROFILE: ${{ matrix.scalaProfile }} + SPARK_PROFILE: ${{ matrix.sparkProfile }} +run: + mvn clean install -T 2 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -DskipTests=true $MVN_ARGS -am -pl "hudi-examples/hudi-examples-spark,hudi-common,$SPARK_COMMON_MODULES,$SPARK_MODULES" + - name: Set up JDK 17 +uses: actions/setup-java@v3 +with: + java-version: '17' + distribution: 'temurin' + architecture: x64 + cache: maven + - name: Scala UT - Common & Spark +env: + SCALA_PROFILE: ${{ matrix.scalaProfile }} + SPARK_PROFILE: ${{ matrix.sparkProfile }} + SPARK_MODULES: ${{ matrix.sparkModules }} +run: + mvn test -Punit-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -Dtest=skipJavaTests -DfailIfNoTests=false -pl "hudi-common,$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS + - name: Scala FT - Spark +env: + SCALA_PROFILE: ${{ matrix.scalaProfile }} + SPARK_PROFILE: ${{ matrix.sparkProfile }} + SPARK_MODULES: ${{ matrix.sparkModules }} +run: + mvn test -Pfunctional-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -Dtest=skipJavaTests -DfailIfNoTests=false -pl "$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS + + test-spark-java11-17-java-tests: +runs-on: ubuntu-latest +strategy: + matrix: +include: - scalaProfile: "scala-2.12" sparkProfile: "spark3.5" spa
(hudi) branch master updated (3152e47876f -> 4b7e6e41573)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 3152e47876f [MINOR] Bump JUnit version to 5.8.2 (#11511) add 4b7e6e41573 [HUDI-7922] Add Hudi CLI bundle for Scala 2.13 (#11495) No new revisions were added by this update. Summary of changes: .github/workflows/bot.yml | 2 +- .../apache/hudi/cli/commands/ArchivedCommitsCommand.java | 8 +--- .../org/apache/hudi/cli/commands/CompactionCommand.java | 15 --- .../scala/org/apache/hudi/util/JavaScalaConverters.scala | 8 scripts/release/deploy_staging_jars.sh| 3 ++- scripts/release/validate_staged_bundles.sh| 2 +- 6 files changed, 25 insertions(+), 13 deletions(-)
[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0
[ https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7882: -- Sprint: 2024/06/17-30 > Umbrella ticket to track all changes required to support reading 1.x tables > with 0.16.0 > > > Key: HUDI-7882 > URL: https://issues.apache.org/jira/browse/HUDI-7882 > Project: Apache Hudi > Issue Type: Improvement > Components: reader-core > Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > Fix For: 0.16.0 > > > We wanted to support reading 1.x tables in 0.16.0 release. So, creating this > umbrella ticket to track all of them. > > Changes required to be ported: > 0. Creating 0.16.0 branch > 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. > > 1. Timeline > 1.a Hoodie instant parsing should be able to read 1.x instants. > https://issues.apache.org/jira/browse/HUDI-7883 Sagar. > 1.b Commit metadata parsing is able to handle both json and avro formats. > Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 > Siva. > 1.c HoodieDefaultTimeline able to read both timelines based on table version. > https://issues.apache.org/jira/browse/HUDI-7884 Siva. > 1.d Reading LSM timeline using 0.16.0 > https://issues.apache.org/jira/browse/HUDI-7890 Siva. > 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901 > > 2. Table property changes > 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885 > https://issues.apache.org/jira/browse/HUDI-7865 LJ > > 3. MDT table changes > 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ > 3.b MDT payload schema changes. > https://issues.apache.org/jira/browse/HUDI-7886 LJ > > 4. Log format changes > 4.a All metadata header types porting > https://issues.apache.org/jira/browse/HUDI-7887 Jon > 4.b Meaningful error for incompatible features from 1.x > https://issues.apache.org/jira/browse/HUDI-7888 Jon > > 5. 
Log file slice or grouping detection compatibility > > 5. Tests > 5.a Tests to validate that 1.x tables can be read w/ 0.16.0 > https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. > > 6 Doc changes > 6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. > https://issues.apache.org/jira/browse/HUDI-7889 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0
[ https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7882: -- Fix Version/s: 0.16.0 > Umbrella ticket to track all changes required to support reading 1.x tables > with 0.16.0 > > > Key: HUDI-7882 > URL: https://issues.apache.org/jira/browse/HUDI-7882 > Project: Apache Hudi > Issue Type: Improvement > Components: reader-core > Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > Fix For: 0.16.0 > > > We wanted to support reading 1.x tables in 0.16.0 release. So, creating this > umbrella ticket to track all of them. > > Changes required to be ported: > 0. Creating 0.16.0 branch > 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. > > 1. Timeline > 1.a Hoodie instant parsing should be able to read 1.x instants. > https://issues.apache.org/jira/browse/HUDI-7883 Sagar. > 1.b Commit metadata parsing is able to handle both json and avro formats. > Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 > Siva. > 1.c HoodieDefaultTimeline able to read both timelines based on table version. > https://issues.apache.org/jira/browse/HUDI-7884 Siva. > 1.d Reading LSM timeline using 0.16.0 > https://issues.apache.org/jira/browse/HUDI-7890 Siva. > 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901 > > 2. Table property changes > 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885 > https://issues.apache.org/jira/browse/HUDI-7865 LJ > > 3. MDT table changes > 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ > 3.b MDT payload schema changes. > https://issues.apache.org/jira/browse/HUDI-7886 LJ > > 4. Log format changes > 4.a All metadata header types porting > https://issues.apache.org/jira/browse/HUDI-7887 Jon > 4.b Meaningful error for incompatible features from 1.x > https://issues.apache.org/jira/browse/HUDI-7888 Jon > > 5. 
Log file slice or grouping detection compatibility > > 5. Tests > 5.a Tests to validate that 1.x tables can be read w/ 0.16.0 > https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. > > 6 Doc changes > 6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. > https://issues.apache.org/jira/browse/HUDI-7889 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0
[ https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-7882: - Assignee: sivabalan narayanan > Umbrella ticket to track all changes required to support reading 1.x tables > with 0.16.0 > > > Key: HUDI-7882 > URL: https://issues.apache.org/jira/browse/HUDI-7882 > Project: Apache Hudi > Issue Type: Improvement > Components: reader-core > Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > > We wanted to support reading 1.x tables in 0.16.0 release. So, creating this > umbrella ticket to track all of them. > > Changes required to be ported: > 0. Creating 0.16.0 branch > 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. > > 1. Timeline > 1.a Hoodie instant parsing should be able to read 1.x instants. > https://issues.apache.org/jira/browse/HUDI-7883 Sagar. > 1.b Commit metadata parsing is able to handle both json and avro formats. > Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 > Siva. > 1.c HoodieDefaultTimeline able to read both timelines based on table version. > https://issues.apache.org/jira/browse/HUDI-7884 Siva. > 1.d Reading LSM timeline using 0.16.0 > https://issues.apache.org/jira/browse/HUDI-7890 Siva. > 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901 > > 2. Table property changes > 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885 > https://issues.apache.org/jira/browse/HUDI-7865 LJ > > 3. MDT table changes > 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ > 3.b MDT payload schema changes. > https://issues.apache.org/jira/browse/HUDI-7886 LJ > > 4. Log format changes > 4.a All metadata header types porting > https://issues.apache.org/jira/browse/HUDI-7887 Jon > 4.b Meaningful error for incompatible features from 1.x > https://issues.apache.org/jira/browse/HUDI-7888 Jon > > 5. 
Log file slice or grouping detection compatibility > > 5. Tests > 5.a Tests to validate that 1.x tables can be read w/ 0.16.0 > https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. > > 6 Doc changes > 6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. > https://issues.apache.org/jira/browse/HUDI-7889 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7921) Chase down memory leaks in Writeclient with MDT enabled
sivabalan narayanan created HUDI-7921: - Summary: Chase down memory leaks in Writeclient with MDT enabled Key: HUDI-7921 URL: https://issues.apache.org/jira/browse/HUDI-7921 Project: Apache Hudi Issue Type: Improvement Components: metadata Reporter: sivabalan narayanan We see OOMs when deltastreamer runs continuously for days on end. We suspect some memory leaks when the metadata table is enabled. Let's chase down all of them and fix them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7921) Chase down memory leaks in Writeclient with MDT enabled
[ https://issues.apache.org/jira/browse/HUDI-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-7921: - Assignee: sivabalan narayanan > Chase down memory leaks in Writeclient with MDT enabled > --- > > Key: HUDI-7921 > URL: https://issues.apache.org/jira/browse/HUDI-7921 > Project: Apache Hudi > Issue Type: Improvement > Components: metadata > Reporter: sivabalan narayanan > Assignee: sivabalan narayanan > Priority: Major > Fix For: 1.0.0 > > > We see OOMs when deltastreamer runs continuously for days on end. We > suspect some memory leaks when the metadata table is enabled. Let's chase > down all of them and fix them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7921) Chase down memory leaks in Writeclient with MDT enabled
[ https://issues.apache.org/jira/browse/HUDI-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7921: -- Fix Version/s: 1.0.0 > Chase down memory leaks in Writeclient with MDT enabled > --- > > Key: HUDI-7921 > URL: https://issues.apache.org/jira/browse/HUDI-7921 > Project: Apache Hudi > Issue Type: Improvement > Components: metadata > Reporter: sivabalan narayanan > Priority: Major > Fix For: 1.0.0 > > > We see OOMs when deltastreamer runs continuously for days on end. We > suspect some memory leaks when the metadata table is enabled. Let's chase > down all of them and fix them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
(hudi) branch master updated: [MINOR][DNM] Test disabling new HFile reader (#11488)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 1ce97bae116 [MINOR][DNM] Test disabling new HFile reader (#11488) 1ce97bae116 is described below commit 1ce97bae11655c9a33f8665c3dd53116302686ee Author: Y Ethan Guo AuthorDate: Fri Jun 21 18:44:15 2024 -0700 [MINOR][DNM] Test disabling new HFile reader (#11488) --- .../src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java b/hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java index bb29e090ec3..a7e41098d66 100644 --- a/hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java +++ b/hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java @@ -31,7 +31,7 @@ import javax.annotation.concurrent.Immutable; public class HoodieReaderConfig extends HoodieConfig { public static final ConfigProperty USE_NATIVE_HFILE_READER = ConfigProperty .key("_hoodie.hfile.use.native.reader") - .defaultValue(true) + .defaultValue(false) .markAdvanced() .sinceVersion("1.0.0") .withDocumentation("When enabled, the native HFile reader is used to read HFiles. This is an internal config.");
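Editor's note: the one-line change above flips the default of `_hoodie.hfile.use.native.reader` from `true` to `false`, so any deployment that never sets the key silently falls back to the legacy HFile reader. A minimal sketch of that default-resolution behavior follows; it mimics, but does not reproduce, Hudi's `ConfigProperty` pattern, and the helper name is hypothetical.

```java
import java.util.Properties;

// Why a flipped default matters: readers that never set the key switch code
// paths with no config change on their side.
public class ConfigDefaultDemo {
    static boolean useNativeHFileReader(Properties props) {
        // After this commit the default is "false": unset configs take the
        // legacy HFile reader path.
        return Boolean.parseBoolean(
            props.getProperty("_hoodie.hfile.use.native.reader", "false"));
    }

    public static void main(String[] args) {
        Properties unset = new Properties();
        System.out.println(useNativeHFileReader(unset));      // prints false (default path)
        Properties explicit = new Properties();
        explicit.setProperty("_hoodie.hfile.use.native.reader", "true");
        System.out.println(useNativeHFileReader(explicit));   // prints true (explicit opt-in)
    }
}
```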
[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0
[ https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7882: -- Description: We wanted to support reading 1.x tables in 0.16.0 release. So, creating this umbrella ticket to track all of them. Changes required to be ported: 0. Creating 0.16.0 branch 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 1. Timeline 1.a Hoodie instant parsing should be able to read 1.x instants. https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 1.b Commit metadata parsing is able to handle both json and avro formats. Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 Siva. 1.c HoodieDefaultTimeline able to read both timelines based on table version. https://issues.apache.org/jira/browse/HUDI-7884 Siva. 1.d Reading LSM timeline using 0.16.0 https://issues.apache.org/jira/browse/HUDI-7890 Siva. 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901 2. Table property changes 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885 https://issues.apache.org/jira/browse/HUDI-7865 LJ 3. MDT table changes 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ 3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 LJ 4. Log format changes 4.a All metadata header types porting https://issues.apache.org/jira/browse/HUDI-7887 Jon 4.b Meaningful error for incompatible features from 1.x https://issues.apache.org/jira/browse/HUDI-7888 Jon 5. Log file slice or grouping detection compatibility 5. Tests 5.a Tests to validate that 1.x tables can be read w/ 0.16.0 https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 6 Doc changes 6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. https://issues.apache.org/jira/browse/HUDI-7889 was: We wanted to support reading 1.x tables in 0.16.0 release. So, creating this umbrella ticket to track all of them. 
Changes required to be ported: 0. Creating 0.16.0 branch 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 1. Timeline 1.a Hoodie instant parsing should be able to read 1.x instants. https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 1.b Commit metadata parsing is able to handle both json and avro formats. Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 Siva. 1.c HoodieDefaultTimeline able to read both timelines based on table version. https://issues.apache.org/jira/browse/HUDI-7884 Siva. 1.d Reading LSM timeline using 0.16.0 https://issues.apache.org/jira/browse/HUDI-7890 Siva. 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901 2. Table property changes 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885 https://issues.apache.org/jira/browse/HUDI-7865 LJ 3. MDT table changes 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ 3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 LJ 4. Log format changes 4.a All metadata header types porting https://issues.apache.org/jira/browse/HUDI-7887 Jon 4.b Meaningful error for incompatible features from 1.x https://issues.apache.org/jira/browse/HUDI-7888 Jon 5. Tests 5.a Tests to validate that 1.x tables can be read w/ 0.16.0 https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 6 Doc changes 6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. https://issues.apache.org/jira/browse/HUDI-7889 > Umbrella ticket to track all changes required to support reading 1.x tables > with 0.16.0 > > > Key: HUDI-7882 > URL: https://issues.apache.org/jira/browse/HUDI-7882 > Project: Apache Hudi > Issue Type: Improvement > Components: reader-core > Reporter: sivabalan narayanan >Priority: Major > > We wanted to support reading 1.x tables in 0.16.0 release. So, creating this > umbrella ticket to track all of them. > > Changes required to be ported: > 0. 
Creating 0.16.0 branch > 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. > > 1. Timeline > 1.a Hoodie instant parsing should be able to read 1.x instants. > https://issues.apache.org/jira/browse/HUDI-7883 Sagar. > 1.b Commit metadata parsing is able to handle both json and avro formats. > Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 > Siva. > 1.c HoodieDefaultTimeline able to read both timelines based on table version. > https://issues.apache.org/jira/browse/HUDI-7884 Siva. > 1.d Reading LSM timeline using 0.16.0 > https://i
[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0
[ https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7882:
--------------------------------------
Description:

We want to support reading 1.x tables in the 0.16.0 release, so this umbrella ticket tracks all the changes that need to be ported:

0. Create the 0.16.0 branch
  0.a https://issues.apache.org/jira/browse/HUDI-7860 -- Completed.
1. Timeline
  1.a Hoodie instant parsing should be able to read 1.x instants. https://issues.apache.org/jira/browse/HUDI-7883 (Sagar)
  1.b Commit metadata parsing should handle both JSON and Avro formats; the scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 (Siva)
  1.c HoodieDefaultTimeline should read either timeline based on the table version. https://issues.apache.org/jira/browse/HUDI-7884 (Siva)
  1.d Reading the LSM timeline with 0.16.0. https://issues.apache.org/jira/browse/HUDI-7890 (Siva)
  1.e Ensure the 1.0 MDT timeline is readable by 0.16. HUDI-7901
2. Table property changes
  2.a Table property changes. https://issues.apache.org/jira/browse/HUDI-7885 https://issues.apache.org/jira/browse/HUDI-7865 (LJ)
3. MDT table changes
  3.a Record positions in the RLI. https://issues.apache.org/jira/browse/HUDI-7877 (LJ)
  3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 (LJ)
4. Log format changes
  4.a Port all metadata header types. https://issues.apache.org/jira/browse/HUDI-7887 (Jon)
  4.b Meaningful errors for incompatible 1.x features. https://issues.apache.org/jira/browse/HUDI-7888 (Jon)
5. Tests
  5.a Tests validating that 1.x tables can be read with 0.16.0. https://issues.apache.org/jira/browse/HUDI-7896 (Siva and Sagar)
6. Doc changes
  6.a Call out unsupported features in the 0.16.0 reader when reading 1.x tables. https://issues.apache.org/jira/browse/HUDI-7889

(was: the same description, with item 1.a reading "Commit instant parsing")

> Umbrella ticket to track all changes required to support reading 1.x tables
> with 0.16.0
> ---------------------------------------------------------------------------
>
>            Key: HUDI-7882
>            URL: https://issues.apache.org/jira/browse/HUDI-7882
>        Project: Apache Hudi
>     Issue Type: Improvement
>     Components: reader-core
>       Reporter: sivabalan narayanan
>       Priority: Major
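Item 1.b (HUDI-7866) asks that commit metadata parsing handle both the JSON layout written by 0.x and the Avro layout written by 1.x. A minimal sketch of the dual-format idea follows; every class and method name here is invented for illustration, and the format sniff is deliberately crude (the real work lives in HUDI-7866, not here):

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: try JSON first (0.x layout), otherwise treat the
// bytes as Avro (1.x layout). Real Hudi uses its own commit metadata readers.
public class CommitMetadataParser {

    /** Crude format sniff: JSON commit metadata starts with '{'. */
    static boolean looksLikeJson(byte[] bytes) {
        for (byte b : bytes) {
            if (!Character.isWhitespace((char) (b & 0xFF))) {
                return b == '{';
            }
        }
        return false;
    }

    /** Returns a tag describing which branch a real parser would take. */
    static String parse(byte[] bytes) {
        if (looksLikeJson(bytes)) {
            // A real reader would hand this to a JSON deserializer here.
            return "json:" + new String(bytes, StandardCharsets.UTF_8).trim();
        }
        // A real reader would hand this to an Avro datum reader here.
        return "avro:" + bytes.length + " bytes";
    }

    public static void main(String[] args) {
        System.out.println(parse("  {\"operationType\":\"UPSERT\"}".getBytes(StandardCharsets.UTF_8)));
        System.out.println(parse(new byte[] {0x4f, 0x62, 0x6a, 0x01})); // Avro container magic "Obj\1"
    }
}
```

In practice the table version, not byte sniffing, would likely drive the choice; the sniff above only keeps the sketch self-contained.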
[jira] [Updated] (HUDI-7896) Add tests to validate 1.x tables are readable with 0.16.0 hudi
[ https://issues.apache.org/jira/browse/HUDI-7896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7896:
--------------------------------------
Description:

We need tests added to 0.16.0 Hudi to ensure it can read 1.x tables.

We should have good coverage for all new log formats; different table types; pending compactions; pending and completed clustering; failed commits/delta commits; pending rollbacks; pending cleans; different payloads; key generators; partitioned and non-partitioned tables; and both regular and global indexes.

Cover all the different writers:
a. Spark datasource writer
b. Spark SQL
c. DeltaStreamer
d. Spark streaming writer

Reader-side validation: let's account for snapshot reads, incremental reads, and time-travel reads.

(was: the same description, without "regular index and global index")

> Add tests to validate 1.x tables are readable with 0.16.0 hudi
> --------------------------------------------------------------
>
>            Key: HUDI-7896
>            URL: https://issues.apache.org/jira/browse/HUDI-7896
>        Project: Apache Hudi
>     Issue Type: Improvement
>     Components: tests-ci
>       Reporter: sivabalan narayanan
>       Assignee: sivabalan narayanan
>       Priority: Major

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7894) Umbrella ticket for 0.16.0 to 1.0.0 upgrade
[ https://issues.apache.org/jira/browse/HUDI-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7894:
--------------------------------------
Fix Version/s: 1.0.0

> Umbrella ticket for 0.16.0 to 1.0.0 upgrade
> -------------------------------------------
>
>            Key: HUDI-7894
>            URL: https://issues.apache.org/jira/browse/HUDI-7894
>        Project: Apache Hudi
>     Issue Type: Task
>       Reporter: Sagar Sumit
>       Priority: Major
>        Fix For: 1.0.0
>
> # Rewrite archived timeline as LSM timeline - HUDI-7858
> # Test backward compatibility for table properties/timeline/log format changes - create 1.0 tables as test artifacts - HUDI-7898
> # Per log file marker - HUDI-1517
[jira] [Updated] (HUDI-7863) Ensure full compaction before downgrade for unsupported features
[ https://issues.apache.org/jira/browse/HUDI-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7863:
--------------------------------------
Fix Version/s: 1.0.0

> Ensure full compaction before downgrade for unsupported features
> ----------------------------------------------------------------
>
>            Key: HUDI-7863
>            URL: https://issues.apache.org/jira/browse/HUDI-7863
>        Project: Apache Hudi
>     Issue Type: Task
>       Reporter: Sagar Sumit
>       Priority: Major
>        Fix For: 1.0.0
>
> While downgrading from 1.0 to 0.16.0, let's check the log headers for the presence of unsupported features (partial updates, deletion vectors) and trigger a full compaction.
> We also need a custom compaction strategy as part of this.
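The downgrade pre-check described in HUDI-7863 could be sketched roughly as below. The types and the "has unsupported log features" flag are placeholders for illustration, not Hudi's actual compaction-strategy API; the point is only that every file slice carrying 1.x-only log features must be compacted into a base file before downgrade:

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch of a downgrade-oriented compaction selection: pick every
// file slice whose log files carry 1.x-only features (partial updates,
// deletion vectors) so they get compacted before the table is downgraded.
public class DowngradeCompactionPlanner {

    // Hypothetical stand-in for a real file-slice abstraction.
    record FileSlice(String fileId, boolean hasUnsupportedLogFeatures) {}

    /** Select the slices that must be fully compacted before downgrade. */
    static List<String> selectForCompaction(List<FileSlice> slices) {
        return slices.stream()
            .filter(FileSlice::hasUnsupportedLogFeatures)
            .map(FileSlice::fileId)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<FileSlice> slices = List.of(
            new FileSlice("fg-1", false),
            new FileSlice("fg-2", true),   // e.g. a log block with a deletion vector
            new FileSlice("fg-3", true));  // e.g. a partial-update log block
        System.out.println(selectForCompaction(slices)); // [fg-2, fg-3]
    }
}
```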
[jira] [Updated] (HUDI-7863) Ensure full compaction before downgrade for unsupported features
[ https://issues.apache.org/jira/browse/HUDI-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7863:
--------------------------------------
Description:

While downgrading from 1.0 to 0.16.0, let's check the log headers for the presence of unsupported features (partial updates, deletion vectors) and trigger a full compaction. We also need a custom compaction strategy as part of this.

> Ensure full compaction before downgrade for unsupported features
> ----------------------------------------------------------------
>
>            Key: HUDI-7863
>            URL: https://issues.apache.org/jira/browse/HUDI-7863
>        Project: Apache Hudi
>     Issue Type: Task
>       Reporter: Sagar Sumit
>       Priority: Major
[jira] [Updated] (HUDI-7863) Ensure full compaction before downgrade for unsupported features
[ https://issues.apache.org/jira/browse/HUDI-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7863:
--------------------------------------
Summary: Ensure full compaction before downgrade for unsupported features (was: Ensure full compaction before downgrade)

> Ensure full compaction before downgrade for unsupported features
> ----------------------------------------------------------------
>
>            Key: HUDI-7863
>            URL: https://issues.apache.org/jira/browse/HUDI-7863
>        Project: Apache Hudi
>     Issue Type: Task
>       Reporter: Sagar Sumit
>       Priority: Major
[jira] [Updated] (HUDI-7896) Add tests to validate 1.x tables are readable with 0.16.0 hudi
[ https://issues.apache.org/jira/browse/HUDI-7896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7896:
--------------------------------------
Description:

We need tests added to 0.16.0 Hudi to ensure it can read 1.x tables.

We should have good coverage for all new log formats; different table types; pending compactions; pending and completed clustering; failed commits/delta commits; pending rollbacks; pending cleans; different payloads; key generators; and partitioned and non-partitioned tables.

Cover all the different writers:
a. Spark datasource writer
b. Spark SQL
c. DeltaStreamer
d. Spark streaming writer

Reader-side validation: let's account for snapshot reads, incremental reads, and time-travel reads.

(was: the same description, without the payload/key-generator/partitioning coverage and the writer list)

> Add tests to validate 1.x tables are readable with 0.16.0 hudi
> --------------------------------------------------------------
>
>            Key: HUDI-7896
>            URL: https://issues.apache.org/jira/browse/HUDI-7896
>        Project: Apache Hudi
>     Issue Type: Improvement
>     Components: tests-ci
>       Reporter: sivabalan narayanan
>       Assignee: sivabalan narayanan
>       Priority: Major
[jira] [Updated] (HUDI-7898) Add tests to validate 0.x tables are readable with 1.x hudi
[ https://issues.apache.org/jira/browse/HUDI-7898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7898:
--------------------------------------
Description:

We need tests added to 1.x Hudi to ensure it can read 0.16.0 tables.

We should have good coverage for all new log formats; different table types; pending compactions; pending and completed clustering; failed commits/delta commits; pending rollbacks; pending cleans; different payloads; key generators; and partitioned and non-partitioned tables.

Cover all the different writers:
a. Spark datasource writer
b. Spark SQL
c. DeltaStreamer
d. Spark streaming writer

Reader-side validation: let's account for snapshot reads, incremental reads, and time-travel reads.

(was: the same description, without the key-generator/partitioning coverage and the writer list)

> Add tests to validate 0.x tables are readable with 1.x hudi
> -----------------------------------------------------------
>
>            Key: HUDI-7898
>            URL: https://issues.apache.org/jira/browse/HUDI-7898
>        Project: Apache Hudi
>     Issue Type: Improvement
>     Components: tests-ci
>       Reporter: sivabalan narayanan
>       Assignee: sivabalan narayanan
>       Priority: Major
[jira] [Updated] (HUDI-7898) Add tests to validate 0.x tables are readable with 1.x hudi
[ https://issues.apache.org/jira/browse/HUDI-7898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7898:
--------------------------------------
Description:

We need tests added to 1.x Hudi to ensure it can read 0.16.0 tables.

We should have good coverage for all new log formats, different table types, pending compactions, pending and completed clustering, failed commits/delta commits, pending rollbacks, pending cleans, etc.

Reader-side validation: let's account for snapshot reads, incremental reads, and time-travel reads.

(was: the same description with the direction reversed: tests added to 0.16.0 Hudi to ensure it can read 1.x tables)

> Add tests to validate 0.x tables are readable with 1.x hudi
> -----------------------------------------------------------
>
>            Key: HUDI-7898
>            URL: https://issues.apache.org/jira/browse/HUDI-7898
>        Project: Apache Hudi
>     Issue Type: Improvement
>     Components: tests-ci
>       Reporter: sivabalan narayanan
>       Assignee: sivabalan narayanan
>       Priority: Major
[jira] [Created] (HUDI-7898) Add tests to validate 0.x tables are readable with 1.x hudi
sivabalan narayanan created HUDI-7898:
--------------------------------------
        Summary: Add tests to validate 0.x tables are readable with 1.x hudi
            Key: HUDI-7898
            URL: https://issues.apache.org/jira/browse/HUDI-7898
        Project: Apache Hudi
     Issue Type: Improvement
     Components: tests-ci
       Reporter: sivabalan narayanan
       Assignee: sivabalan narayanan

We need tests added to 0.16.0 Hudi to ensure it can read 1.x tables.

We should have good coverage for all new log formats, different table types, pending compactions, pending and completed clustering, failed commits/delta commits, pending rollbacks, pending cleans, etc.

Reader-side validation: let's account for snapshot reads, incremental reads, and time-travel reads.
[jira] [Updated] (HUDI-7898) Add tests to validate 0.x tables are readable with 1.x hudi
[ https://issues.apache.org/jira/browse/HUDI-7898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7898:
--------------------------------------
Description:

We need tests added to 1.x Hudi to ensure it can read 0.16.0 tables.

We should have good coverage for all new log formats, different table types, pending compactions, pending and completed clustering, failed commits/delta commits, pending rollbacks, pending cleans, etc., and different payloads.

Reader-side validation: let's account for snapshot reads, incremental reads, and time-travel reads.

(was: the same description, without "diff payloads")

> Add tests to validate 0.x tables are readable with 1.x hudi
> -----------------------------------------------------------
>
>            Key: HUDI-7898
>            URL: https://issues.apache.org/jira/browse/HUDI-7898
>        Project: Apache Hudi
>     Issue Type: Improvement
>     Components: tests-ci
>       Reporter: sivabalan narayanan
>       Assignee: sivabalan narayanan
>       Priority: Major
[jira] [Updated] (HUDI-7882) Umbrella ticket to track all changes required to support reading 1.x tables with 0.16.0
[ https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7882:
--------------------------------------
Description:

We want to support reading 1.x tables in the 0.16.0 release, so this umbrella ticket tracks all the changes that need to be ported:

0. Create the 0.16.0 branch
  0.a https://issues.apache.org/jira/browse/HUDI-7860 -- Completed.
1. Timeline
  1.a Commit instant parsing should be able to read 1.x instants. https://issues.apache.org/jira/browse/HUDI-7883 (Sagar)
  1.b Commit metadata parsing should handle both JSON and Avro formats; the scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 (Siva)
  1.c HoodieDefaultTimeline should read either timeline based on the table version. https://issues.apache.org/jira/browse/HUDI-7884 (Siva)
  1.d Reading the LSM timeline with 0.16.0. https://issues.apache.org/jira/browse/HUDI-7890 (Siva)
2. Table property changes
  2.a Table property changes. https://issues.apache.org/jira/browse/HUDI-7885 https://issues.apache.org/jira/browse/HUDI-7865 (LJ)
3. MDT table changes
  3.a Record positions in the RLI. https://issues.apache.org/jira/browse/HUDI-7877 (LJ)
  3.b MDT payload schema changes. https://issues.apache.org/jira/browse/HUDI-7886 (LJ)
4. Log format changes
  4.a Port all metadata header types. https://issues.apache.org/jira/browse/HUDI-7887 (Jon)
  4.b Meaningful errors for incompatible 1.x features. https://issues.apache.org/jira/browse/HUDI-7888 (Jon)
5. Tests
  5.a Tests validating that 1.x tables can be read with 0.16.0. https://issues.apache.org/jira/browse/HUDI-7896 (Siva and Sagar)
6. Doc changes
  6.a Call out unsupported features in the 0.16.0 reader when reading 1.x tables. https://issues.apache.org/jira/browse/HUDI-7889

(was: the same description, with item 6.a numbered as 5.a)

> Umbrella ticket to track all changes required to support reading 1.x tables
> with 0.16.0
> ---------------------------------------------------------------------------
>
>            Key: HUDI-7882
>            URL: https://issues.apache.org/jira/browse/HUDI-7882
>        Project: Apache Hudi
>     Issue Type: Improvement
>     Components: reader-core
>       Reporter: sivabalan narayanan
>       Priority: Major
[jira] [Updated] (HUDI-7884) Design HoodieDefaultTimeline to be able to read timeline both from 0.16.0 and from 1.0
[ https://issues.apache.org/jira/browse/HUDI-7884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7884:
--------------------------------------
Description:

We want to support reading 1.x tables with the 0.16.0 reader. The entire timeline will be either in the 0.x format or in the 1.x format, so let's design HoodieDefaultTimeline to parse either timeline based on the table version. We could define Hoodie0.xDefaultTimeline and Hoodie1.xDefaultTimeline and delegate to one of them internally (within HoodieDefaultTimeline).

Related PR links:
[HUDI-6776] Replace JSON with Avro bytes for commit metadata
https://github.com/apache/hudi/pull/9579

(was: the same description, without the related PR links)

> Design HoodieDefaultTimeline to be able to read timeline both from 0.16.0 and
> from 1.0
> -----------------------------------------------------------------------------
>
>            Key: HUDI-7884
>            URL: https://issues.apache.org/jira/browse/HUDI-7884
>        Project: Apache Hudi
>     Issue Type: Improvement
>     Components: reader-core
>       Reporter: sivabalan narayanan
>       Assignee: sivabalan narayanan
>       Priority: Major
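The delegation idea in HUDI-7884 can be sketched as below. All type names are illustrative (the ticket's proposed names like Hoodie0.xDefaultTimeline are not even valid Java identifiers), and the assumption that 1.x corresponds to a higher table version (shown here as 8) is mine, not the ticket's:

```java
import java.util.List;

// Illustrative sketch of a timeline facade that delegates to a concrete
// reader chosen once, based on the table version. Not the real Hudi API.
interface TimelineReader {
    List<String> listInstants();
}

class LegacyTimelineReader implements TimelineReader {        // 0.x timeline layout
    public List<String> listInstants() { return List.of("commit@0.x"); }
}

class OneDotXTimelineReader implements TimelineReader {       // 1.x timeline layout
    public List<String> listInstants() { return List.of("commit@1.x"); }
}

public class DefaultTimeline implements TimelineReader {
    private final TimelineReader delegate;

    public DefaultTimeline(int tableVersion) {
        // Pick the concrete reader from the table version (threshold assumed).
        this.delegate = tableVersion >= 8 ? new OneDotXTimelineReader()
                                          : new LegacyTimelineReader();
    }

    public List<String> listInstants() { return delegate.listInstants(); }

    public static void main(String[] args) {
        System.out.println(new DefaultTimeline(6).listInstants()); // [commit@0.x]
        System.out.println(new DefaultTimeline(8).listInstants()); // [commit@1.x]
    }
}
```

Callers only ever see the facade, so the version dispatch stays in one place.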
[jira] [Assigned] (HUDI-7890) Ensure 0.16.0 reader can read LSM timeline from 1.x
[ https://issues.apache.org/jira/browse/HUDI-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan reassigned HUDI-7890:
-----------------------------------------
Assignee: sivabalan narayanan

> Ensure 0.16.0 reader can read LSM timeline from 1.x
> ---------------------------------------------------
>
>            Key: HUDI-7890
>            URL: https://issues.apache.org/jira/browse/HUDI-7890
>        Project: Apache Hudi
>     Issue Type: Improvement
>     Components: reader-core
>       Reporter: sivabalan narayanan
>       Assignee: sivabalan narayanan
>       Priority: Major
>
> We are looking to support reading 1.x tables from 0.16.0 readers, so let's ensure the LSM timeline reader is supported in 0.16.0.
[jira] [Updated] (HUDI-7888) Throw meaningful error when reading partial update or DV written in 1.x from 0.16.0 reader
[ https://issues.apache.org/jira/browse/HUDI-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7888:
--------------------------------------
Description:

We want to support reading 1.x tables with the 0.16.0 reader.

If a 1.x table does not have any backwards-incompatible new features enabled, we are good. If someone has enabled the partial update feature or deletion vector support, however, the 0.16.0 reader should detect that and throw a meaningful error. Let's also comb for any other additional 1.x features and throw meaningful errors for those.

(was: the same description, without the final sentence about combing for additional 1.x features)

> Throw meaningful error when reading partial update or DV written in 1.x from
> 0.16.0 reader
> ----------------------------------------------------------------------------
>
>            Key: HUDI-7888
>            URL: https://issues.apache.org/jira/browse/HUDI-7888
>        Project: Apache Hudi
>     Issue Type: Improvement
>     Components: reader-core
>       Reporter: sivabalan narayanan
>       Assignee: Jonathan Vexler
>       Priority: Major
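The check HUDI-7888 describes amounts to a guard over log block headers that fails fast with a clear message. The header keys and exception type below are assumptions for illustration, not the actual Hudi identifiers:

```java
import java.util.Map;
import java.util.Set;

// Illustrative sketch: fail fast with a meaningful error when a log block
// carries a 1.x-only feature that a 0.16.0 reader cannot interpret.
public class UnsupportedFeatureGuard {

    // Hypothetical header keys for features introduced in 1.x.
    static final Set<String> UNSUPPORTED = Set.of("IS_PARTIAL", "DELETION_VECTOR");

    static void validate(Map<String, String> logBlockHeader) {
        for (String key : UNSUPPORTED) {
            if (logBlockHeader.containsKey(key)) {
                throw new UnsupportedOperationException(
                    "Log block uses 1.x feature '" + key
                        + "', which this reader does not support. "
                        + "Fully compact the table before reading it with 0.16.0.");
            }
        }
    }

    public static void main(String[] args) {
        validate(Map.of("INSTANT_TIME", "20240712000000"));   // passes silently
        try {
            validate(Map.of("DELETION_VECTOR", "true"));
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The value of the guard is entirely in the message: it names the offending feature and what the user can do about it, instead of surfacing a parse failure deep in the read path.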