Re: [PR] [HUDI-7815] Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh timeline [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11365:
URL: https://github.com/apache/hudi/pull/11365#issuecomment-2138810841

   
   ## CI report:
   
   * 8147454d905761bd2256aac273ef69aa1e56fba8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11162:
URL: https://github.com/apache/hudi/pull/11162#issuecomment-2138810306

   
   ## CI report:
   
   * 9d0e80222f6cc69b2dba6f4cdbfc642f31a95e52 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24124)
   * a602c9c4234062e66877fc4bf2c50f94f43767bc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11146:
URL: https://github.com/apache/hudi/pull/11146#issuecomment-2138810132

   
   ## CI report:
   
   * 470bc5f44e7a6658a8717ef1b77e92afcdd90087 UNKNOWN
   * a602c9c4234062e66877fc4bf2c50f94f43767bc UNKNOWN
   * 43f73661f79eb87ac52d29fa153b996a15f29b99 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11043:
URL: https://github.com/apache/hudi/pull/11043#issuecomment-2138809912

   
   ## CI report:
   
   * 541b544049e68b3d22cdf0f5159fbd9b0005d345 UNKNOWN
   * 87c15b2c23430d967749dede5e09d74a33dcce88 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24128)
   * 6ece7645a69b367901c71ab78dea15f39d69fca5 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]

2024-05-29 Thread via GitHub


codope commented on code in PR #11146:
URL: https://github.com/apache/hudi/pull/11146#discussion_r1620048869


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieIndexDefinition.java:
##
@@ -45,14 +49,14 @@ public class HoodieFunctionalIndexDefinition implements Serializable {
   // Any other configuration or properties specific to the index
   private Map<String, String> indexOptions;
 
-  public HoodieFunctionalIndexDefinition() {
+  public HoodieIndexDefinition() {
   }
 
-  public HoodieFunctionalIndexDefinition(String indexName, String indexType, String indexFunction, List<String> sourceFields,
-                                         Map<String, String> indexOptions) {
+  public HoodieIndexDefinition(String indexName, String indexType, String indexFunction, List<String> sourceFields,
+                               Map<String, String> indexOptions) {
     this.indexName = indexName;
     this.indexType = indexType;
-    this.indexFunction = indexFunction;
+    this.indexFunction = nonEmpty(indexFunction) ? indexFunction : SPARK_IDENTITY;

Review Comment:
   On second thought, it should not bind to a Spark function. I will correct it.
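
   For illustration, a minimal engine-neutral sketch of that fallback (the IDENTITY_FUNCTION constant name is an assumption, not the PR's final code):

       // Hypothetical: default to an engine-agnostic identity function
       // instead of a Spark-specific one when no function is supplied.
       private static final String IDENTITY_FUNCTION = "identity";

       this.indexFunction = nonEmpty(indexFunction) ? indexFunction : IDENTITY_FUNCTION;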






[PR] [HUDI-7815] Multiple writer with bulkinsert getAllPendingClusteringPl… [hudi]

2024-05-29 Thread via GitHub


xuzifu666 opened a new pull request, #11365:
URL: https://github.com/apache/hudi/pull/11365

   ### Change Logs
   
   With multiple writers using bulk insert, getAllPendingClusteringPlans should refresh the timeline; otherwise instantiating the file system view can fail with:
   
   Caused by: org.apache.hudi.exception.HoodieException: Error getting all file groups in pending clustering
       at org.apache.hudi.common.util.ClusteringUtils.getAllFileGroupsInPendingClusteringPlans(ClusteringUtils.java:135)
       at org.apache.hudi.common.table.view.AbstractTableFileSystemView.init(AbstractTableFileSystemView.java:113)
       at org.apache.hudi.common.table.view.HoodieTableFileSystemView.init(HoodieTableFileSystemView.java:108)
       at org.apache.hudi.common.table.view.HoodieTableFileSystemView.<init>(HoodieTableFileSystemView.java:102)
       at org.apache.hudi.common.table.view.HoodieTableFileSystemView.<init>(HoodieTableFileSystemView.java:93)
       at org.apache.hudi.metadata.HoodieMetadataFileSystemView.<init>(HoodieMetadataFileSystemView.java:44)
       at org.apache.hudi.common.table.view.FileSystemViewManager.createInMemoryFileSystemView(FileSystemViewManager.java:166)
       at org.apache.hudi.common.table.view.FileSystemViewManager.lambda$createViewManager$5fcdabfe$1(FileSystemViewManager.java:259)
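
   A minimal sketch of the intended behavior (reloadActiveTimeline() and ClusteringUtils.getAllPendingClusteringPlans(...) are existing Hudi APIs; the exact change in this PR may differ):

       // Refresh the timeline first so clustering plans committed by concurrent
       // writers are visible, then read the pending plans from the fresh view.
       metaClient.reloadActiveTimeline();
       Stream<Pair<HoodieInstant, HoodieClusteringPlan>> pendingPlans =
           ClusteringUtils.getAllPendingClusteringPlans(metaClient);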
   
   ### Impact
   
   low
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Updated] (HUDI-7815) Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh timeline

2024-05-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7815:
-
Labels: pull-request-available  (was: )

> Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh 
> timeline
> 
>
> Key: HUDI-7815
> URL: https://issues.apache.org/jira/browse/HUDI-7815
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7815) Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh timeline

2024-05-29 Thread xy (Jira)
xy created HUDI-7815:


 Summary: Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh timeline
 Key: HUDI-7815
 URL: https://issues.apache.org/jira/browse/HUDI-7815
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark-sql
Reporter: xy
Assignee: xy








Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]

2024-05-29 Thread via GitHub


codope commented on code in PR #11146:
URL: https://github.com/apache/hudi/pull/11146#discussion_r1620002871


##
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##
@@ -351,6 +351,25 @@ public final class HoodieMetadataConfig extends HoodieConfig {
       .sinceVersion("1.0.0")
       .withDocumentation("Parallelism to use, when generating partition stats index.");
 
+  public static final ConfigProperty<Boolean> SECONDARY_INDEX_ENABLE_PROP = ConfigProperty
+      .key(METADATA_PREFIX + ".index.secondary.enable")

Review Comment:
   `enable` is consistent with other configs such as `hoodie.metadata.enable` or `hoodie.metadata.record.index.enable`.
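
   For context, a hedged sketch of how such a flag is typically declared with the ConfigProperty builder shown above (the default value and documentation text here are assumptions, not the PR's exact definition):

       public static final ConfigProperty<Boolean> SECONDARY_INDEX_ENABLE_PROP = ConfigProperty
           .key(METADATA_PREFIX + ".index.secondary.enable")
           .defaultValue(false)
           .sinceVersion("1.0.0")
           .withDocumentation("Enable secondary index within the metadata table.");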



##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -1833,12 +1837,153 @@ public static HoodieData<HoodieRecord> readRecordKeysFromFileSlices(HoodieEngine
     });
   }
 
-  public static Schema getProjectedSchemaForFunctionalIndex(HoodieFunctionalIndexDefinition indexDefinition, HoodieTableMetaClient metaClient) throws Exception {
+  public static Schema getProjectedSchemaForFunctionalIndex(HoodieIndexDefinition indexDefinition, HoodieTableMetaClient metaClient) throws Exception {
     TableSchemaResolver schemaResolver = new TableSchemaResolver(metaClient);
     Schema tableSchema = schemaResolver.getTableAvroSchema();
     return addMetadataFields(getSchemaForFields(tableSchema, indexDefinition.getSourceFields()));
   }
 
+  public static HoodieData<HoodieRecord> readSecondaryKeysFromBaseFiles(HoodieEngineContext engineContext,
+                                                                        List<Pair<String, Pair<String, List<String>>>> partitionFiles,
+                                                                        int secondaryIndexMaxParallelism,
+                                                                        String activeModule, HoodieTableMetaClient metaClient, EngineType engineType,
+                                                                        HoodieIndexDefinition indexDefinition) {
+    if (partitionFiles.isEmpty()) {
+      return engineContext.emptyHoodieData();
+    }
+    final int parallelism = Math.min(partitionFiles.size(), secondaryIndexMaxParallelism);
+    final String basePath = metaClient.getBasePathV2().toString();
+    Schema tableSchema;
+    try {
+      tableSchema = new TableSchemaResolver(metaClient).getTableAvroSchema();
+    } catch (Exception e) {
+      throw new HoodieException("Failed to get latest schema for " + metaClient.getBasePathV2(), e);
+    }
+
+    engineContext.setJobStatus(activeModule, "Secondary Index: reading secondary keys from " + partitionFiles.size() + " partitions");
+    return engineContext.parallelize(partitionFiles, parallelism).flatMap(partitionAndBaseFile -> {
+      final String partition = partitionAndBaseFile.getKey();
+      final Pair<String, List<String>> baseAndLogFiles = partitionAndBaseFile.getValue();
+      List<String> logFilePaths = new ArrayList<>();
+      baseAndLogFiles.getValue().forEach(logFile -> logFilePaths.add(basePath + StoragePath.SEPARATOR + partition + StoragePath.SEPARATOR + logFile));
+      String filePath = baseAndLogFiles.getKey();
+      Option<StoragePath> dataFilePath = filePath.isEmpty() ? Option.empty() : Option.of(filePath(basePath, partition, filePath));
+      Schema readerSchema;
+      if (dataFilePath.isPresent()) {
+        readerSchema = HoodieIOFactory.getIOFactory(metaClient.getStorage())
+            .getFileFormatUtils(metaClient.getTableConfig().getBaseFileFormat())
+            .readAvroSchema(metaClient.getStorage(), dataFilePath.get());
+      } else {
+        readerSchema = tableSchema;
+      }
+      return createSecondaryIndexGenerator(metaClient, engineType, logFilePaths, readerSchema, partition, dataFilePath, indexDefinition);
+    });
+  }
+
+  public static HoodieData<HoodieRecord> readSecondaryKeysFromFileSlices(HoodieEngineContext engineContext,
+                                                                         List<Pair<String, FileSlice>> partitionFileSlicePairs,
+                                                                         int secondaryIndexMaxParallelism,
+                                                                         String activeModule, HoodieTableMetaClient metaClient, EngineType engineType,
+                                                                         HoodieIndexDefinition indexDefinition) {
+    if (partitionFileSlicePairs.isEmpty()) {
+      return engineContext.emptyHoodieData();
+    }
+    final int parallelism = Math.min(partitionFileSlicePairs.size(), secondaryIndexMaxParallelism);
+    final String basePath = metaClient.getBasePathV2().toString();
+    Schema tableSchema;
+    try {
+      tableSchema = new TableSchemaResolver(metaClient).getTableAvroSchema();
+    } catch (Exception e) {
+      throw new HoodieException("Failed to get latest schema for " + metaClient.getBasePathV2(), e);
+    }
+
+    engineContext.setJobStatus(activeModule, "Secondary Index: re

(hudi) branch release-0.15.0 updated (fe08b6fecbc -> c009895c280)

2024-05-29 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch release-0.15.0
in repository https://gitbox.apache.org/repos/asf/hudi.git


from fe08b6fecbc [HUDI-7807] Fixing spark-sql for pk less tables (#11354)
 new 9e79996a48b [HUDI-7812] Disabling row writer for clustering (#11360)
 new c009895c280 [HUDI-7655] Ensuring clean action executor cleans up all intended files (#11363)

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../table/action/clean/CleanActionExecutor.java    |   6 +
 .../MultipleSparkJobExecutionStrategy.java         |   2 +-
 .../table/functional/TestCleanActionExecutor.java  | 188 +
 3 files changed, 195 insertions(+), 1 deletion(-)
 create mode 100644 hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestCleanActionExecutor.java



(hudi) 01/02: [HUDI-7812] Disabling row writer for clustering (#11360)

2024-05-29 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch release-0.15.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 9e79996a48b50bc2a136fc477d453cc2193e51fe
Author: Sivabalan Narayanan 
AuthorDate: Wed May 29 15:02:40 2024 -0700

[HUDI-7812] Disabling row writer for clustering (#11360)
---
 .../clustering/run/strategy/MultipleSparkJobExecutionStrategy.java  | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
index 976795b7dc6..eb59397b328 100644
--- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
+++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
@@ -119,7 +119,7 @@ public abstract class MultipleSparkJobExecutionStrategy
       Stream<HoodieData<WriteStatus>> writeStatusesStream = FutureUtils.allOf(
           clusteringPlan.getInputGroups().stream()
               .map(inputGroup -> {
-                if (getWriteConfig().getBooleanOrDefault("hoodie.datasource.write.row.writer.enable", true)) {
+                if (getWriteConfig().getBooleanOrDefault("hoodie.datasource.write.row.writer.enable", false)) {
                   return runClusteringForGroupAsyncAsRow(inputGroup,
                       clusteringPlan.getStrategy().getStrategyParams(),
                       shouldPreserveMetadata,
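
For users who still want the row-writer path for clustering after this default flip, a hedged sketch of opting back in (the property key is taken from the diff above; the map-based options style is illustrative):

    // Hypothetical: explicitly re-enable the row writer for clustering.
    Map<String, String> clusteringOptions = new HashMap<>();
    clusteringOptions.put("hoodie.datasource.write.row.writer.enable", "true");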



(hudi) 02/02: [HUDI-7655] Ensuring clean action executor cleans up all intended files (#11363)

2024-05-29 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch release-0.15.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit c009895c280aa13d3c06896f18d04660841ab902
Author: Sivabalan Narayanan 
AuthorDate: Wed May 29 17:36:07 2024 -0700

    [HUDI-7655] Ensuring clean action executor cleans up all intended files (#11363)
---
 .../table/action/clean/CleanActionExecutor.java|   6 +
 .../table/functional/TestCleanActionExecutor.java  | 188 +
 2 files changed, 194 insertions(+)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
index 83d8cbde4a3..6973d76c5d0 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
@@ -81,6 +81,12 @@ public class CleanActionExecutor extends BaseActionExecutor

+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.table.functional;
+
+import org.apache.hudi.avro.model.HoodieActionInstant;
+import org.apache.hudi.avro.model.HoodieCleanFileInfo;
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.avro.model.HoodieCleanPartitionMetadata;
+import org.apache.hudi.avro.model.HoodieCleanerPlan;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.engine.HoodieLocalEngineContext;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieCleaningPolicy;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StorageConfiguration;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.table.action.clean.CleanActionExecutor;
+import org.apache.hudi.table.action.clean.CleanPlanner;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.EnumSource;
+
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.UUID;
+
+import static org.apache.hudi.common.testutils.HoodieTestUtils.getDefaultStorageConf;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.mockito.ArgumentMatchers.any;
+import static org.mockito.Mockito.mock;
+import static org.mockito.Mockito.when;
+
+/**
+ * Tests Clean action executor.
+ */
+public class TestCleanActionExecutor {
+
+  private static final StorageConfiguration CONF = getDefaultStorageConf();
+  private final HoodieEngineContext context = new HoodieLocalEngineContext(CONF);
+  private final HoodieTable mockHoodieTable = mock(HoodieTable.class);
+  private HoodieTableMetaClient metaClient;
+  private FileSystem fs;
+
+  private static String PARTITION1 = "partition1";
+
+  String earliestInstant = "20231204194919610";
+  String earliestInstantMinusThreeDays = "20231201194919610";
+
+  @BeforeEach
+  void setUp() {
+metaClient = mock(HoodieTableMetaClient.class);
+when(mockHoodieTable.getMetaClient()).thenReturn(metaClient);
+HoodieTableConfig tableConfig = new HoodieTableConfig();
+when(metaClient.getTableConfig()).thenReturn(tableConfig);
+HoodieStorage storage = mock(HoodieStorage.class);
+when(metaClient.getStorage()).thenReturn(storage);
+when(mockHoodieTable.getStorage()).thenReturn(storage);
+fs = mock(FileSystem.class);
+when(storage.getFileSystem()).thenReturn(fs);
+when(fs.getConf()).thenReturn(CONF.unwrap());
+  }
+
+  @ParameterizedTest
+  @EnumSource(CleanFailureType.class)
+  void testPartialCleanFailure(Cle

Re: [I] [SUPPORT] NPE when using PySpark with release-0.15.0 [hudi]

2024-05-29 Thread via GitHub


geserdugarov commented on issue #11334:
URL: https://github.com/apache/hudi/issues/11334#issuecomment-2138693456

   Hi @yihua!
   That's a really impressive follow-up. Thank you for landing it!





Re: [I] [SUPPORT] NPE when using PySpark with release-0.15.0 [hudi]

2024-05-29 Thread via GitHub


yihua commented on issue #11334:
URL: https://github.com/apache/hudi/issues/11334#issuecomment-2138673728

   Hi @geserdugarov I've landed a fix (#11355 on master, #11356 on branch-0.x) 
to make `spark.kryo.registrator` config optional.  The fix is going to be in 
0.15.0-rc3 (cut from `release-0.15.0` branch).
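
   For reference, a hedged sketch of the explicit registration that was needed before the fix (the config keys are standard Spark settings; the registrator class is the one named in HUDI-7809):

       // Hypothetical session setup; with the fix, the registrator line becomes optional.
       SparkSession spark = SparkSession.builder()
           .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
           .config("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar")
           .getOrCreate();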





Re: [I] [SUPPORT] - Partial update of the MOR table after compaction with Hudi Streamer [hudi]

2024-05-29 Thread via GitHub


kirillklimenko commented on issue #11348:
URL: https://github.com/apache/hudi/issues/11348#issuecomment-2138665869

   > @kirillklimenko Normally, PartialUpdateAvroPayload should avoid updating columns that have null values.
   
   This is what I expect; however, it does not happen, which is strange.





Re: [I] [SUPPORT] - Partial update of the MOR table after compaction with Hudi Streamer [hudi]

2024-05-29 Thread via GitHub


ad1happy2go commented on issue #11348:
URL: https://github.com/apache/hudi/issues/11348#issuecomment-2138590861

   @kirillklimenko Normally, PartialUpdateAvroPayload should avoid updating columns that have null values.
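
   For context, a hedged sketch of selecting this payload for a Hudi Streamer run (the property key is Hudi's standard payload-class option; the values are illustrative):

       // Hypothetical: configure partial updates via the payload class.
       TypedProperties props = new TypedProperties();
       props.setProperty("hoodie.datasource.write.payload.class",
           "org.apache.hudi.common.model.PartialUpdateAvroPayload");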





Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]

2024-05-29 Thread via GitHub


danny0405 commented on code in PR #11146:
URL: https://github.com/apache/hudi/pull/11146#discussion_r1619761496


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieIndexDefinition.java:
##
@@ -45,14 +49,14 @@ public class HoodieFunctionalIndexDefinition implements Serializable {
   // Any other configuration or properties specific to the index
   private Map<String, String> indexOptions;
 
-  public HoodieFunctionalIndexDefinition() {
+  public HoodieIndexDefinition() {
   }
 
-  public HoodieFunctionalIndexDefinition(String indexName, String indexType, String indexFunction, List<String> sourceFields,
-                                         Map<String, String> indexOptions) {
+  public HoodieIndexDefinition(String indexName, String indexType, String indexFunction, List<String> sourceFields,
+                               Map<String, String> indexOptions) {
     this.indexName = indexName;
     this.indexType = indexType;
-    this.indexFunction = indexFunction;
+    this.indexFunction = nonEmpty(indexFunction) ? indexFunction : SPARK_IDENTITY;

Review Comment:
   Not sure why the empty string binds to Spark.



##
hudi-common/src/main/java/org/apache/hudi/keygen/constant/KeyGeneratorOptions.java:
##
@@ -54,6 +54,13 @@ public class KeyGeneratorOptions extends HoodieConfig {
       + "Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using\n"
       + "the dot notation eg: `a.b.c`");
 
+  public static final ConfigProperty<String> SECONDARYKEY_FIELD_NAME = ConfigProperty

Review Comment:
   SECONDARYKEY_COLUMN_NAME



##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -1833,12 +1837,153 @@ public static HoodieData<HoodieRecord> readRecordKeysFromFileSlices(HoodieEngine
     });
   }
 
-  public static Schema getProjectedSchemaForFunctionalIndex(HoodieFunctionalIndexDefinition indexDefinition, HoodieTableMetaClient metaClient) throws Exception {
+  public static Schema getProjectedSchemaForFunctionalIndex(HoodieIndexDefinition indexDefinition, HoodieTableMetaClient metaClient) throws Exception {
     TableSchemaResolver schemaResolver = new TableSchemaResolver(metaClient);
     Schema tableSchema = schemaResolver.getTableAvroSchema();
     return addMetadataFields(getSchemaForFields(tableSchema, indexDefinition.getSourceFields()));
   }
 
+  public static HoodieData<HoodieRecord> readSecondaryKeysFromBaseFiles(HoodieEngineContext engineContext,
+                                                                        List<Pair<String, Pair<String, List<String>>>> partitionFiles,
+                                                                        int secondaryIndexMaxParallelism,
+                                                                        String activeModule, HoodieTableMetaClient metaClient, EngineType engineType,
+                                                                        HoodieIndexDefinition indexDefinition) {
+    if (partitionFiles.isEmpty()) {
+      return engineContext.emptyHoodieData();
+    }
+    final int parallelism = Math.min(partitionFiles.size(), secondaryIndexMaxParallelism);
+    final String basePath = metaClient.getBasePathV2().toString();
+    Schema tableSchema;
+    try {
+      tableSchema = new TableSchemaResolver(metaClient).getTableAvroSchema();
+    } catch (Exception e) {
+      throw new HoodieException("Failed to get latest schema for " + metaClient.getBasePathV2(), e);
+    }
+
+    engineContext.setJobStatus(activeModule, "Secondary Index: reading secondary keys from " + partitionFiles.size() + " partitions");
+    return engineContext.parallelize(partitionFiles, parallelism).flatMap(partitionAndBaseFile -> {
+      final String partition = partitionAndBaseFile.getKey();
+      final Pair<String, List<String>> baseAndLogFiles = partitionAndBaseFile.getValue();
+      List<String> logFilePaths = new ArrayList<>();
+      baseAndLogFiles.getValue().forEach(logFile -> logFilePaths.add(basePath + StoragePath.SEPARATOR + partition + StoragePath.SEPARATOR + logFile));
+      String filePath = baseAndLogFiles.getKey();
+      Option<StoragePath> dataFilePath = filePath.isEmpty() ? Option.empty() : Option.of(filePath(basePath, partition, filePath));
+      Schema readerSchema;
+      if (dataFilePath.isPresent()) {
+        readerSchema = HoodieIOFactory.getIOFactory(metaClient.getStorage())
+            .getFileFormatUtils(metaClient.getTableConfig().getBaseFileFormat())
+            .readAvroSchema(metaClient.getStorage(), dataFilePath.get());
+      } else {
+        readerSchema = tableSchema;
+      }
+      return createSecondaryIndexGenerator(metaClient, engineType, logFilePaths, readerSchema, partition, dataFilePath, indexDefinition);
+    });
+  }
+
+  public static HoodieData<HoodieRecord> readSecondaryKeysFromFileSlices(HoodieEngineContext engineContext,
+                                                                         List<Pair<String, FileSlice>> partitionFileSlicePairs,
+                                                                         int seconda

Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11146:
URL: https://github.com/apache/hudi/pull/11146#issuecomment-2138579384

   
   ## CI report:
   
   * 470bc5f44e7a6658a8717ef1b77e92afcdd90087 UNKNOWN
   * e8a4507886bc97b1819ea39788f2abd7385b8cf2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24139)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11146:
URL: https://github.com/apache/hudi/pull/11146#issuecomment-2138536999

   
   ## CI report:
   
   * 470bc5f44e7a6658a8717ef1b77e92afcdd90087 UNKNOWN
   * 3d9b55d5f226698abf7339dc604f0e02d8fc8133 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24120)
   * e8a4507886bc97b1819ea39788f2abd7385b8cf2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24139)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11146:
URL: https://github.com/apache/hudi/pull/11146#issuecomment-2138531108

   
   ## CI report:
   
   * 470bc5f44e7a6658a8717ef1b77e92afcdd90087 UNKNOWN
   * 3d9b55d5f226698abf7339dc604f0e02d8fc8133 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24120)
   * e8a4507886bc97b1819ea39788f2abd7385b8cf2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7810] Fix OptionsResolver#allowCommitOnEmptyBatch default value… [hudi]

2024-05-29 Thread via GitHub


usberkeley commented on code in PR #11359:
URL: https://github.com/apache/hudi/pull/11359#discussion_r161976


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java:
##
@@ -370,7 +370,7 @@ public static ConflictResolutionStrategy getConflictResolutionStrategy(Configura
    * Returns whether to commit even when current batch has no data, for flink defaults false
    */
   public static boolean allowCommitOnEmptyBatch(Configuration conf) {
-    return conf.getBoolean(HoodieWriteConfig.ALLOW_EMPTY_COMMIT.key(), false);
+    return conf.getBoolean(HoodieWriteConfig.ALLOW_EMPTY_COMMIT.key(), HoodieWriteConfig.ALLOW_EMPTY_COMMIT.defaultValue());

Review Comment:
   let me check






Re: [PR] [HUDI-7814] Exclude unused transitive dependencies that introduce vulnerabilities [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11364:
URL: https://github.com/apache/hudi/pull/11364#issuecomment-2138522914

   
   ## CI report:
   
   * 3337f90b44d58d07c8a4055c9544f0e957d93226 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24138)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]

2024-05-29 Thread via GitHub


codope commented on code in PR #11146:
URL: https://github.com/apache/hudi/pull/11146#discussion_r1619691280


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -525,21 +534,66 @@ private Pair<Integer, HoodieData<HoodieRecord>> initializeFunctionalIndexPartiti
   }
 
   private Set<String> getFunctionalIndexPartitionsToInit() {
-    Set<String> functionalIndexPartitions = dataMetaClient.getFunctionalIndexMetadata().get().getIndexDefinitions().keySet();
+    Set<String> functionalIndexPartitions = dataMetaClient.getIndexMetadata().get().getIndexDefinitions().keySet();
     Set<String> completedMetadataPartitions = dataMetaClient.getTableConfig().getMetadataPartitions();
     functionalIndexPartitions.removeAll(completedMetadataPartitions);
     return functionalIndexPartitions;
   }
 
-  private HoodieFunctionalIndexDefinition getFunctionalIndexDefinition(String indexName) {
-    Option<HoodieFunctionalIndexMetadata> functionalIndexMetadata = dataMetaClient.getFunctionalIndexMetadata();
+  private HoodieIndexDefinition getFunctionalIndexDefinition(String indexName) {
+    Option<HoodieIndexMetadata> functionalIndexMetadata = dataMetaClient.getIndexMetadata();
     if (functionalIndexMetadata.isPresent()) {
       return functionalIndexMetadata.get().getIndexDefinitions().get(indexName);
     } else {
       throw new HoodieIndexException("Functional Index definition is not present");
     }
   }
 
+  private Set<String> getSecondaryIndexPartitionsToInit() {
+    Set<String> secondaryIndexPartitions = dataMetaClient.getIndexMetadata().get().getIndexDefinitions().values().stream()
+        .map(HoodieIndexDefinition::getIndexName)
+        .filter(indexName -> indexName.startsWith(HoodieTableMetadataUtil.PARTITION_NAME_SECONDARY_INDEX_PREFIX))
+        .collect(Collectors.toSet());
+    Set<String> completedMetadataPartitions = dataMetaClient.getTableConfig().getMetadataPartitions();
+    secondaryIndexPartitions.removeAll(completedMetadataPartitions);
+    return secondaryIndexPartitions;
+  }
+
+  private Pair<Integer, HoodieData<HoodieRecord>> initializeSecondaryIndexPartition(String indexName) throws IOException {
+    HoodieIndexDefinition indexDefinition = getFunctionalIndexDefinition(indexName);
+    ValidationUtils.checkState(indexDefinition != null, "Secondary Index definition is not present for index " + indexName);
+    List<Pair<String, FileSlice>> partitionFileSlicePairs = getPartitionFileSlicePairs();
+
+    // Reuse record index parallelism config to build secondary index

Review Comment:
   done






Re: [I] [SUPPORT] Flink bucket index partitioner may cause data skew [hudi]

2024-05-29 Thread via GitHub


xicm closed issue #11288: [SUPPORT] Flink bucket index partitioner may cause 
data skew
URL: https://github.com/apache/hudi/issues/11288





Re: [PR] [HUDI-7814] Exclude unused transitive dependencies that introduce vulnerabilities [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11364:
URL: https://github.com/apache/hudi/pull/11364#issuecomment-2138484947

   
   ## CI report:
   
   * 8a4dcafd9f65d08d21ae4e190ce46e9a950729ed Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24137)
   * 3337f90b44d58d07c8a4055c9544f0e957d93226 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24138)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7807:

Sprint: Sprint 2023-04-26

> spark-sql updates for a pk less table fails w/ partitioned table 
> -
>
> Key: HUDI-7807
> URL: https://issues.apache.org/jira/browse/HUDI-7807
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> quick start fails when trying to UPDATE with spark-sql for a pk less table. 
>  
> {code:java}
>          > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D';
> 24/05/28 11:44:41 WARN package: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> 24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET 
> fare = 25.0 WHERE rider = 'rider-D']
> org.apache.hudi.exception.HoodieException: Unable to instantiate class 
> org.apache.hudi.keygen.SimpleKeyGenerator
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123)
>   at 
> org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91)
>   at 
> org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47)
>   at 
> org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at 
> org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
>   at 
> org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecut

[jira] [Updated] (HUDI-7791) Bump h2 from 1.4.200 to 2.2.220 in /packaging/hudi-metaserver-server-bundle

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7791:

Sprint: Sprint 2023-04-26

> Bump h2 from 1.4.200 to 2.2.220 in /packaging/hudi-metaserver-server-bundle
> ---
>
> Key: HUDI-7791
> URL: https://issues.apache.org/jira/browse/HUDI-7791
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7796) Gracefully cast file system instance in Avro writers

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7796:

Sprint: Sprint 2023-04-26

> Gracefully cast file system instance in Avro writers
> 
>
> Key: HUDI-7796
> URL: https://issues.apache.org/jira/browse/HUDI-7796
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> When running tests in Trino with Hudi MDT enabled, the following line in 
> HoodieAvroHFileWriter throws class cast exception, since Trino uses 
> dependency injection to provide the Hadoop file system instance, which may 
> skip the Hudi wrapper file system logic.
> {code:java}
>     this.fs = (HoodieWrapperFileSystem) this.file.getFileSystem(conf); {code}
> {code:java}
> Caused by: java.lang.ClassCastException: class 
> io.trino.hdfs.TrinoFileSystemCache$FileSystemWrapper cannot be cast to class 
> org.apache.hudi.hadoop.fs.HoodieWrapperFileSystem 
> (io.trino.hdfs.TrinoFileSystemCache$FileSystemWrapper and 
> org.apache.hudi.hadoop.fs.HoodieWrapperFileSystem are in unnamed module of 
> loader 'app')
>     at org.apache.hudi.io.hadoop.HoodieAvroHFileWriter.<init>(HoodieAvroHFileWriter.java:91)
>     at org.apache.hudi.io.hadoop.HoodieAvroFileWriterFactory.newHFileFileWriter(HoodieAvroFileWriterFactory.java:108)
>     at org.apache.hudi.io.storage.HoodieFileWriterFactory.getFileWriterByFormat(HoodieFileWriterFactory.java:70)
>     at org.apache.hudi.io.storage.HoodieFileWriterFactory.getFileWriter(HoodieFileWriterFactory.java:53)
>     at org.apache.hudi.io.HoodieCreateHandle.<init>(HoodieCreateHandle.java:108)
>     at org.apache.hudi.io.HoodieCreateHandle.<init>(HoodieCreateHandle.java:77)
>     at 
> org.apache.hudi.io.CreateHandleFactory.create(CreateHandleFactory.java:45)
>     at 
> org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:101)
>     at 
> org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:44)
>  {code}
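
For illustration, a hedged sketch of the graceful handling the summary suggests (an instanceof guard instead of an unconditional cast; the PR's actual fix may differ):

{code:java}
// Hypothetical: only unwrap when the file system really is Hudi's wrapper.
FileSystem fileSystem = this.file.getFileSystem(conf);
this.fs = (fileSystem instanceof HoodieWrapperFileSystem)
    ? (HoodieWrapperFileSystem) fileSystem
    : null; // fall back to code paths that do not require the wrapper
{code}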





[jira] [Updated] (HUDI-7801) Directly pass down HoodieStorage instance instead of recreation

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7801:

Sprint: Sprint 2023-04-26

> Directly pass down HoodieStorage instance instead of recreation
> ---
>
> Key: HUDI-7801
> URL: https://issues.apache.org/jira/browse/HUDI-7801
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> There are places that use HoodieStorage#newInstance to recreate HoodieStorage 
> instance which may not be necessary.





[jira] [Updated] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7808:

Sprint: Sprint 2023-04-26

> Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
> --
>
> Key: HUDI-7808
> URL: https://issues.apache.org/jira/browse/HUDI-7808
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-7769) Fix Hudi CDC read with legacy parquet file format on Spark

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7769:

Sprint: Sprint 2023-04-26

> Fix Hudi CDC read with legacy parquet file format on Spark
> --
>
> Key: HUDI-7769
> URL: https://issues.apache.org/jira/browse/HUDI-7769
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7809:

Sprint: Sprint 2023-04-26

> Use Spark SerializableConfiguration to avoid NPE in Kryo serde
> --
>
> Key: HUDI-7809
> URL: https://issues.apache.org/jira/browse/HUDI-7809
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> With Hudi 0.14.1, without 
> "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar", Hudi 
> query in Spark quick start guide succeeds.  In Hudi 0.15.0-rc2, without the 
> Kryo registratrar, the Hudi read throws NPE due to HadoopStorageConfiguration.
> {code:java}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156)
>   at scala.Option.foreach(Option.scala:407)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
>   at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
>   at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:806)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:765)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:774)
>   ... 47 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_

[jira] [Updated] (HUDI-7790) Revert changes in DFSPathSelector and UtilHelpers.readConfig

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7790:

Sprint: Sprint 2023-04-26

> Revert changes in DFSPathSelector and UtilHelpers.readConfig
> 
>
> Key: HUDI-7790
> URL: https://issues.apache.org/jira/browse/HUDI-7790
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> This is to avoid behavior changes in DFSPathSelector and keep the 
> UtilHelpers.readConfig API the same as before.
>  





[jira] [Updated] (HUDI-7792) Bump h2 from 1.4.200 to 2.2.220 in /hudi-platform-service/hudi-metaserver

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7792:

Sprint: Sprint 2023-04-26

> Bump h2 from 1.4.200 to 2.2.220 in /hudi-platform-service/hudi-metaserver
> -
>
> Key: HUDI-7792
> URL: https://issues.apache.org/jira/browse/HUDI-7792
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7784) Fix serde of HoodieHadoopConfiguration in Spark

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7784:

Sprint: Sprint 2023-04-26

> Fix serde of HoodieHadoopConfiguration in Spark
> ---
>
> Key: HUDI-7784
> URL: https://issues.apache.org/jira/browse/HUDI-7784
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7785) Keep public APIs in utilities module the same as before HoodieStorage abstraction

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7785:

Sprint: Sprint 2023-04-26

> Keep public APIs in utilities module the same as before HoodieStorage 
> abstraction
> -
>
> Key: HUDI-7785
> URL: https://issues.apache.org/jira/browse/HUDI-7785
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> BaseErrorTableWriter, HoodieStreamer, StreamSync, etc., are public API 
> classes and contain public API methods, which should be kept the same as 
> before.





[jira] [Updated] (HUDI-7794) Bump org.apache.hive:hive-service from 2.3.1 to 2.3.4

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7794:

Sprint: Sprint 2023-04-26

> Bump org.apache.hive:hive-service from 2.3.1 to 2.3.4
> -
>
> Key: HUDI-7794
> URL: https://issues.apache.org/jira/browse/HUDI-7794
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7814) Exclude unused transitive dependencies that introduce vulnerabilities

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7814:

Sprint: Sprint 2023-04-26

> Exclude unused transitive dependencies that introduce vulnerabilities
> -
>
> Key: HUDI-7814
> URL: https://issues.apache.org/jira/browse/HUDI-7814
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.16.0
>
>






[jira] [Updated] (HUDI-7786) Fix roaring bitmap dependency in hudi-integ-test-bundle

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7786:

Sprint: Sprint 2023-04-26

> Fix roaring bitmap dependency in hudi-integ-test-bundle
> ---
>
> Key: HUDI-7786
> URL: https://issues.apache.org/jira/browse/HUDI-7786
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7798) Mark configs included in 0.15.0 release

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7798:

Sprint: Sprint 2023-04-26

> Mark configs included in 0.15.0 release
> ---
>
> Key: HUDI-7798
> URL: https://issues.apache.org/jira/browse/HUDI-7798
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> We need to mark the configs that go out in 0.15.0 release with 
> `.sinceVersion("0.15.0")`.
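
For reference, a config definition carrying the marker looks roughly like this (the key, type, and default below are made up for illustration; only the `.sinceVersion("0.15.0")` call is the point):

{code:java}
public static final ConfigProperty<Boolean> EXAMPLE_FLAG = ConfigProperty
    .key("hoodie.example.new.flag")   // hypothetical key
    .defaultValue(false)              // hypothetical default
    .sinceVersion("0.15.0")           // the marker this ticket is about
    .withDocumentation("Illustrative flag added in the 0.15.0 release.");
{code}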



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7802) Fix bundle validation scripts

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7802:

Sprint: Sprint 2023-04-26

> Fix bundle validation scripts
> -
>
> Key: HUDI-7802
> URL: https://issues.apache.org/jira/browse/HUDI-7802
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Issues:
>  * Bundle validation with packaging/bundle-validation/ci_run.sh fails for 
> the release-0.15.0 branch due to a script issue.
>  * scripts/release/validate_staged_bundles.sh needs to include additional 
> bundles.
>  * Add release candidate validation on Scala 2.13 bundles.
>  * Disable release candidate validation by default.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7788) Fixing exception handling in AverageRecordSizeUtils

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7788:

Sprint: Sprint 2023-04-26

> Fixing exception handling in AverageRecordSizeUtils
> ---
>
> Key: HUDI-7788
> URL: https://issues.apache.org/jira/browse/HUDI-7788
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> We should catch Throwable to avoid any issue during record size estimation.
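
A minimal sketch of the proposed pattern (the helper below is illustrative, not the actual AverageRecordSizeUtils code):

{code:java}
import java.util.function.LongSupplier;

final class SafeRecordSizeEstimate {
  // Run the estimation; on *any* failure (including Errors thrown while parsing
  // commit metadata) fall back to a configured estimate instead of failing the write.
  static long estimate(LongSupplier estimator, long fallbackBytes) {
    try {
      return estimator.getAsLong();
    } catch (Throwable t) {
      return fallbackBytes;
    }
  }
}
{code}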



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7777) Allow HoodieTableMetaClient to take HoodieStorage instance directly

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-:

Sprint: Sprint 2023-04-26

>  Allow HoodieTableMetaClient to take HoodieStorage instance directly
> 
>
> Key: HUDI-
> URL: https://issues.apache.org/jira/browse/HUDI-
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> We need to add functionality for the meta client to take a HoodieStorage instance directly.
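
A hypothetical usage sketch of what this enables (the setStorage builder method is an assumption, not confirmed API):

{code:java}
// Hypothetical: build the meta client from an existing HoodieStorage instance
// instead of deriving storage from a Hadoop configuration.
static HoodieTableMetaClient fromStorage(HoodieStorage storage, String basePath) {
  return HoodieTableMetaClient.builder()
      .setBasePath(basePath)
      .setStorage(storage) // assumed setter illustrating the requested capability
      .build();
}
{code}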



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7814] Exclude unused transitive dependencies that introduce vulnerabilities [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11364:
URL: https://github.com/apache/hudi/pull/11364#issuecomment-2138478691

   
   ## CI report:
   
   * 8a4dcafd9f65d08d21ae4e190ce46e9a950729ed Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24137)
 
   * 3337f90b44d58d07c8a4055c9544f0e957d93226 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7814] Exclude unused transitive dependencies that introduce vulnerabilities [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11364:
URL: https://github.com/apache/hudi/pull/11364#issuecomment-2138472556

   
   ## CI report:
   
   * 8a4dcafd9f65d08d21ae4e190ce46e9a950729ed Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24137)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7814) Exclude unused transitive dependencies that introduce vulnerabilities

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7814:

Fix Version/s: 1.0.0
   0.16.0

> Exclude unused transitive dependencies that introduce vulnerabilities
> -
>
> Key: HUDI-7814
> URL: https://issues.apache.org/jira/browse/HUDI-7814
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.16.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch branch-0.x updated: [HUDI-7655] Ensuring clean action executor cleans up all intended files (#11363)

2024-05-29 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/branch-0.x by this push:
 new bf278212f81 [HUDI-7655] Ensuring clean action executor cleans up all 
intended files (#11363)
bf278212f81 is described below

commit bf278212f817cc55a9301e52c9eac01926d9f56b
Author: Sivabalan Narayanan 
AuthorDate: Wed May 29 17:36:07 2024 -0700

[HUDI-7655] Ensuring clean action executor cleans up all intended files 
(#11363)
---
 .../table/action/clean/CleanActionExecutor.java|   6 +
 .../table/functional/TestCleanActionExecutor.java  | 188 +
 2 files changed, 194 insertions(+)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
index 83d8cbde4a3..6973d76c5d0 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
@@ -81,6 +81,12 @@ public class CleanActionExecutor extends BaseActionExecutor
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.table.functional;
+
+import org.apache.hudi.avro.model.HoodieActionInstant;
+import org.apache.hudi.avro.model.HoodieCleanFileInfo;
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.avro.model.HoodieCleanPartitionMetadata;
+import org.apache.hudi.avro.model.HoodieCleanerPlan;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.engine.HoodieLocalEngineContext;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieCleaningPolicy;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StorageConfiguration;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.table.action.clean.CleanActionExecutor;
+import org.apache.hudi.table.action.clean.CleanPlanner;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.EnumSource;
+
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.UUID;
+
+import static org.apache.hudi.common.testutils.HoodieTestUtils.getDefaultStorageConf;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.mockito.ArgumentMatchers.any;
+import static org.mockito.Mockito.mock;
+import static org.mockito.Mockito.when;
+
+/**
+ * Tests Clean action executor.
+ */
+public class TestCleanActionExecutor {
+
+  private static final StorageConfiguration CONF = getDefaultStorageConf();
+  private final HoodieEngineContext context = new HoodieLocalEngineContext(CONF);
+  private final HoodieTable mockHoodieTable = mock(HoodieTable.class);
+  private HoodieTableMetaClient metaClient;
+  private FileSystem fs;
+
+  private static String PARTITION1 = "partition1";
+
+  String earliestInstant = "20231204194919610";
+  String earliestInstantMinusThreeDays = "20231201194919610";
+
+  @BeforeEach
+  void setUp() {
+metaClient = mock(HoodieTableMetaClient.class);
+when(mockHoodieTable.getMetaClient()).thenReturn(metaClient);
+HoodieTableConfig tableConfig = new HoodieTableConfig();
+when(metaClient.getTableConfig()).thenReturn(tableConfig);
+HoodieStorage storage = mock(HoodieStorage.class);
+when(metaClient.getStorage()).thenReturn(storage);
+when(mockHoodieTable.getStorage()).thenReturn(storage);
+fs = mock(FileSystem.class);

Re: [PR] [HUDI-7655] Ensuring clean action executor cleans up all intended files [hudi]

2024-05-29 Thread via GitHub


yihua merged PR #11363:
URL: https://github.com/apache/hudi/pull/11363


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-7814) Exclude unused transitive dependencies that introduce vulnerabilities

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7814:
---

Assignee: Ethan Guo

> Exclude unused transitive dependencies that introduce vulnerabilities
> -
>
> Key: HUDI-7814
> URL: https://issues.apache.org/jira/browse/HUDI-7814
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7814] WIP [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11364:
URL: https://github.com/apache/hudi/pull/11364#issuecomment-2138436457

   
   ## CI report:
   
   * 8a4dcafd9f65d08d21ae4e190ce46e9a950729ed Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24137)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7814] WIP [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11364:
URL: https://github.com/apache/hudi/pull/11364#issuecomment-2138431517

   
   ## CI report:
   
   * 8a4dcafd9f65d08d21ae4e190ce46e9a950729ed UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7814) Exclude unused transitive dependencies that introduce vulnerabilities

2024-05-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7814:
-
Labels: pull-request-available  (was: )

> Exclude unused transitive dependencies that introduce vulnerabilities
> -
>
> Key: HUDI-7814
> URL: https://issues.apache.org/jira/browse/HUDI-7814
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7655] Ensuring clean action executor cleans up all intended files [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11363:
URL: https://github.com/apache/hudi/pull/11363#issuecomment-2138425661

   
   ## CI report:
   
   * e2d39fea99a1b6278a2f8665df380b1dcbd6cb2c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24136)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [HUDI-7814] WIP [hudi]

2024-05-29 Thread via GitHub


yihua opened a new pull request, #11364:
URL: https://github.com/apache/hudi/pull/11364

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-7814) Exclude unused transitive dependencies that introduce vulnerabilities

2024-05-29 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7814:
---

 Summary: Exclude unused transitive dependencies that introduce 
vulnerabilities
 Key: HUDI-7814
 URL: https://issues.apache.org/jira/browse/HUDI-7814
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7655] Ensuring clean action executor cleans up all intended files [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11363:
URL: https://github.com/apache/hudi/pull/11363#issuecomment-2138382521

   
   ## CI report:
   
   * e2d39fea99a1b6278a2f8665df380b1dcbd6cb2c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24136)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Fix GitHub CI concurrency [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11361:
URL: https://github.com/apache/hudi/pull/11361#issuecomment-2138382475

   
   ## CI report:
   
   * e137495282c0968274f71c4147c8d593529e8360 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24134)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] With autogenerated keys HoodieStreamer failing with error - ts(Part -ts) field not found in record [hudi]

2024-05-29 Thread via GitHub


nsivabalan commented on issue #10233:
URL: https://github.com/apache/hudi/issues/10233#issuecomment-2138376386

   For auto record key gen, you need to set the operation type to "INSERT". Can you give that a try? With "UPSERT" I see the same exception as you, but with "INSERT" I don't see any exception.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-7211) Relax need of ordering/precombine field for tables with autogenerated record keys for DeltaStreamer

2024-05-29 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850517#comment-17850517
 ] 

sivabalan narayanan commented on HUDI-7211:
---

For auto record key gen, you need to set the operation type to "INSERT". Can you 
give that a try?

> Relax need of ordering/precombine field for tables with autogenerated record 
> keys for DeltaStreamer
> ---
>
> Key: HUDI-7211
> URL: https://issues.apache.org/jira/browse/HUDI-7211
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Priority: Critical
> Fix For: 1.1.0
>
>
> [https://github.com/apache/hudi/issues/10233]
>  
> ```
> NOW=$(date '+%Y%m%dt%H%M%S')
> ${SPARK_HOME}/bin/spark-submit \
> --jars 
> ${path_prefix}/jars/${SPARK_V}/hudi-spark${SPARK_VERSION}-bundle_2.12-${HUDI_VERSION}.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> ${path_prefix}/jars/${SPARK_V}/hudi-utilities-slim-bundle_2.12-${HUDI_VERSION}.jar
>  \
> --target-base-path ${path_prefix}/testcases/stocks/data/target/${NOW} \
> --target-table stocks${NOW} \
> --table-type COPY_ON_WRITE \
> --base-file-format PARQUET \
> --props ${path_prefix}/testcases/stocks/configs/hoodie.properties \
> --source-class org.apache.hudi.utilities.sources.JsonDFSSource \
> --schemaprovider-class 
> org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
> --hoodie-conf 
> hoodie.deltastreamer.schemaprovider.source.schema.file=${path_prefix}/testcases/stocks/data/schema_without_ts.avsc
>  \
> --hoodie-conf 
> hoodie.deltastreamer.schemaprovider.target.schema.file=${path_prefix}/testcases/stocks/data/schema_without_ts.avsc
>  \
> --op UPSERT \
> --spark-master yarn \
> --hoodie-conf 
> hoodie.deltastreamer.source.dfs.root=${path_prefix}/testcases/stocks/data/source_without_ts
>  \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date \
> --hoodie-conf hoodie.datasource.write.keygenerator.type=SIMPLE \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=false \
> --hoodie-conf hoodie.metadata.enable=true
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7813) Hive Style partitioning on a bootstrap table is not configurable

2024-05-29 Thread Jonathan Vexler (Jira)
Jonathan Vexler created HUDI-7813:
-

 Summary: Hive Style partitioning on a bootstrap table is not 
configurable
 Key: HUDI-7813
 URL: https://issues.apache.org/jira/browse/HUDI-7813
 Project: Apache Hudi
  Issue Type: Bug
  Components: bootstrap
Reporter: Jonathan Vexler


I modified DecodedBootstrapPartitionPathTranslator to be:
{code:java}
public class DecodedBootstrapPartitionPathTranslator extends BootstrapPartitionPathTranslator {
  public DecodedBootstrapPartitionPathTranslator() {
    super();
  }

  @Override
  public String getBootstrapTranslatedPath(String bootStrapPartitionPath) {
    // Unescape the source partition path first.
    String pathMaybeWithHive = PartitionPathEncodeUtils.unescapePathName(bootStrapPartitionPath);
    if (pathMaybeWithHive.contains("=")) {
      // Strip hive-style "column=value" segments down to the bare value.
      return Arrays.stream(pathMaybeWithHive.split("/")).map(split -> {
        if (split.contains("=")) {
          return split.split("=")[1];
        } else {
          return split;
        }
      }).collect(Collectors.joining("/"));
    }
    return pathMaybeWithHive;
  }
} {code}
And setting hive-style partitioning to true does not add the "column=" prefix back.
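
For illustration, with the override above the translation behaves like this (partition paths are hypothetical):

{code:java}
DecodedBootstrapPartitionPathTranslator translator = new DecodedBootstrapPartitionPathTranslator();
translator.getBootstrapTranslatedPath("datestr=2024-05-29");           // -> "2024-05-29"
translator.getBootstrapTranslatedPath("region=us/datestr=2024-05-29"); // -> "us/2024-05-29"
translator.getBootstrapTranslatedPath("2024-05-29");                   // -> "2024-05-29" (unchanged)
{code}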



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]

2024-05-29 Thread via GitHub


danny0405 commented on code in PR #11146:
URL: https://github.com/apache/hudi/pull/11146#discussion_r1619526480


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -525,21 +534,66 @@ private Pair> initializeFunctionalIndexPartiti
   }
 
   private Set getFunctionalIndexPartitionsToInit() {
-    Set functionalIndexPartitions = dataMetaClient.getFunctionalIndexMetadata().get().getIndexDefinitions().keySet();
+    Set functionalIndexPartitions = dataMetaClient.getIndexMetadata().get().getIndexDefinitions().keySet();
     Set completedMetadataPartitions = dataMetaClient.getTableConfig().getMetadataPartitions();
     functionalIndexPartitions.removeAll(completedMetadataPartitions);
     return functionalIndexPartitions;
   }
 
-  private HoodieFunctionalIndexDefinition getFunctionalIndexDefinition(String indexName) {
-    Option functionalIndexMetadata = dataMetaClient.getFunctionalIndexMetadata();
+  private HoodieIndexDefinition getFunctionalIndexDefinition(String indexName) {
+    Option functionalIndexMetadata = dataMetaClient.getIndexMetadata();
     if (functionalIndexMetadata.isPresent()) {
       return functionalIndexMetadata.get().getIndexDefinitions().get(indexName);
     } else {
       throw new HoodieIndexException("Functional Index definition is not present");
     }
   }
 
+  private Set getSecondaryIndexPartitionsToInit() {
+    Set secondaryIndexPartitions = dataMetaClient.getIndexMetadata().get().getIndexDefinitions().values().stream()
+        .map(HoodieIndexDefinition::getIndexName)
+        .filter(indexName -> indexName.startsWith(HoodieTableMetadataUtil.PARTITION_NAME_SECONDARY_INDEX_PREFIX))
+        .collect(Collectors.toSet());
+    Set completedMetadataPartitions = dataMetaClient.getTableConfig().getMetadataPartitions();
+    secondaryIndexPartitions.removeAll(completedMetadataPartitions);
+    return secondaryIndexPartitions;
+  }
+
+  private Pair> initializeSecondaryIndexPartition(String indexName) throws IOException {
+    HoodieIndexDefinition indexDefinition = getFunctionalIndexDefinition(indexName);
+    ValidationUtils.checkState(indexDefinition != null, "Secondary Index definition is not present for index " + indexName);
+    List> partitionFileSlicePairs = getPartitionFileSlicePairs();
+
+    // Reuse record index parallelism config to build secondary index

Review Comment:
   Remove this line.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7655] Ensuring clean action executor cleans up all intended files [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11363:
URL: https://github.com/apache/hudi/pull/11363#issuecomment-2138338769

   
   ## CI report:
   
   * e2d39fea99a1b6278a2f8665df380b1dcbd6cb2c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch branch-0.x updated: [HUDI-7812] Disabling row writer for clustering (#11360)

2024-05-29 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/branch-0.x by this push:
 new 9485ba08a89 [HUDI-7812] Disabling row writer for clustering (#11360)
9485ba08a89 is described below

commit 9485ba08a89d2c0495d04d18691cb2fa87dc68a1
Author: Sivabalan Narayanan 
AuthorDate: Wed May 29 15:02:40 2024 -0700

[HUDI-7812] Disabling row writer for clustering (#11360)
---
 .../clustering/run/strategy/MultipleSparkJobExecutionStrategy.java  | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
index 976795b7dc6..eb59397b328 100644
--- a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
+++ b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
@@ -119,7 +119,7 @@ public abstract class MultipleSparkJobExecutionStrategy
   Stream> writeStatusesStream = FutureUtils.allOf(
   clusteringPlan.getInputGroups().stream()
   .map(inputGroup -> {
-    if (getWriteConfig().getBooleanOrDefault("hoodie.datasource.write.row.writer.enable", true)) {
+    if (getWriteConfig().getBooleanOrDefault("hoodie.datasource.write.row.writer.enable", false)) {
   return runClusteringForGroupAsyncAsRow(inputGroup,
   clusteringPlan.getStrategy().getStrategyParams(),
   shouldPreserveMetadata,
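
Note: the change above only flips the hardcoded fallback default for the clustering path; a job that still wants the row-writer path can opt back in explicitly via the same key shown in the diff:

    hoodie.datasource.write.row.writer.enable=true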



Re: [PR] [HUDI-7812] Disabling row writer for clustering [hudi]

2024-05-29 Thread via GitHub


yihua merged PR #11360:
URL: https://github.com/apache/hudi/pull/11360


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Fix GitHub CI concurrency [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11362:
URL: https://github.com/apache/hudi/pull/11362#issuecomment-2138329584

   
   ## CI report:
   
   * 38da0bc8b0330a2716cd31390bd7aab3be48a540 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24135)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7810] Fix OptionsResolver#allowCommitOnEmptyBatch default value… [hudi]

2024-05-29 Thread via GitHub


danny0405 commented on code in PR #11359:
URL: https://github.com/apache/hudi/pull/11359#discussion_r1619518356


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java:
##
@@ -370,7 +370,7 @@ public static ConflictResolutionStrategy getConflictResolutionStrategy(Configura
   * Returns whether to commit even when current batch has no data, for flink defaults false
   */
   public static boolean allowCommitOnEmptyBatch(Configuration conf) {
-    return conf.getBoolean(HoodieWriteConfig.ALLOW_EMPTY_COMMIT.key(), false);
+    return conf.getBoolean(HoodieWriteConfig.ALLOW_EMPTY_COMMIT.key(), HoodieWriteConfig.ALLOW_EMPTY_COMMIT.defaultValue());

Review Comment:
   Hmm, there are some flink test failures.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [HUDI-7655] Ensuring clean action executor cleans up all intended files [hudi]

2024-05-29 Thread via GitHub


nsivabalan opened a new pull request, #11363:
URL: https://github.com/apache/hudi/pull/11363

   ### Change Logs
   
   This patch is for branch-0.x; the master patch is https://github.com/apache/hudi/pull/11343.
   The clean action executor must not silently skip files that are meant to be deleted. Due to networking glitches, there is a chance that some files are not cleaned up. As of now, the clean action execution still succeeds and the misses are only tracked as "failedDeleteFiles". If a file to be deleted does not exist or was already cleaned up, we are fine; but if the deletion fails with any other exception, we cannot get past the issue. That could cause data consistency problems once the commit of interest moves into the archived timeline (specifically in the case of replace commits). So this patch fixes the flow: the clean action executor now ensures that all intended files are deleted, and on any other exception the clean execution fails.
   
   Added tests to cover the scenarios below (see the sketch after the list):
   1. fs.delete returns true
   2. fs.delete returns false, fs.exists returns false
   3. fs.delete returns false, fs.exists returns true (clean execution is expected to fail)
   4. fs.delete throws FileNotFoundException
   5. fs.delete throws a RuntimeException (clean execution is expected to fail)
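   
   A minimal sketch of the intended delete semantics, assuming a Hadoop FileSystem (illustrative only, not the exact patch):
   
   ```java
   import java.io.FileNotFoundException;
   import java.io.IOException;
   
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   
   final class StrictDelete {
     // Any exception other than FileNotFoundException (scenario 5) propagates
     // and fails the clean execution.
     static boolean deleteAndVerify(FileSystem fs, Path path) throws IOException {
       try {
         boolean deleted = fs.delete(path, false);
         if (!deleted && fs.exists(path)) {
           // Scenario 3: delete reported failure and the file still exists.
           throw new IOException("Failed to delete " + path + "; failing the clean action");
         }
         return true; // scenarios 1 and 2: deleted now, or already gone
       } catch (FileNotFoundException e) {
         return true; // scenario 4: file was already removed
       }
     }
   }
   ```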
   
   ### Impact
   
   Robust clean action. 
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7812] Disabling row writer for clustering [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11360:
URL: https://github.com/apache/hudi/pull/11360#issuecomment-2138265596

   
   ## CI report:
   
   * cc232dba4b0572f7afeb033cb74cf80be879c445 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24132)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Fix GitHub CI concurrency [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11361:
URL: https://github.com/apache/hudi/pull/11361#issuecomment-2138265643

   
   ## CI report:
   
   * e137495282c0968274f71c4147c8d593529e8360 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24134)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Fix GitHub CI concurrency [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11362:
URL: https://github.com/apache/hudi/pull/11362#issuecomment-2138265692

   
   ## CI report:
   
   * 38da0bc8b0330a2716cd31390bd7aab3be48a540 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24135)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Fix GitHub CI concurrency [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11362:
URL: https://github.com/apache/hudi/pull/11362#issuecomment-2138252456

   
   ## CI report:
   
   * 38da0bc8b0330a2716cd31390bd7aab3be48a540 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Fix GitHub CI concurrency [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11361:
URL: https://github.com/apache/hudi/pull/11361#issuecomment-2138252417

   
   ## CI report:
   
   * e137495282c0968274f71c4147c8d593529e8360 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7713] Enforce ordering of fields during schema reconciliation [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11154:
URL: https://github.com/apache/hudi/pull/11154#issuecomment-2138241541

   
   ## CI report:
   
   * 12038dbde068e26f733a7b1c9cc7217019c31f25 UNKNOWN
   * 0d8349e7a74982fd69267c7db481b9648bcd80a5 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24131)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7655] Ensuring clean action executor cleans up all intended files [hudi]

2024-05-29 Thread via GitHub


yihua commented on PR #11343:
URL: https://github.com/apache/hudi/pull/11343#issuecomment-2138239696

   GH CI has an issue where it skips a couple of actions, which are irrelevant here. Merging the PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated (0e6e49abeac -> e8a36ce3a3e)

2024-05-29 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 0e6e49abeac [HUDI-7770] Parse partition path from hudi directory for 
bootstrap tables (#11237)
 add e8a36ce3a3e [HUDI-7655] Ensuring clean action executor cleans up all 
intended files (#11343)

No new revisions were added by this update.

Summary of changes:
 .../table/action/clean/CleanActionExecutor.java|   6 +
 .../table/functional/TestCleanActionExecutor.java  | 188 +
 2 files changed, 194 insertions(+)
 create mode 100644 
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestCleanActionExecutor.java



Re: [PR] [HUDI-7655] Ensuring clean action executor cleans up all intended files [hudi]

2024-05-29 Thread via GitHub


yihua merged PR #11343:
URL: https://github.com/apache/hudi/pull/11343


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [MINOR] Fix GitHub CI concurrency [hudi]

2024-05-29 Thread via GitHub


yihua opened a new pull request, #11362:
URL: https://github.com/apache/hudi/pull/11362

   ### Change Logs
   
   PR targeting master: https://github.com/apache/hudi/pull/11361
   This PR targets `branch-0.x` with the same changes.
   
   As above, to avoid unnecessary cancels and skips.
   
   ### Impact
   
   Improves GitHub CI stability.
   
   ### Risk level
   
   none
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [MINOR] Fix GitHub CI concurrency [hudi]

2024-05-29 Thread via GitHub


yihua opened a new pull request, #11361:
URL: https://github.com/apache/hudi/pull/11361

   ### Change Logs
   
   As above, to avoid unnecessary cancels and skips.
   
   ### Impact
   
   Improves GitHub CI stability.
   
   ### Risk level
   
   none
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7812] Disabling row writer for clustering [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11360:
URL: https://github.com/apache/hudi/pull/11360#issuecomment-2138179707

   
   ## CI report:
   
   * cc232dba4b0572f7afeb033cb74cf80be879c445 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24132)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7655] Ensuring clean action executor cleans up all intended files [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11343:
URL: https://github.com/apache/hudi/pull/11343#issuecomment-2138179616

   
   ## CI report:
   
   * e41510c1b33efe37feb7e791c61c0a85b647c8ab Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24130)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7812] Disabling row writer for clustering [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11360:
URL: https://github.com/apache/hudi/pull/11360#issuecomment-2138166141

   
   ## CI report:
   
   * cc232dba4b0572f7afeb033cb74cf80be879c445 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7713] Enforce ordering of fields during schema reconciliation [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11154:
URL: https://github.com/apache/hudi/pull/11154#issuecomment-2138165427

   
   ## CI report:
   
   * 12038dbde068e26f733a7b1c9cc7217019c31f25 UNKNOWN
   * 11bf799de8acdf9e8683eaa170bd69556d1feaab Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24129)
 
   * 0d8349e7a74982fd69267c7db481b9648bcd80a5 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24131)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7713] Enforce ordering of fields during schema reconciliation [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11154:
URL: https://github.com/apache/hudi/pull/11154#issuecomment-2138153060

   
   ## CI report:
   
   * 12038dbde068e26f733a7b1c9cc7217019c31f25 UNKNOWN
   * 11bf799de8acdf9e8683eaa170bd69556d1feaab Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24129)
 
   * 0d8349e7a74982fd69267c7db481b9648bcd80a5 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-7812) Async Clustering w/ row writer fails due to timetravel query validation

2024-05-29 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7812:
-

Assignee: sivabalan narayanan

> Async Clustering w/ row writer fails due to timetravel query validation 
> 
>
> Key: HUDI-7812
> URL: https://issues.apache.org/jira/browse/HUDI-7812
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: clustering
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> With the clustering row-writer flow enabled, we trigger a time travel query to 
> read input records. But the query side fails if there are any pending commits 
> (due to new ingestion) whose timestamp < clustering instant time. We need to 
> relax this constraint. 
>  
> {code:java}
> Failed to execute CLUSTERING service
>     java.util.concurrent.CompletionException: 
> org.apache.hudi.exception.HoodieTimeTravelException: Time travel's timestamp 
> '20240406123837295' must be earlier than the first incomplete commit 
> timestamp '20240406123834233'.
>         at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
>  ~[?:1.8.0_392-internal]
>         at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
>  ~[?:1.8.0_392-internal]
>         at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
>  ~[?:1.8.0_392-internal]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  ~[?:1.8.0_392-internal]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  ~[?:1.8.0_392-internal]
>         at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_392-internal]
>     Caused by: org.apache.hudi.exception.HoodieTimeTravelException: Time 
> travel's timestamp '20240406123837295' must be earlier than the first 
> incomplete commit timestamp '20240406123834233'.
>         at 
> org.apache.hudi.common.table.timeline.TimelineUtils.validateTimestampAsOf(TimelineUtils.java:369)
>  ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at 
> org.apache.hudi.HoodieBaseRelation.$anonfun$listLatestFileSlices$1(HoodieBaseRelation.scala:416)
>  ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at 
> org.apache.hudi.HoodieBaseRelation.$anonfun$listLatestFileSlices$1$adapted(HoodieBaseRelation.scala:416)
>  ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at scala.Option.foreach(Option.scala:407) 
> ~[scala-library-2.12.17.jar:?]
>         at 
> org.apache.hudi.HoodieBaseRelation.listLatestFileSlices(HoodieBaseRelation.scala:416)
>  ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at 
> org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:225)
>  ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at 
> org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:68)
>  ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at 
> org.apache.hudi.HoodieBaseRelation.buildScan(HoodieBaseRelation.scala:369) 
> ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$apply$4(DataSourceStrategy.scala:323)
>  ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
>         at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$pruneFilterProject$1(DataSourceStrategy.scala:357)
>  ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
>         at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:413)
>  ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
>         at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:356)
>  ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
>         at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:323)
>  ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
>         at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
>  ~[spark-catalyst_2.12-3.2.3.jar:3.2.3]
>         at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) 
> ~[scala-library-2.12.17.jar:?]
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) 
> ~[scala-library-2.12.17.jar:?]
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491) 
> ~[scala-library-2.12.17.jar:?]
>         at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:

[jira] [Updated] (HUDI-7812) Async Clustering w/ row writer fails due to timetravel query validation

2024-05-29 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7812:
--
Description: 
With the clustering row-writer flow enabled, we trigger a time travel query to read 
input records. But the query side fails if there are any pending commits (due 
to new ingestion) whose timestamp < clustering instant time. We need to relax 
this constraint. 

 
{code:java}
Failed to execute CLUSTERING service
    java.util.concurrent.CompletionException: org.apache.hudi.exception.HoodieTimeTravelException: Time travel's timestamp '20240406123837295' must be earlier than the first incomplete commit timestamp '20240406123834233'.
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273) ~[?:1.8.0_392-internal]
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280) ~[?:1.8.0_392-internal]
        at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606) ~[?:1.8.0_392-internal]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_392-internal]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_392-internal]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_392-internal]
    Caused by: org.apache.hudi.exception.HoodieTimeTravelException: Time travel's timestamp '20240406123837295' must be earlier than the first incomplete commit timestamp '20240406123834233'.
        at org.apache.hudi.common.table.timeline.TimelineUtils.validateTimestampAsOf(TimelineUtils.java:369) ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at org.apache.hudi.HoodieBaseRelation.$anonfun$listLatestFileSlices$1(HoodieBaseRelation.scala:416) ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at org.apache.hudi.HoodieBaseRelation.$anonfun$listLatestFileSlices$1$adapted(HoodieBaseRelation.scala:416) ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at scala.Option.foreach(Option.scala:407) ~[scala-library-2.12.17.jar:?]
        at org.apache.hudi.HoodieBaseRelation.listLatestFileSlices(HoodieBaseRelation.scala:416) ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:225) ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:68) ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at org.apache.hudi.HoodieBaseRelation.buildScan(HoodieBaseRelation.scala:369) ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$apply$4(DataSourceStrategy.scala:323) ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$pruneFilterProject$1(DataSourceStrategy.scala:357) ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:413) ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:356) ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:323) ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
        at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63) ~[spark-catalyst_2.12-3.2.3.jar:3.2.3]
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) ~[scala-library-2.12.17.jar:?]
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) ~[scala-library-2.12.17.jar:?]
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491) ~[scala-library-2.12.17.jar:?]
        at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93) ~[spark-catalyst_2.12-3.2.3.jar:3.2.3]
        at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:67) ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
        at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78) ~[spark-catalyst_2.12-3.2.3.jar:3.2.3]
        at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196) ~[scala-library-2.12.17.jar:?]
        at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194) ~[scala-library-2.12.17.jar:?]
        at scala.collection.Iterator.foreach(Iterator.scala:943) ~[scala-library-2.12.17.jar:?]
        at scala.collection.Iterator.fore
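
For context, the check that throws here is TimelineUtils.validateTimestampAsOf: a time travel read must use a timestamp strictly earlier than the first incomplete commit on the timeline. A minimal standalone sketch of that comparison, assuming Hudi's fixed-width yyyyMMddHHmmssSSS instant format (so lexicographic order matches chronological order); the real method throws HoodieTimeTravelException, swapped for IllegalStateException here to keep the sketch dependency-free:

```java
import java.util.Optional;

// Standalone sketch of the validation seen in the stack trace above.
public class TimeTravelCheckSketch {

  static void validateTimestampAsOf(String timestampAsOf, Optional<String> firstIncompleteCommit) {
    firstIncompleteCommit.ifPresent(incomplete -> {
      // Time travel must stop strictly before the first incomplete commit,
      // otherwise the read could observe a partially written instant.
      if (timestampAsOf.compareTo(incomplete) >= 0) {
        throw new IllegalStateException(String.format(
            "Time travel's timestamp '%s' must be earlier than the first incomplete commit timestamp '%s'.",
            timestampAsOf, incomplete));
      }
    });
  }

  public static void main(String[] args) {
    // Reproduces the failing case from the log: clustering reads as of instant
    // 20240406123837295 while ingestion commit 20240406123834233 is still pending.
    validateTimestampAsOf("20240406123837295", Optional.of("20240406123834233"));
  }
}
```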

[jira] [Updated] (HUDI-7812) Async Clustering w/ row writer fails due to timetravel query validation

2024-05-29 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7812:
--
Description: 
With the clustering row-writer-enabled flow, we trigger a time travel query to 
read input records. But the query side fails if there are any pending commits 
(due to new ingestion) whose timestamp is earlier than the clustering instant 
time. We need to relax this constraint.

  was:
With the clustering row-writer-enabled flow, we trigger a time travel query to 
read input records. But the query side fails if there are any pending commits 
(due to new ingestion) whose timestamp is earlier than the clustering instant 
time. We need to relax this constraint.


> Async Clustering w/ row writer fails due to timetravel query validation 
> 
>
> Key: HUDI-7812
> URL: https://issues.apache.org/jira/browse/HUDI-7812
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: clustering
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> With the clustering row-writer-enabled flow, we trigger a time travel query 
> to read input records. But the query side fails if there are any pending 
> commits (due to new ingestion) whose timestamp is earlier than the clustering 
> instant time. We need to relax this constraint. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7812) Async Clustering w/ row writer fails due to timetravel query validation

2024-05-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7812:
-
Labels: pull-request-available  (was: )

> Async Clustering w/ row writer fails due to timetravel query validation 
> 
>
> Key: HUDI-7812
> URL: https://issues.apache.org/jira/browse/HUDI-7812
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: clustering
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> With the clustering row-writer-enabled flow, we trigger a time travel query 
> to read input records. But the query side fails if there are any pending 
> commits (due to new ingestion) whose timestamp is earlier than the clustering 
> instant time. We need to relax this constraint. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7812] Disabling row writer for clustering [hudi]

2024-05-29 Thread via GitHub


nsivabalan opened a new pull request, #11360:
URL: https://github.com/apache/hudi/pull/11360

   ### Change Logs
   
   Due to the time travel query restriction, this patch disables the row 
writer for clustering by default. It targets branch-0-x; we will triage the 
issue and relax the constraint on master after the 0.15.0 release. 
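   
   As a side note for anyone relying on row-writer clustering meanwhile, a 
minimal sketch of the writer options involved (`hoodie.clustering.async.enabled` 
and `hoodie.datasource.write.row.writer.enable` are standard Hudi config keys; 
whether this patch flips exactly that default is an assumption to verify 
against the diff):
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   // Illustrative only: options a writer could set explicitly until the
   // time-travel constraint is relaxed on master.
   public class ClusteringRowWriterOpts {
     public static void main(String[] args) {
       Map<String, String> opts = new HashMap<>();
       opts.put("hoodie.clustering.async.enabled", "true");
       // Fall back to the Avro write path so clustering avoids the
       // time-travel read that trips the validation.
       opts.put("hoodie.datasource.write.row.writer.enable", "false");
       opts.forEach((k, v) -> System.out.println(k + "=" + v));
     }
   }
   ```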
   
   ### Impact
   
   With the row writer disabled for clustering by default, async clustering 
will no longer fail on the time travel validation. 
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-7812) Async Clustering w/ row writer fails due to timetravel query validation

2024-05-29 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7812:
-

 Summary: Async Clustering w/ row writer fails due to timetravel 
query validation 
 Key: HUDI-7812
 URL: https://issues.apache.org/jira/browse/HUDI-7812
 Project: Apache Hudi
  Issue Type: Bug
  Components: clustering
Reporter: sivabalan narayanan


With the clustering row-writer-enabled flow, we trigger a time travel query to 
read input records. But the query side fails if there are any pending commits 
(due to new ingestion) whose timestamp is earlier than the clustering instant 
time. We need to relax this constraint. 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7655] Ensuring clean action executor cleans up all intended files [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11343:
URL: https://github.com/apache/hudi/pull/11343#issuecomment-2138084310

   
   ## CI report:
   
   * da5fde85d00993e849321dcadf001fbcd524d1d0 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24109)
 
   * e41510c1b33efe37feb7e791c61c0a85b647c8ab Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24130)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7655] Ensuring clean action executor cleans up all intended files [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11343:
URL: https://github.com/apache/hudi/pull/11343#issuecomment-2138071569

   
   ## CI report:
   
   * da5fde85d00993e849321dcadf001fbcd524d1d0 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24109)
 
   * e41510c1b33efe37feb7e791c61c0a85b647c8ab UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11043:
URL: https://github.com/apache/hudi/pull/11043#issuecomment-2138070700

   
   ## CI report:
   
   * 541b544049e68b3d22cdf0f5159fbd9b0005d345 UNKNOWN
   * 87c15b2c23430d967749dede5e09d74a33dcce88 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24128)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7713] Enforce ordering of fields during schema reconciliation [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11154:
URL: https://github.com/apache/hudi/pull/11154#issuecomment-2138057278

   
   ## CI report:
   
   * 12038dbde068e26f733a7b1c9cc7217019c31f25 UNKNOWN
   * 11bf799de8acdf9e8683eaa170bd69556d1feaab Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24129)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hudi Sink Connector shows broker disconnected [hudi]

2024-05-29 Thread via GitHub


prabodh1194 commented on issue #9070:
URL: https://github.com/apache/hudi/issues/9070#issuecomment-2138022231

   Yeah, I just wanted to check out kafka-connect and got massively stuck on 
this issue :( Anyway, I think prefixing the props with 
   `consumer.override` works well.
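
   For anyone else landing here, a sketch of what that prefixing looks like 
(the values are placeholders, not the reporter's actual config; 
`consumer.override.` is Kafka Connect's standard per-connector override prefix 
and requires the worker's `connector.client.config.override.policy` to permit 
overrides):

   ```java
   import java.util.Properties;

   // Sketch: per-connector consumer settings in Kafka Connect are expressed
   // by prefixing plain consumer properties with "consumer.override.".
   public class HudiSinkConsumerOverrides {
     public static void main(String[] args) {
       Properties connectorProps = new Properties();
       connectorProps.setProperty("consumer.override.bootstrap.servers", "broker1:9092");
       connectorProps.setProperty("consumer.override.session.timeout.ms", "30000");
       connectorProps.forEach((k, v) -> System.out.println(k + "=" + v));
     }
   }
   ```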


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7655] Ensuring clean action executor cleans up all intended files [hudi]

2024-05-29 Thread via GitHub


nsivabalan commented on code in PR #11343:
URL: https://github.com/apache/hudi/pull/11343#discussion_r1619309145


##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestCleanActionExecutor.java:
##
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.table.functional;
+
+import org.apache.hudi.avro.model.HoodieActionInstant;
+import org.apache.hudi.avro.model.HoodieCleanFileInfo;
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.avro.model.HoodieCleanPartitionMetadata;
+import org.apache.hudi.avro.model.HoodieCleanerPlan;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.engine.HoodieLocalEngineContext;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieCleaningPolicy;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StorageConfiguration;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.table.action.clean.CleanActionExecutor;
+import org.apache.hudi.table.action.clean.CleanPlanner;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.ValueSource;
+
+import java.io.IOException;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.UUID;
+
+import static org.apache.hudi.common.testutils.HoodieTestUtils.getDefaultStorageConf;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.mockito.ArgumentMatchers.any;
+import static org.mockito.Mockito.mock;
+import static org.mockito.Mockito.when;
+
+/**
+ * Tests Clean action executor.
+ */
+public class TestCleanActionExecutor {
+
+  private static final StorageConfiguration<Configuration> CONF = getDefaultStorageConf();
+  private final HoodieEngineContext context = new HoodieLocalEngineContext(CONF);
+  private final HoodieTable<?, ?, ?, ?> mockHoodieTable = mock(HoodieTable.class);
+  private HoodieTableMetaClient metaClient;
+  private FileSystem fs;
+
+  private static String PARTITION1 = "partition1";
+
+  String earliestInstant = "20231204194919610";
+  String earliestInstantMinusThreeDays = "20231201194919610";
+
+  @BeforeEach
+  void setUp() {
+    metaClient = mock(HoodieTableMetaClient.class);
+    when(mockHoodieTable.getMetaClient()).thenReturn(metaClient);
+    HoodieTableConfig tableConfig = new HoodieTableConfig();
+    when(metaClient.getTableConfig()).thenReturn(tableConfig);
+    HoodieStorage storage = mock(HoodieStorage.class);
+    when(metaClient.getStorage()).thenReturn(storage);
+    when(mockHoodieTable.getStorage()).thenReturn(storage);
+    fs = mock(FileSystem.class);
+    when(storage.getFileSystem()).thenReturn(fs);
+    when(fs.getConf()).thenReturn(CONF.unwrap());
+  }
+
+  @ParameterizedTest
+  @ValueSource(booleans = {true, false})
+  void testPartialCleanFailure(boolean simulateFailedDeletion) throws IOException {
+    HoodieWriteConfig config = getCleanByCommitsConfig();
+    String fileGroup = UUID.randomUUID() + "-0";
+    HoodieBaseFile baseFile = new HoodieBaseFile(String.format("/tmp/base/%s_1-0-1_%s.parquet", fileGroup, "001"));
+    FileSystem localFs = new Path(baseFile.getPath()).getFileSystem(CONF.unwrap());
+    Path filePath = new Path(baseFile.getPath());
+    localFs.create(filePath);
+    String ex

Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11162:
URL: https://github.com/apache/hudi/pull/11162#issuecomment-2137985942

   
   ## CI report:
   
   * 9d0e80222f6cc69b2dba6f4cdbfc642f31a95e52 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24124)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7713] Enforce ordering of fields during schema reconciliation [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11154:
URL: https://github.com/apache/hudi/pull/11154#issuecomment-2137985796

   
   ## CI report:
   
   * 12038dbde068e26f733a7b1c9cc7217019c31f25 UNKNOWN
   * 00b4e2d253b857c9387964eacfbfa46949a21580 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24092)
 
   * 11bf799de8acdf9e8683eaa170bd69556d1feaab UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]

2024-05-29 Thread via GitHub


hudi-bot commented on PR #11043:
URL: https://github.com/apache/hudi/pull/11043#issuecomment-2137985435

   
   ## CI report:
   
   * 541b544049e68b3d22cdf0f5159fbd9b0005d345 UNKNOWN
   * 05041565a628a2883ec642ecdfcddf0d66f4e7ce Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24090)
 
   * 87c15b2c23430d967749dede5e09d74a33dcce88 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24128)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7713] Enforce ordering of fields during schema reconciliation [hudi]

2024-05-29 Thread via GitHub


the-other-tim-brown commented on code in PR #11154:
URL: https://github.com/apache/hudi/pull/11154#discussion_r1619287772


##
hudi-spark-datasource/hudi-spark-common/src/test/java/org/apache/hudi/TestHoodieSchemaUtils.java:
##
@@ -239,6 +240,51 @@ void testMissingColumn(boolean allowDroppedColumns) {
 }
   }
 
+  @Test
+  void testFieldReordering() {
+    // field order changes and incoming schema is missing an existing field
+    Schema start = createRecord("reorderFields",
+        createPrimitiveField("field1", Schema.Type.INT),
+        createPrimitiveField("field2", Schema.Type.INT),
+        createPrimitiveField("field3", Schema.Type.INT));
+    Schema end = createRecord("reorderFields",
+        createPrimitiveField("field3", Schema.Type.INT),
+        createPrimitiveField("field1", Schema.Type.INT));
+    assertEquals(start, deduceWriterSchema(end, start, true));
+
+    // nested field ordering changes and new field is added
+    start = createRecord("reorderNestedFields",
+        createPrimitiveField("field1", Schema.Type.INT),
+        createPrimitiveField("field2", Schema.Type.INT),
+        createArrayField("field3", createRecord("nestedRecord",
+            createPrimitiveField("nestedField1", Schema.Type.INT),
+            createPrimitiveField("nestedField2", Schema.Type.INT),
+            createPrimitiveField("nestedField3", Schema.Type.INT))),
+        createPrimitiveField("field4", Schema.Type.INT));
+    end = createRecord("reorderNestedFields",
+        createPrimitiveField("field1", Schema.Type.INT),
+        createPrimitiveField("field2", Schema.Type.INT),
+        createPrimitiveField("field5", Schema.Type.INT),
+        createArrayField("field3", createRecord("nestedRecord",
+            createPrimitiveField("nestedField2", Schema.Type.INT),
+            createPrimitiveField("nestedField1", Schema.Type.INT),
+            createPrimitiveField("nestedField3", Schema.Type.INT),
+            createPrimitiveField("nestedField4", Schema.Type.INT))),
+        createPrimitiveField("field4", Schema.Type.INT));
+
+    Schema expected = createRecord("reorderNestedFields",
+        createPrimitiveField("field1", Schema.Type.INT),
+        createPrimitiveField("field2", Schema.Type.INT),
+        createArrayField("field3", createRecord("reorderNestedFields.field3",

Review Comment:
   
https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/internal/schema/convert/AvroInternalSchemaConverter.java#L341
 - This logic was not created/updated by me. Do you want me to change it as 
part of this PR?
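
   For reference, a standalone illustration of the qualified name the test's 
`expected` schema uses for the rebuilt nested record (assuming the linked 
converter line qualifies nested records as `<parentRecord>.<fieldName>`; in 
Avro the dot simply splits into a namespace and a name):

   ```java
   import org.apache.avro.Schema;
   import org.apache.avro.SchemaBuilder;

   // Builds a record whose full name matches "reorderNestedFields.field3".
   public class NestedRecordNaming {
     public static void main(String[] args) {
       Schema nested = SchemaBuilder.record("reorderNestedFields.field3")
           .fields()
           .requiredInt("nestedField1")
           .requiredInt("nestedField2")
           .endRecord();
       System.out.println(nested.getFullName());  // reorderNestedFields.field3
       System.out.println(nested.getNamespace()); // reorderNestedFields
     }
   }
   ```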



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Issue with RECORD_INDEX Initialization Falling Back to GLOBAL_SIMPLE [hudi]

2024-05-29 Thread via GitHub


soumilshah1995 closed issue #11350: [SUPPORT] Issue with RECORD_INDEX 
Initialization Falling Back to GLOBAL_SIMPLE 
URL: https://github.com/apache/hudi/issues/11350


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


