Re: [PR] [HUDI-5505] Fix counting of delta commits since last compaction in Sc… [hudi]
a-erofeev commented on PR #11251: URL: https://github.com/apache/hudi/pull/11251#issuecomment-2126173947

@hudi-bot run azure
Re: [PR] [MINOR] LSMTimeline needs to handle the case for tables which have not performed the first archiving yet [hudi]
bvaradar commented on code in PR #11271: URL: https://github.com/apache/hudi/pull/11271#discussion_r1610840996

## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/LSMTimeline.java:

@@ -158,13 +159,18 @@ public static int latestSnapshotVersion(HoodieTableMetaClient metaClient) throws
         LOG.warn("Error reading version file {}", versionFilePath, e);
       }
     }
+    return allSnapshotVersions(metaClient).stream().max(Integer::compareTo).orElse(-1);
   }

   /**
    * Returns all the valid snapshot versions.
    */
   public static List<Integer> allSnapshotVersions(HoodieTableMetaClient metaClient) throws IOException {
+    StoragePath archivedFolderPath = new StoragePath(metaClient.getArchivePath());

Review Comment:
Yes, found while testing a 0.14.x writer with a 1.x reader on a newly created 0.14.x table. LSMTimeline#latestSnapshotVersion was calling LSMTimeline#allSnapshotVersions due to the absence of the latest snapshot version file.
Re: [PR] [HUDI-7783] Fix connection leak in FileSystemBasedLockProvider [hudi]
xuzifu666 commented on code in PR #11267: URL: https://github.com/apache/hudi/pull/11267#discussion_r1610831406

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/FileSystemBasedLockProvider.java:

@@ -96,6 +96,7 @@ public void close() {
     synchronized (LOCK_FILE_NAME) {
       try {
         fs.delete(this.lockFile, true);
+        fs.close();

Review Comment:
close it first
Re: [PR] [HUDI-7783] Fix connection leak in FileSystemBasedLockProvider [hudi]
xuzifu666 closed pull request #11267: [HUDI-7783] Fix connection leak in FileSystemBasedLockProvider URL: https://github.com/apache/hudi/pull/11267
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
danny0405 commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1610825092

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

@@ -540,6 +547,51 @@ private HoodieFunctionalIndexDefinition getFunctionalIndexDefinition(String inde
     }
   }

+  private Set<String> getSecondaryIndexPartitionsToInit() {
+    Set<String> secondaryIndexPartitions = dataMetaClient.getFunctionalIndexMetadata().get().getIndexDefinitions().values().stream()
+        .map(HoodieFunctionalIndexDefinition::getIndexName)
+        .filter(indexName -> indexName.startsWith(HoodieTableMetadataUtil.PARTITION_NAME_SECONDARY_INDEX_PREFIX))

Review Comment:
+1 for `HoodieIndexDefinition`, which is more general.
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
danny0405 commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1610824721

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

@@ -437,9 +448,7 @@ private boolean initializeFromFilesystem(String initializationTime, List
       records = fileGroupCountAndRecordsPair.getValue();
       bulkCommit(commitTimeForPartition, partitionType, records, fileGroupCount);
       metadataMetaClient.reloadActiveTimeline();
-      String partitionPath = partitionType == FUNCTIONAL_INDEX
-          ? dataWriteConfig.getFunctionalIndexConfig().getIndexName()
-          : partitionType.getPartitionPath();
+      String partitionPath = (partitionType == FUNCTIONAL_INDEX || partitionType == SECONDARY_INDEX) ? dataWriteConfig.getFunctionalIndexConfig().getIndexName() : partitionType.getPartitionPath();

Review Comment:
Yeah, probably we can add some method like `MetadataPartitionType.getPartitionPath` to simplify the code.
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
danny0405 commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1610823837

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

@@ -410,6 +413,14 @@ private boolean initializeFromFilesystem(String initializationTime, List
+        Set<String> secondaryIndexPartitionsToInit = getSecondaryIndexPartitionsToInit();
+        if (secondaryIndexPartitionsToInit.isEmpty()) {
+          continue;
+        }
+        ValidationUtils.checkState(secondaryIndexPartitionsToInit.size() == 1, "Only one secondary index at a time is supported for now");

Review Comment:
> but can only create one at a time using sql due to limitation of sql grammar.

Reasonable, but it also looks like we only allow one index for MDT bootstrap; is this expected?
Re: [PR] [HUDI-7783] Fix connection leak in FileSystemBasedLockProvider [hudi]
danny0405 commented on code in PR #11267: URL: https://github.com/apache/hudi/pull/11267#discussion_r1610822698

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/FileSystemBasedLockProvider.java:

@@ -96,6 +96,7 @@ public void close() {
     synchronized (LOCK_FILE_NAME) {
       try {
         fs.delete(this.lockFile, true);
+        fs.close();

Review Comment:
Usually there is no need to close the filesystem, because Hadoop caches and reuses FileSystem instances; if we close this fs instance, it could cause exceptions for other invokers that are still using it.
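The caching behavior referenced here is Hadoop's process-wide FileSystem cache. A minimal sketch of the failure mode, assuming an HDFS URI and a hypothetical lock path (illustrative only, not the PR's code):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object FsCacheSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val uri  = URI.create("hdfs://namenode:8020/")

    // FileSystem.get hands back a process-wide cached instance (unless
    // fs.<scheme>.impl.disable.cache is set), so fs1 and fs2 are the SAME object.
    val fs1 = FileSystem.get(uri, conf)
    val fs2 = FileSystem.get(uri, conf)
    assert(fs1 eq fs2)

    // Closing the shared instance breaks every other holder of it:
    fs1.close()
    // fs2.delete(new Path("/locks/lock"), true)  // would now fail with "Filesystem closed"

    // A component that really must close its handle should own a private one:
    val privateFs = FileSystem.newInstance(uri, conf) // bypasses the cache
    privateFs.close()                                 // safe: nobody else shares it
  }
}
```

`FileSystem.newInstance` is the usual escape hatch when a component genuinely owns its handle and wants to close it.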
Re: [I] [DISCUSSION] Deltastreamer - Reading commit checkpoint from Kafka instead of latest Hoodie commit [hudi]
danny0405 commented on issue #11268: URL: https://github.com/apache/hudi/issues/11268#issuecomment-2125996412

> We reviewed the deltastreamer code and noticed that the deltastreamer can read commits from Kafka consumer groups

If the consumer `offset` is what the Hoodie checkpoint persists, it sounds more reasonable to always pull the offset from the Kafka server, where offsets can be managed in good shape.
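For reference, the broker-side committed offsets of a consumer group can be read directly with the standard `KafkaConsumer.committed` API; a sketch with hypothetical topic, group, and broker names:

```scala
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties()
props.put("bootstrap.servers", "kafka:9092")          // hypothetical broker
props.put("group.id", "deltastreamer-group")          // hypothetical group
props.put("key.deserializer", classOf[StringDeserializer].getName)
props.put("value.deserializer", classOf[StringDeserializer].getName)

val consumer = new KafkaConsumer[String, String](props)
try {
  val partitions = consumer.partitionsFor("input-topic").asScala
    .map(p => new TopicPartition(p.topic(), p.partition())).toSet

  // committed() returns the offsets the group last committed to the broker;
  // a partition with no commit maps to null.
  val committed = consumer.committed(partitions.asJava).asScala
  committed.foreach { case (tp, om) =>
    println(s"$tp -> ${Option(om).map(_.offset()).getOrElse("no committed offset")}")
  }
} finally {
  consumer.close()
}
```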
Re: [PR] [HUDI-7762] Optimizing Hudi Table Check with Delta Lake by Refining Class Name Checks In Spark3.5 [hudi]
danny0405 commented on code in PR #11224: URL: https://github.com/apache/hudi/pull/11224#discussion_r1610819252

## hudi-spark-datasource/hudi-spark3.5.x/src/main/scala/org/apache/spark/sql/adapter/Spark3_5Adapter.scala:

@@ -54,7 +54,7 @@ class Spark3_5Adapter extends BaseSpark3Adapter {
     case plan if !plan.resolved => None
     // NOTE: When resolving Hudi table we allow [[Filter]]s and [[Project]]s be applied
     // on top of it
-    case PhysicalOperation(_, _, DataSourceV2Relation(v2: V2TableWithV1Fallback, _, _, _, _)) if isHoodieTable(v2.v1Table) =>
+    case PhysicalOperation(_, _, DataSourceV2Relation(v2: V2TableWithV1Fallback, _, _, _, _)) if isHoodieTable(v2) =>

Review Comment:
@jonvex can you help with the review?
Re: [PR] [MINOR] LSMTimeline needs to handle the case for tables which have not performed the first archiving yet [hudi]
danny0405 commented on code in PR #11271: URL: https://github.com/apache/hudi/pull/11271#discussion_r1610818043

## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/LSMTimeline.java:

@@ -158,13 +159,18 @@ public static int latestSnapshotVersion(HoodieTableMetaClient metaClient) throws
         LOG.warn("Error reading version file {}", versionFilePath, e);
       }
     }
+    return allSnapshotVersions(metaClient).stream().max(Integer::compareTo).orElse(-1);
   }

   /**
    * Returns all the valid snapshot versions.
    */
   public static List<Integer> allSnapshotVersions(HoodieTableMetaClient metaClient) throws IOException {
+    StoragePath archivedFolderPath = new StoragePath(metaClient.getArchivePath());

Review Comment:
I searched around the invokers: one is the archive path `LSMTimelineWriter#clean` and another is `LSMTimeline#latestSnapshotVersion`; both would ensure the archiving already happened. Is this a case specific to backward compatibility?
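For readers following the thread, the shape of the guard being discussed looks roughly like this; the actual patch body is elided in the diff above, so the listing logic and file-name pattern below are assumptions:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object LsmGuardSketch {
  private val SnapshotVersionFile = raw"(\d+)".r // hypothetical snapshot-file name pattern

  def allSnapshotVersions(archivePath: String, conf: Configuration): List[Int] = {
    val folder = new Path(archivePath)
    val fs = folder.getFileSystem(conf)
    if (!fs.exists(folder)) {
      Nil // table has never archived: no folder yet, so no versions
    } else {
      fs.listStatus(folder).toList
        .map(_.getPath.getName)
        .collect { case SnapshotVersionFile(v) => v.toInt }
    }
  }
}
```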
Re: [PR] [MINOR] LSMTimeline needs to handle the case for tables which have not performed the first archiving yet [hudi]
hudi-bot commented on PR #11271: URL: https://github.com/apache/hudi/pull/11271#issuecomment-2125988423

## CI report:

* 818e29cf9483b5b3a17724a490e5c602f5b408c4 UNKNOWN
[PR] [MINOR] LSMTimeline needs to handle the case for tables which have not performed the first archiving yet [hudi]
bvaradar opened a new pull request, #11271: URL: https://github.com/apache/hudi/pull/11271

### Change Logs

Found during backwards compatibility testing. Sometimes, archiving will not yet have run for a Hudi table. In that case, LSMTimeline must handle this gracefully instead of throwing a FileNotFoundException.

### Impact

Backwards compatibility.

### Risk level (write none, low medium or high below)

None

### Documentation Update

None

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
Re: [I] Exception org.apache.hudi.exception.HoodieIOException: Could not read commit details [hudi]
Jason-liujc commented on issue #6143: URL: https://github.com/apache/hudi/issues/6143#issuecomment-2125944255

This happened for us, and the root cause was concurrently running Spark jobs writing to different partitions of the same insert_overwrite table. Our jobs didn't have proper concurrency-locking configuration. After adding DynamoDB locks for these insert_overwrite write tasks, we stopped seeing these issues.
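A hedged sketch of the lock configuration this refers to: the config keys are Hudi's documented DynamoDB lock settings, while the table name, partition key, region, and paths are placeholders, and `df` is assumed to be an existing DataFrame:

```scala
// Writer options enabling optimistic concurrency control with DynamoDB locks.
df.write.format("hudi")
  .option("hoodie.table.name", "my_table")                                   // placeholder
  .option("hoodie.datasource.write.operation", "insert_overwrite")
  .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
  .option("hoodie.cleaner.policy.failed.writes", "LAZY")
  .option("hoodie.write.lock.provider",
    "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider")
  .option("hoodie.write.lock.dynamodb.table", "hudi-lock-table")             // placeholder
  .option("hoodie.write.lock.dynamodb.partition_key", "my_table")            // one key per Hudi table
  .option("hoodie.write.lock.dynamodb.region", "us-east-1")                  // placeholder
  .mode("append")
  .save("s3://bucket/path/my_table")                                         // placeholder
```

With `hoodie.cleaner.policy.failed.writes=LAZY`, leftovers of failed writers are cleaned lazily rather than eagerly rolled back, which is the usual pairing with optimistic concurrency control.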
(hudi) branch branch-0.x updated: [HUDI-7784] Fix serde of HoodieHadoopConfiguration in Spark (#11270)
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch branch-0.x in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/branch-0.x by this push:
     new 2e39b41be07 [HUDI-7784] Fix serde of HoodieHadoopConfiguration in Spark (#11270)

2e39b41be07 is described below

commit 2e39b41be07d42c0d41fd2cf765732e592954466
Author: Y Ethan Guo
AuthorDate: Wed May 22 15:27:48 2024 -0700

    [HUDI-7784] Fix serde of HoodieHadoopConfiguration in Spark (#11270)
---
 .../apache/spark/HoodieSparkKryoRegistrar.scala    |  6 +-
 .../apache/spark/TestHoodieSparkKryoRegistrar.java | 86 ++
 2 files changed, 89 insertions(+), 3 deletions(-)

diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/HoodieSparkKryoRegistrar.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/HoodieSparkKryoRegistrar.scala
index a8650e5668a..eba3999ea57 100644
--- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/HoodieSparkKryoRegistrar.scala
+++ b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/HoodieSparkKryoRegistrar.scala
@@ -22,7 +22,7 @@ import org.apache.hudi.client.model.HoodieInternalRow
 import org.apache.hudi.common.model.{HoodieKey, HoodieSparkRecord}
 import org.apache.hudi.common.util.HoodieCommonKryoRegistrar
 import org.apache.hudi.config.HoodieWriteConfig
-import org.apache.hudi.storage.StorageConfiguration
+import org.apache.hudi.storage.hadoop.HadoopStorageConfiguration

 import com.esotericsoftware.kryo.io.{Input, Output}
 import com.esotericsoftware.kryo.serializers.JavaSerializer
@@ -64,8 +64,8 @@ class HoodieSparkKryoRegistrar extends HoodieCommonKryoRegistrar with KryoRegist
     // Hadoop's configuration is not a serializable object by itself, and hence
     // we're relying on [[SerializableConfiguration]] wrapper to work it around.
     // We cannot remove this entry; otherwise the ordering is changed.
-    // So we replace it with [[StorageConfiguration]].
-    kryo.register(classOf[StorageConfiguration[_]], new JavaSerializer())
+    // So we replace it with [[HadoopStorageConfiguration]] for Spark.
+    kryo.register(classOf[HadoopStorageConfiguration], new JavaSerializer())
   }

 /**
diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/spark/TestHoodieSparkKryoRegistrar.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/spark/TestHoodieSparkKryoRegistrar.java
new file mode 100644
index 000..4dd297a02b6
--- /dev/null
+++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/spark/TestHoodieSparkKryoRegistrar.java
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark;
+
+import org.apache.hudi.storage.hadoop.HadoopStorageConfiguration;
+
+import com.esotericsoftware.kryo.Kryo;
+import com.esotericsoftware.kryo.io.Input;
+import com.esotericsoftware.kryo.io.Output;
+import org.apache.hadoop.conf.Configuration;
+import org.junit.jupiter.api.Test;
+import org.objenesis.strategy.StdInstantiatorStrategy;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.util.HashMap;
+import java.util.Map;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+/**
+ * Tests {@link HoodieSparkKryoRegistrar}
+ */
+public class TestHoodieSparkKryoRegistrar {
+  @Test
+  public void testSerdeHoodieHadoopConfiguration() {
+    Kryo kryo = newKryo();
+
+    HadoopStorageConfiguration conf = new HadoopStorageConfiguration(new Configuration());
+
+    // Serialize
+    ByteArrayOutputStream baos = new ByteArrayOutputStream();
+    Output output = new Output(baos);
+    kryo.writeObject(output, conf);
+    output.close();
+
+    // Deserialize
+    ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
+    Input input = new Input(bais);
+    HadoopStorageConfiguration deserialized = kryo.readObject(input, HadoopStorageConfiguration.class);
+    input.close();
+
+    // Verify
+    assertEquals(getPropsInMap(conf), getPropsInMap(deserialized));
+  }
+
+  private Kryo newKryo() {
+    Kryo kryo = new
Re: [PR] [HUDI-7784][branch-0.x] Fix serde of HoodieHadoopConfiguration in Spark [hudi]
nsivabalan merged PR #11270: URL: https://github.com/apache/hudi/pull/11270
(hudi) branch master updated: [HUDI-7784] Fix serde of HoodieHadoopConfiguration in Spark (#11269)
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new e1aa1bcb4af [HUDI-7784] Fix serde of HoodieHadoopConfiguration in Spark (#11269)

e1aa1bcb4af is described below

commit e1aa1bcb4af5cbcad9e985d0333eb5275e128e5b
Author: Y Ethan Guo
AuthorDate: Wed May 22 15:27:03 2024 -0700

    [HUDI-7784] Fix serde of HoodieHadoopConfiguration in Spark (#11269)
---
 .../apache/spark/HoodieSparkKryoRegistrar.scala    |  6 +-
 .../apache/spark/TestHoodieSparkKryoRegistrar.java | 86 ++
 2 files changed, 89 insertions(+), 3 deletions(-)

diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/HoodieSparkKryoRegistrar.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/HoodieSparkKryoRegistrar.scala
index a8650e5668a..eba3999ea57 100644
--- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/HoodieSparkKryoRegistrar.scala
+++ b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/HoodieSparkKryoRegistrar.scala
@@ -22,7 +22,7 @@ import org.apache.hudi.client.model.HoodieInternalRow
 import org.apache.hudi.common.model.{HoodieKey, HoodieSparkRecord}
 import org.apache.hudi.common.util.HoodieCommonKryoRegistrar
 import org.apache.hudi.config.HoodieWriteConfig
-import org.apache.hudi.storage.StorageConfiguration
+import org.apache.hudi.storage.hadoop.HadoopStorageConfiguration

 import com.esotericsoftware.kryo.io.{Input, Output}
 import com.esotericsoftware.kryo.serializers.JavaSerializer
@@ -64,8 +64,8 @@ class HoodieSparkKryoRegistrar extends HoodieCommonKryoRegistrar with KryoRegist
     // Hadoop's configuration is not a serializable object by itself, and hence
     // we're relying on [[SerializableConfiguration]] wrapper to work it around.
     // We cannot remove this entry; otherwise the ordering is changed.
-    // So we replace it with [[StorageConfiguration]].
-    kryo.register(classOf[StorageConfiguration[_]], new JavaSerializer())
+    // So we replace it with [[HadoopStorageConfiguration]] for Spark.
+    kryo.register(classOf[HadoopStorageConfiguration], new JavaSerializer())
   }

 /**
diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/spark/TestHoodieSparkKryoRegistrar.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/spark/TestHoodieSparkKryoRegistrar.java
new file mode 100644
index 000..4dd297a02b6
--- /dev/null
+++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/spark/TestHoodieSparkKryoRegistrar.java
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark;
+
+import org.apache.hudi.storage.hadoop.HadoopStorageConfiguration;
+
+import com.esotericsoftware.kryo.Kryo;
+import com.esotericsoftware.kryo.io.Input;
+import com.esotericsoftware.kryo.io.Output;
+import org.apache.hadoop.conf.Configuration;
+import org.junit.jupiter.api.Test;
+import org.objenesis.strategy.StdInstantiatorStrategy;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.util.HashMap;
+import java.util.Map;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+/**
+ * Tests {@link HoodieSparkKryoRegistrar}
+ */
+public class TestHoodieSparkKryoRegistrar {
+  @Test
+  public void testSerdeHoodieHadoopConfiguration() {
+    Kryo kryo = newKryo();
+
+    HadoopStorageConfiguration conf = new HadoopStorageConfiguration(new Configuration());
+
+    // Serialize
+    ByteArrayOutputStream baos = new ByteArrayOutputStream();
+    Output output = new Output(baos);
+    kryo.writeObject(output, conf);
+    output.close();
+
+    // Deserialize
+    ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
+    Input input = new Input(bais);
+    HadoopStorageConfiguration deserialized = kryo.readObject(input, HadoopStorageConfiguration.class);
+    input.close();
+
+    // Verify
+    assertEquals(getPropsInMap(conf), getPropsInMap(deserialized));
+  }
+
+  private Kryo newKryo() {
+    Kryo kryo = new Kryo();
+
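For context, the registrar patched above only takes effect when Spark is configured to use Kryo with it; the typical wiring (standard Spark settings, shown for illustration, not part of the patch) is:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar")
```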
Re: [PR] [HUDI-7784] Fix serde of HoodieHadoopConfiguration in Spark [hudi]
nsivabalan merged PR #11269: URL: https://github.com/apache/hudi/pull/11269
Re: [I] [BUG] hudi cli command with Wrong FS error [hudi]
ehurheap commented on issue #9903: URL: https://github.com/apache/hudi/issues/9903#issuecomment-2125644288

I also ran into this problem when running: `compaction validate --instant 20240516172801913`

The validation itself appears to complete ok, because I see this output:
```
24/05/22 17:58:05 INFO DAGScheduler: Job 0 finished: collect at HoodieSparkEngineContext.java:103, took 297.868011 s
Result of Validation Operation :
24/05/22 17:58:05 INFO MultipartUploadOutputStream: close closed:false s3://bucketname-redacted/tmp/b0fcc58c-da3a-4a74-90f5-4a63bca5ada8.ser
24/05/22 17:58:05 INFO BlockManagerInfo: Removed broadcast_0_piece0 on ip-10-18-146-49.heap:44129 in memory (size: 598.3 KiB, free: 2.2 GiB)
24/05/22 17:58:05 INFO MultipartUploadOutputStream: close closed:true s3://bucketname-redacted/tmp/b0fcc58c-da3a-4a74-90f5-4a63bca5ada8.ser
24/05/22 17:58:05 INFO Javalin: Stopping Javalin ...
24/05/22 17:58:05 INFO Javalin: Javalin has stopped
24/05/22 17:58:05 INFO SparkUI: Stopped Spark web UI at http://ip-10-18-146-49.heap:4040
24/05/22 17:58:05 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
```
Then it continues shutting down Spark, but then:
```
24/05/22 17:58:05 INFO SparkContext: Successfully stopped SparkContext
24/05/22 17:58:05 INFO ShutdownHookManager: Shutdown hook called
24/05/22 17:58:05 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-ab2d5508-e1e0-4f35-af09-d75a368a84b6
24/05/22 17:58:05 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-51ff5e59-d464-4f7a-8ed5-dbd4906ea294
Command failed java.lang.IllegalArgumentException: Wrong FS: s3:/tmp/b0fcc58c-da3a-4a74-90f5-4a63bca5ada8.ser, expected: s3://bucketname-redacted
OperationResult{operation=CompactionOperation{baseInstantTime=.. etc the details of the compaction here}
```
hudi version 0.13.1
hudi-cli version 1.0
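The error itself comes from Hadoop's path check: `s3:/tmp/...` has a scheme but no bucket authority, so it cannot belong to a filesystem bound to `s3://bucketname-redacted`. A minimal sketch of the mismatch (assuming an S3 filesystem implementation is configured):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// A filesystem bound to the bucket...
val fs = FileSystem.get(URI.create("s3://bucketname-redacted/"), new Configuration())

// ...rejects a path whose URI carries no bucket authority (single slash):
fs.open(new Path("s3:/tmp/b0fcc58c-da3a-4a74-90f5-4a63bca5ada8.ser"))
// => java.lang.IllegalArgumentException: Wrong FS: s3:/tmp/..., expected: s3://bucketname-redacted
```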
Re: [PR] [HUDI-7784][branch-0.x] Fix serde of HoodieHadoopConfiguration in Spark [hudi]
hudi-bot commented on PR #11270: URL: https://github.com/apache/hudi/pull/11270#issuecomment-2125580506

## CI report:

* fd1bc98346a08d14b6f110a70827b27e888e77f3 UNKNOWN
Re: [PR] [HUDI-7784] Fix serde of HoodieHadoopConfiguration in Spark [hudi]
hudi-bot commented on PR #11269: URL: https://github.com/apache/hudi/pull/11269#issuecomment-2125492570

## CI report:

* 6ea8a900dbcd6c815270d07e51ed3683360462e5 UNKNOWN
[PR] [HUDI-7784][branch-0.x] Fix serde of HoodieHadoopConfiguration in Spark [hudi]
yihua opened a new pull request, #11270: URL: https://github.com/apache/hudi/pull/11270

### Change Logs

PR for master: https://github.com/apache/hudi/pull/11269. This PR targets `branch-0.x`.

This PR fixes the issue where `HoodieHadoopConfiguration` is not properly (de)serialized by Kryo in Spark. A new test is added to validate the (de)serialization. Before the fix, the test failed with a NullPointerException.

### Impact

Fixes the bug.

### Risk level

None

### Documentation Update

None

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[jira] [Updated] (HUDI-7784) Fix serde of HoodieHadoopConfiguration in Spark
[ https://issues.apache.org/jira/browse/HUDI-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7784:
---------------------------------
    Labels: hoodie-storage pull-request-available  (was: hoodie-storage)

> Fix serde of HoodieHadoopConfiguration in Spark
> ---
>
>                 Key: HUDI-7784
>                 URL: https://issues.apache.org/jira/browse/HUDI-7784
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
>              Labels: hoodie-storage, pull-request-available
>             Fix For: 0.15.0, 1.0.0
>
[PR] [HUDI-7784] Fix serde of HoodieHadoopConfiguration in Spark [hudi]
yihua opened a new pull request, #11269: URL: https://github.com/apache/hudi/pull/11269

### Change Logs

This PR fixes the issue where `HoodieHadoopConfiguration` is not properly (de)serialized by Kryo in Spark. A new test is added to validate the (de)serialization. Before the fix, the test failed with a NullPointerException.

### Impact

Fixes the bug.

### Risk level

None

### Documentation Update

None

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[jira] [Updated] (HUDI-7784) Fix serde of HoodieHadoopConfiguration in Spark
[ https://issues.apache.org/jira/browse/HUDI-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-7784:
----------------------------
    Labels: hoodie-storage  (was: )

> Fix serde of HoodieHadoopConfiguration in Spark
> ---
>
>                 Key: HUDI-7784
>                 URL: https://issues.apache.org/jira/browse/HUDI-7784
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
>              Labels: hoodie-storage
>             Fix For: 0.15.0, 1.0.0
>
[jira] [Updated] (HUDI-7784) Fix serde of HoodieHadoopConfiguration in Spark
[ https://issues.apache.org/jira/browse/HUDI-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-7784:
----------------------------
    Story Points: 2

> Fix serde of HoodieHadoopConfiguration in Spark
> ---
>
>                 Key: HUDI-7784
>                 URL: https://issues.apache.org/jira/browse/HUDI-7784
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Priority: Major
>             Fix For: 0.15.0
>
[jira] [Created] (HUDI-7784) Fix serde of HoodieHadoopConfiguration in Spark
Ethan Guo created HUDI-7784:
---------------------------

             Summary: Fix serde of HoodieHadoopConfiguration in Spark
                 Key: HUDI-7784
                 URL: https://issues.apache.org/jira/browse/HUDI-7784
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Ethan Guo
[jira] [Assigned] (HUDI-7784) Fix serde of HoodieHadoopConfiguration in Spark
[ https://issues.apache.org/jira/browse/HUDI-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo reassigned HUDI-7784:
-------------------------------
    Assignee: Ethan Guo

> Fix serde of HoodieHadoopConfiguration in Spark
> ---
>
>                 Key: HUDI-7784
>                 URL: https://issues.apache.org/jira/browse/HUDI-7784
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
>             Fix For: 0.15.0
>
[jira] [Updated] (HUDI-7784) Fix serde of HoodieHadoopConfiguration in Spark
[ https://issues.apache.org/jira/browse/HUDI-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-7784:
----------------------------
    Fix Version/s: 0.15.0

> Fix serde of HoodieHadoopConfiguration in Spark
> ---
>
>                 Key: HUDI-7784
>                 URL: https://issues.apache.org/jira/browse/HUDI-7784
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Priority: Major
>             Fix For: 0.15.0
>
[jira] [Updated] (HUDI-7784) Fix serde of HoodieHadoopConfiguration in Spark
[ https://issues.apache.org/jira/browse/HUDI-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-7784:
----------------------------
    Fix Version/s: 1.0.0

> Fix serde of HoodieHadoopConfiguration in Spark
> ---
>
>                 Key: HUDI-7784
>                 URL: https://issues.apache.org/jira/browse/HUDI-7784
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
>             Fix For: 0.15.0, 1.0.0
>
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
codope commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1610397009

## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:

@@ -783,4 +784,101 @@ public int getNumFileGroupsForPartition(MetadataPartitionType partition) {
     metadataFileSystemView, partition.getPartitionPath()));
     return partitionFileSliceMap.get(partition.getPartitionPath()).size();
   }
+
+  @Override
+  protected Map getSecondaryKeysForRecordKeys(List recordKeys, String partitionName) {
+    if (recordKeys.isEmpty()) {
+      return Collections.emptyMap();
+    }
+
+    // Load the file slices for the partition. Each file slice is a shard which saves a portion of the keys.
+    List partitionFileSlices = partitionFileSliceMap.computeIfAbsent(partitionName,
+        k -> HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices(metadataMetaClient, metadataFileSystemView, partitionName));
+    final int numFileSlices = partitionFileSlices.size();
+    ValidationUtils.checkState(numFileSlices > 0, "Number of file slices for partition " + partitionName + " should be > 0");
+
+    // Lookup keys from each file slice
+    // TODO: parallelize this loop
+    Map reverseSecondaryKeyMap = new HashMap<>();
+    for (FileSlice partition : partitionFileSlices) {
+      reverseLookupSecondaryKeys(partitionName, recordKeys, partition, reverseSecondaryKeyMap);
+    }
+
+    return reverseSecondaryKeyMap;
+  }
+
+  private void reverseLookupSecondaryKeys(String partitionName, List recordKeys, FileSlice fileSlice, Map recordKeyMap) {
+    Set keySet = new HashSet<>(recordKeys.size());
+    Map> logRecordsMap = new HashMap<>();

Review Comment:
The IDE suggests some duplicate blocks in previously written code, but none in the new code. However, I think you are talking about a few lines below, where we get the readers and return early if they are null. I will refactor and reuse as much as possible.
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
codope commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1610379136

## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java:

@@ -295,6 +311,11 @@ protected HoodieMetadataPayload(String key, int type,
     this.recordIndexMetadata = recordIndexMetadata;
   }

+  public HoodieMetadataPayload(String key, HoodieSecondaryIndexInfo secondaryIndexMetadata) {

Review Comment:
Should be private. Refactored as suggested.
Re: [I] [SUPPORT] throw "java.lang.NoSuchMethodError: org.apache.hadoop.hdfs.client.HdfsDataInputStream.getReadStatistics()" [hudi]
dinhphu2k1-gif commented on issue #1: URL: https://github.com/apache/hudi/issues/1#issuecomment-2125370032

@kon-si when I build the HBase jar, where should I put it for the Hudi build profile?
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
codope commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1610368740

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

@@ -540,6 +547,51 @@ private HoodieFunctionalIndexDefinition getFunctionalIndexDefinition(String inde
     }
   }

+  private Set<String> getSecondaryIndexPartitionsToInit() {
+    Set<String> secondaryIndexPartitions = dataMetaClient.getFunctionalIndexMetadata().get().getIndexDefinitions().values().stream()
+        .map(HoodieFunctionalIndexDefinition::getIndexName)
+        .filter(indexName -> indexName.startsWith(HoodieTableMetadataUtil.PARTITION_NAME_SECONDARY_INDEX_PREFIX))
+        .collect(Collectors.toSet());
+    Set<String> completedMetadataPartitions = dataMetaClient.getTableConfig().getMetadataPartitions();
+    secondaryIndexPartitions.removeAll(completedMetadataPartitions);
+    return secondaryIndexPartitions;
+  }
+
+  private Pair> initializeSecondaryIndexPartition(String indexName) throws IOException {
+    HoodieFunctionalIndexDefinition indexDefinition = getFunctionalIndexDefinition(indexName);
+    ValidationUtils.checkState(indexDefinition != null, "Secondary Index definition is not present for index " + indexName);
+    List> partitionFileSlicePairs = getPartitionFileSlicePairs();
+
+    // Reuse record index parallelism config to build secondary index
+    int parallelism = Math.min(partitionFileSlicePairs.size(), dataWriteConfig.getMetadataConfig().getRecordIndexMaxParallelism());
+    HoodieData records = readSecondaryKeysFromFileSlices(
+        engineContext,
+        partitionFileSlicePairs,
+        parallelism,
+        this.getClass().getSimpleName(),
+        dataMetaClient,
+        EngineType.SPARK,

Review Comment:
Makes sense. I have added abstract methods which are implemented by the engine-specific metadata writer.
Re: [I] [SUPPORT] throw "java.lang.NoSuchMethodError: org.apache.hadoop.hdfs.client.HdfsDataInputStream.getReadStatistics()" [hudi]
kon-si commented on issue #1: URL: https://github.com/apache/hudi/issues/1#issuecomment-2125323910

> @kon-si how to package Hudi again? Can you help me?

There are instructions on how to do it in the README of the hudi project.
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
codope commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1610329349

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

@@ -437,9 +448,7 @@ private boolean initializeFromFilesystem(String initializationTime, List
       records = fileGroupCountAndRecordsPair.getValue();
       bulkCommit(commitTimeForPartition, partitionType, records, fileGroupCount);
       metadataMetaClient.reloadActiveTimeline();
-      String partitionPath = partitionType == FUNCTIONAL_INDEX
-          ? dataWriteConfig.getFunctionalIndexConfig().getIndexName()
-          : partitionType.getPartitionPath();
+      String partitionPath = (partitionType == FUNCTIONAL_INDEX || partitionType == SECONDARY_INDEX) ? dataWriteConfig.getFunctionalIndexConfig().getIndexName() : partitionType.getPartitionPath();

Review Comment:
We will have to supply the indexing config to `MetadataPartitionType#getPartitionPath` to do so, because unlike the functional or secondary index, all other metadata indexes have the same name and partition path. Functional or secondary index names are defined by the user, and that is why we need to fetch them from the functional index config.
Re: [I] [SUPPORT] HiveSyncTool failure - Unable to create a `_ro` table when writing data [hudi]
shubhamn21 commented on issue #11254: URL: https://github.com/apache/hudi/issues/11254#issuecomment-2125294694

I think it may have something to do with AWS Glue compatibility. The [documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html#aws-glue-programming-etl-format-hudi-write) says that it only supports up to Hudi 0.12.1. As a workaround, I am using `.save` instead of `.saveAsTable`. I am not able to sync with Glue/Hive, but I am able to ingest data and query with spark-sql.
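A sketch of that workaround, with illustrative paths and options (`hoodie.datasource.hive_sync.enable=false` skips the Glue/Hive sync that catalog-based writes would attempt; `df` and `spark` are assumed to exist):

```scala
// Path-based write: no catalog entry is created or synced.
df.write.format("hudi")
  .option("hoodie.table.name", "my_table")                 // placeholder
  .option("hoodie.datasource.hive_sync.enable", "false")   // skip Glue/Hive sync
  .mode("append")
  .save("s3://bucket/warehouse/my_table")                  // placeholder

// Query it back by path with Spark SQL, no catalog needed:
spark.read.format("hudi")
  .load("s3://bucket/warehouse/my_table")
  .createOrReplaceTempView("my_table_v")
spark.sql("select count(*) from my_table_v").show()
```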
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
codope commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1610294310

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

@@ -540,6 +547,51 @@ private HoodieFunctionalIndexDefinition getFunctionalIndexDefinition(String inde
     }
   }

+  private Set<String> getSecondaryIndexPartitionsToInit() {
+    Set<String> secondaryIndexPartitions = dataMetaClient.getFunctionalIndexMetadata().get().getIndexDefinitions().values().stream()
+        .map(HoodieFunctionalIndexDefinition::getIndexName)
+        .filter(indexName -> indexName.startsWith(HoodieTableMetadataUtil.PARTITION_NAME_SECONDARY_INDEX_PREFIX))

Review Comment:
Actually, I should rename `HoodieFunctionalIndexDefinition` to `HoodieIndexDefinition`, as it will contain the definition for both the functional index and the secondary index, and any future record level index.
Re: [I] [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input [hudi]
soumilshah1995 closed issue #11258: [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input URL: https://github.com/apache/hudi/issues/11258
Re: [I] [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input [hudi]
soumilshah1995 commented on issue #11258: URL: https://github.com/apache/hudi/issues/11258#issuecomment-2125096672

Thanks man
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
codope commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1610214658

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

@@ -410,6 +413,14 @@ private boolean initializeFromFilesystem(String initializationTime, List
+        Set<String> secondaryIndexPartitionsToInit = getSecondaryIndexPartitionsToInit();
+        if (secondaryIndexPartitionsToInit.isEmpty()) {
+          continue;
+        }
+        ValidationUtils.checkState(secondaryIndexPartitionsToInit.size() == 1, "Only one secondary index at a time is supported for now");

Review Comment:
We can create multiple secondary indexes using configs, but can only create one at a time using SQL due to a limitation of the SQL grammar. To keep the user experience consistent, I have added this validation. This will be removed later, after we enhance the SQL grammar and parser to support creating multiple indexes at a time.
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
hudi-bot commented on PR #11043: URL: https://github.com/apache/hudi/pull/11043#issuecomment-2125080635

## CI report:

* 2915ad105601b17cf6915c299c13e9933032a10b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23489)
* 541b544049e68b3d22cdf0f5159fbd9b0005d345 UNKNOWN
* 0ca4a414be1438f009aa0b32e3f2b51904b47123 UNKNOWN
Re: [I] [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input [hudi]
soumilshah1995 commented on issue #11258: URL: https://github.com/apache/hudi/issues/11258#issuecomment-2125075068

Really? Let me try.
Re: [I] issue with reading the data using hudi streamer [hudi]
Pavan792reddy commented on issue #11263: URL: https://github.com/apache/hudi/issues/11263#issuecomment-2125069736

The message id is being generated from the Pulsar topic, but it comes through as `__messageId`, not `_MessageId_`.
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
codope commented on PR #11043: URL: https://github.com/apache/hudi/pull/11043#issuecomment-2125053288

@KnightChess Thanks for your review. I have addressed your comments. Can you please take a look again?
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
codope commented on PR #11043: URL: https://github.com/apache/hudi/pull/11043#issuecomment-2125051546

> Do we need to use parallel stream to improve efficiency in `allFiles.map`?

I tried, but it seems like the usual `scala.collection.parallel` does not work with Scala 2.13 (which is used with Spark 3.5). To make it work for all Scala versions, I'll have to introduce a [new dependency](https://github.com/scala/scala-parallel-collections?tab=readme-ov-file#cross-building-dependency). I can do it in a follow-up PR, if necessary. However, I don't see other index types using parallel collections, plus I have also optimized the code as per your other suggestions (using pruned files instead of all files).
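For context, the cross-building issue boils down to `.par` having moved out of the standard library in Scala 2.13; a sketch, where `allFiles` and `lookupBloomFilter` are assumed stand-ins for the PR's code:

```scala
// Scala 2.13: this import comes from the separate scala-parallel-collections
// module, so it needs an extra dependency to compile on Spark 3.5.
import scala.collection.parallel.CollectionConverters._

val fileToBloomFilterMap = allFiles.par            // allFiles: assumed Seq[StoragePath]
  .map(file => file -> lookupBloomFilter(file))    // assumed helper hitting the metadata table
  .toMap.seq                                       // back to a sequential Map
```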
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
codope commented on code in PR #11043: URL: https://github.com/apache/hudi/pull/11043#discussion_r1610168033

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BloomFiltersIndexSupport.scala:

@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.RecordLevelIndexSupport.filterQueryWithRecordKey
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.hudi.metadata.HoodieTableMetadataUtil
+import org.apache.hudi.storage.StoragePath
+import org.apache.hudi.util.JFunction
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.expressions.Expression
+
+class BloomFiltersIndexSupport(spark: SparkSession,
+                               metadataConfig: HoodieMetadataConfig,
+                               metaClient: HoodieTableMetaClient) extends RecordLevelIndexSupport(spark, metadataConfig, metaClient) {
+
+  override def getCandidateFiles(allFiles: Seq[StoragePath], recordKeys: List[String]): Set[String] = {
+    val fileToBloomFilterMap = allFiles.map { file =>

Review Comment:
You're right. I followed the record index approach, but I have corrected it now.
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
codope commented on code in PR #11043: URL: https://github.com/apache/hudi/pull/11043#discussion_r1610167078

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BloomFiltersIndexSupport.scala:

@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.RecordLevelIndexSupport.filterQueryWithRecordKey
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.hudi.metadata.HoodieTableMetadataUtil
+import org.apache.hudi.storage.StoragePath
+import org.apache.hudi.util.JFunction
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.expressions.Expression
+
+class BloomFiltersIndexSupport(spark: SparkSession,
+                               metadataConfig: HoodieMetadataConfig,
+                               metaClient: HoodieTableMetaClient) extends RecordLevelIndexSupport(spark, metadataConfig, metaClient) {
+
+  override def getCandidateFiles(allFiles: Seq[StoragePath], recordKeys: List[String]): Set[String] = {
+    val fileToBloomFilterMap = allFiles.map { file =>
+      val relativePartitionPath = FSUtils.getRelativePartitionPath(metaClient.getBasePathV2, file)
+      val fileName = FSUtils.getFileName(file.getName, relativePartitionPath)
+      val bloomFilter = metadataTable.getBloomFilter(relativePartitionPath, fileName)
+      file -> bloomFilter
+    }.toMap
+
+    recordKeys.flatMap { recordKey =>
+      fileToBloomFilterMap.filter { case (_, bloomFilter) =>
+        // If bloom filter is empty, we assume conservatively that the file might contain the record key
+        bloomFilter.isEmpty || bloomFilter.get.mightContain(recordKey)
+      }.keys
+    }.map(_.getName).toSet
+  }
+
+  override def isIndexAvailable: Boolean = {
+    metadataConfig.isEnabled && metaClient.getTableConfig.getMetadataPartitions.contains(HoodieTableMetadataUtil.PARTITION_NAME_BLOOM_FILTERS)
+  }
+
+  override def filterQueriesWithRecordKey(queryFilters: Seq[Expression]): (List[Expression], List[String]) = {
+    var recordKeyQueries: List[Expression] = List.empty
+    var recordKeys: List[String] = List.empty
+    for (query <- queryFilters) {
+      val recordKeyOpt = getRecordKeyConfig

Review Comment:
Done.
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
codope commented on code in PR #11043: URL: https://github.com/apache/hudi/pull/11043#discussion_r1610166499 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BloomFiltersIndexSupport.scala: ## @@ -0,0 +1,87 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi + +import org.apache.hudi.RecordLevelIndexSupport.filterQueryWithRecordKey +import org.apache.hudi.common.config.HoodieMetadataConfig +import org.apache.hudi.common.fs.FSUtils +import org.apache.hudi.common.table.HoodieTableMetaClient +import org.apache.hudi.metadata.HoodieTableMetadataUtil +import org.apache.hudi.storage.StoragePath +import org.apache.hudi.util.JFunction +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.expressions.Expression + +class BloomFiltersIndexSupport(spark: SparkSession, + metadataConfig: HoodieMetadataConfig, + metaClient: HoodieTableMetaClient) extends RecordLevelIndexSupport(spark, metadataConfig, metaClient) { + + override def getCandidateFiles(allFiles: Seq[StoragePath], recordKeys: List[String]): Set[String] = { +val fileToBloomFilterMap = allFiles.map { file => + val relativePartitionPath = FSUtils.getRelativePartitionPath(metaClient.getBasePathV2, file) + val fileName = FSUtils.getFileName(file.getName, relativePartitionPath) + val bloomFilter = metadataTable.getBloomFilter(relativePartitionPath, fileName) + file -> bloomFilter +}.toMap + +recordKeys.flatMap { recordKey => + fileToBloomFilterMap.filter { case (_, bloomFilter) => Review Comment: great suggestion! I have changed the code accordingly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
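For readers following the review thread above, here is a minimal Scala sketch of the inverted loop the reviewer appears to be suggesting: each file's bloom filter is fetched and consulted once against all record keys, instead of filtering the whole map per key. Names such as `FSUtils`, `metaClient` and `metadataTable.getBloomFilter` follow the quoted snippet; the exact restructuring that was merged may differ.
```scala
// Sketch only: prune candidate files by consulting each bloom filter once.
override def getCandidateFiles(allFiles: Seq[StoragePath], recordKeys: List[String]): Set[String] = {
  allFiles.filter { file =>
    val relativePartitionPath = FSUtils.getRelativePartitionPath(metaClient.getBasePathV2, file)
    val fileName = FSUtils.getFileName(file.getName, relativePartitionPath)
    val bloomFilter = metadataTable.getBloomFilter(relativePartitionPath, fileName)
    // Conservative fallback: a missing bloom filter means the file may contain any key.
    bloomFilter.isEmpty || recordKeys.exists(key => bloomFilter.get.mightContain(key))
  }.map(_.getName).toSet
}
```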
Re: [I] [SUPPORT] throw "java.lang.NoSuchMethodError: org.apache.hadoop.hdfs.client.HdfsDataInputStream.getReadStatistics() [hudi]
dinhphu2k1-gif commented on issue #1: URL: https://github.com/apache/hudi/issues/1#issuecomment-2124992604 @kon-si how do I package Hudi again? Can you help me? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
hudi-bot commented on PR #11043: URL: https://github.com/apache/hudi/pull/11043#issuecomment-2124955598 ## CI report: * 2915ad105601b17cf6915c299c13e9933032a10b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23489) * 541b544049e68b3d22cdf0f5159fbd9b0005d345 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] issue with reading the data using hudi streamer [hudi]
Pavan792reddy commented on issue #11263: URL: https://github.com/apache/hudi/issues/11263#issuecomment-2124908753 @ad1happy2go I have made all the changes and it was working as expected. Now the script is failing with the error below -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [DISCUSSION] Deltastreamer - Reading commit checkpoint from Kafka instead of latest Hoodie commit [hudi]
KishanFairmatic opened a new issue, #11268: URL: https://github.com/apache/hudi/issues/11268 Our requirement is actually this: Supporting multiple deltastreamers writing to a single hudi table [https://github.com/apache/hudi/issues/6718](https://github.com/apache/hudi/issues/6718) [HUDI-5077](https://issues.apache.org/jira/browse/HUDI-5077) We reviewed the deltastreamer code and noticed that the deltastreamer can read commits from Kafka consumer groups if it doesn't find the last checkpoint from Hoodie. We're considering modifying the code to always read commits from Kafka, based on a new flag, by making changes in the Hudi deltastreamer. Do you foresee any nuances or issues with this approach? **Environment Description** * Hudi version : 0.13.0 * Spark version : 3.3.2 * Storage (HDFS/S3/GCS..) : S3 * Running on Docker? (yes/no) : no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
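For illustration, a minimal Scala sketch of the checkpoint-resolution order the issue proposes. The property name and both helper parameters below are hypothetical and do not exist in Hudi; they only make the proposed flag-gated behavior concrete.
```scala
import java.util.Properties

// Hypothetical flag and helpers -- illustrative only, not actual Hudi code.
def resolveCheckpoint(props: Properties,
                      hoodieCommitCheckpoint: Option[String],
                      kafkaGroupOffsets: () => Option[String]): Option[String] = {
  val forceKafka =
    props.getProperty("hoodie.deltastreamer.source.kafka.checkpoint.force", "false").toBoolean
  if (forceKafka) {
    kafkaGroupOffsets() // always trust the consumer group's committed offsets
  } else {
    hoodieCommitCheckpoint.orElse(kafkaGroupOffsets()) // current behavior: Hoodie commit first
  }
}
```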
Re: [PR] [HUDI-7783] Fix connection leak in FileSystemBasedLockProvider [hudi]
hudi-bot commented on PR #11267: URL: https://github.com/apache/hudi/pull/11267#issuecomment-2124564766 ## CI report: * 7f32187c42483782e078d1094a639af1fdcc0055 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7783) Fix connection leak in FileSystemBasedLockProvider
[ https://issues.apache.org/jira/browse/HUDI-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7783: - Labels: pull-request-available (was: ) > Fix connection leak in FileSystemBasedLockProvider > -- > > Key: HUDI-7783 > URL: https://issues.apache.org/jira/browse/HUDI-7783 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: xy >Assignee: xy >Priority: Major > Labels: pull-request-available > > Fix connection leak in FileSystemBasedLockProvider -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7783] Fix connection leak in FileSystemBasedLockProvider [hudi]
xuzifu666 opened a new pull request, #11267: URL: https://github.com/apache/hudi/pull/11267 ### Change Logs If fs.hdfs.impl.disable.cache is true, FileSystemBasedLockProvider's memory usage keeps growing until OOM because the FileSystem cache becomes too large, so we should close the fs in the close method to avoid the memory leak. ### Impact none ### Risk level (write none, low medium or high below) none ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none"._ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
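A minimal sketch of the fix described above, written in Scala for brevity (the real class is Java) and assuming the `fs` and `lockFile` fields of FileSystemBasedLockProvider. Since the PR was ultimately closed, treat this as an illustration of the idea rather than the merged change.
```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// With fs.hdfs.impl.disable.cache=true each provider holds its own FileSystem
// instance, so close() must release it along with the lock file to avoid
// accumulating instances until OOM.
class LockProviderSketch(fs: FileSystem, lockFile: Path) extends AutoCloseable {
  override def close(): Unit = {
    try {
      fs.delete(lockFile, true) // remove the lock marker first
    } finally {
      fs.close() // then release the uncached FileSystem handle
    }
  }
}
```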
[jira] [Created] (HUDI-7783) Fix connection leak in FileSystemBasedLockProvider
xy created HUDI-7783: Summary: Fix connection leak in FileSystemBasedLockProvider Key: HUDI-7783 URL: https://issues.apache.org/jira/browse/HUDI-7783 Project: Apache Hudi Issue Type: Improvement Components: core Reporter: xy Assignee: xy Fix connection leak in FileSystemBasedLockProvider -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-7762] Optimizing Hudi Table Check with Delta Lake by Refining Class Name Checks In Spark3.5 [hudi]
leesf commented on PR #11224: URL: https://github.com/apache/hudi/pull/11224#issuecomment-2124474355 > > When executed on a Delta table, this may result in an error. > > What action are we executing here? like `INSERT OVERWRITE delta.`/tmp/delta-table` SELECT col1 as id FROM VALUES 5,6,7,8,9;` in https://docs.delta.io/latest/quick-start.html We internally use HoodieCatalog to handle Delta tables as well as other table types, but Hudi (hudi-spark-datasource/hudi-spark3.5.x/src/main/scala/org/apache/spark/sql/adapter/Spark3_5Adapter.scala) calls v1Table even when the table is a Delta table, and Delta then throws an exception; v1Table should not be called when the table is not a Hudi table. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
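A small Scala sketch of the refinement being described: gate the v1Table call on a provider check so non-Hudi tables (e.g. Delta) are left to their own catalog resolution. The helper below is an assumption for illustration, not the actual Spark3_5Adapter code.
```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Only treat a table as Hudi when its provider says so; callers would
// delegate resolution whenever this returns false, never calling v1Table.
def isHoodieTable(table: CatalogTable): Boolean =
  table.provider.exists(_.equalsIgnoreCase("hudi"))
```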
[jira] [Closed] (HUDI-7781) Filter wrong partitions when using hoodie.datasource.write.partitions.to.delete
[ https://issues.apache.org/jira/browse/HUDI-7781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-7781. Resolution: Fixed Fixed via master branch:c7d2fc05fd7f285abd36c561217bf67de4e0479f > Filter wrong partitions when using > hoodie.datasource.write.partitions.to.delete > --- > > Key: HUDI-7781 > URL: https://issues.apache.org/jira/browse/HUDI-7781 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Xinyu Zou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7781) Filter wrong partitions when using hoodie.datasource.write.partitions.to.delete
[ https://issues.apache.org/jira/browse/HUDI-7781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7781: - Fix Version/s: 1.0.0 > Filter wrong partitions when using > hoodie.datasource.write.partitions.to.delete > --- > > Key: HUDI-7781 > URL: https://issues.apache.org/jira/browse/HUDI-7781 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Xinyu Zou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
(hudi) branch master updated: [HUDI-7781] Filter wrong partitions when using hoodie.datasource.write.partitions.to.delete (#11260)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new c7d2fc05fd7 [HUDI-7781] Filter wrong partitions when using hoodie.datasource.write.partitions.to.delete (#11260) c7d2fc05fd7 is described below commit c7d2fc05fd7f285abd36c561217bf67de4e0479f Author: Zouxxyy AuthorDate: Wed May 22 17:51:34 2024 +0800 [HUDI-7781] Filter wrong partitions when using hoodie.datasource.write.partitions.to.delete (#11260) --- .../org/apache/hudi/HoodieSparkSqlWriter.scala | 10 ++--- .../org/apache/hudi/TestHoodieSparkSqlWriter.scala | 24 +- 2 files changed, 30 insertions(+), 4 deletions(-) diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala index 3c28b1a2e0a..87418764dea 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala +++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala @@ -573,10 +573,14 @@ class HoodieSparkSqlWriterInternal { //note:spark-sql may url-encode special characters (* -> %2A) var (wildcardPartitions, fullPartitions) = partitions.partition(partition => partition.matches(".*(\\*|%2A).*")) +val allPartitions = FSUtils.getAllPartitionPaths(new HoodieSparkEngineContext(jsc): HoodieEngineContext, + HoodieMetadataConfig.newBuilder().fromProperties(cfg.getProps).build(), basePath) + +if (fullPartitions.nonEmpty) { + fullPartitions = fullPartitions.filter(partition => allPartitions.contains(partition)) +} + if (wildcardPartitions.nonEmpty) { - //get list of all partitions - val allPartitions = FSUtils.getAllPartitionPaths(new HoodieSparkEngineContext(jsc): HoodieEngineContext, - HoodieMetadataConfig.newBuilder().fromProperties(cfg.getProps).build(), basePath) //go through list of partitions with wildcards and add all partitions that match to val fullPartitions wildcardPartitions.foreach(partition => { //turn wildcard into regex diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala index 911351f4013..d8a6c9379a3 100644 --- a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala +++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala @@ -18,8 +18,9 @@ package org.apache.hudi import org.apache.hudi.client.SparkRDDWriteClient -import org.apache.hudi.common.model.{HoodieFileFormat, HoodieRecord, HoodieRecordPayload, HoodieTableType, WriteOperationType} +import org.apache.hudi.common.model.{HoodieFileFormat, HoodieRecord, HoodieRecordPayload, HoodieReplaceCommitMetadata, HoodieTableType, WriteOperationType} import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient, TableSchemaResolver} +import org.apache.hudi.common.table.timeline.TimelineUtils import org.apache.hudi.common.testutils.HoodieTestDataGenerator import org.apache.hudi.config.{HoodieBootstrapConfig, HoodieIndexConfig, HoodieWriteConfig} import org.apache.hudi.exception.{HoodieException, SchemaCompatibilityException} @@ -871,6 +872,27 @@ def testBulkInsertForDropPartitionColumn(): Unit = { }).count()) } + @Test + def testDeletePartitionsWithWrongPartition(): Unit = { +var (_, fooTableModifier) = deletePartitionSetup() +fooTableModifier = fooTableModifier + .updated(DataSourceWriteOptions.PARTITIONS_TO_DELETE.key(), "2016/03/15" + "," + "2025/03") + .updated(DataSourceWriteOptions.OPERATION.key(), WriteOperationType.DELETE_PARTITION.name()) +HoodieSparkSqlWriter.write(sqlContext, SaveMode.Append, fooTableModifier, spark.emptyDataFrame) +val snapshotDF3 = spark.read.format("org.apache.hudi").load(tempBasePath) +assertEquals(0, snapshotDF3.filter(entry => { + val partitionPath = entry.getString(3) + Seq("2015/03/16", "2015/03/17").count(p => partitionPath.equals(p)) != 1 +}).count()) + +val activeTimeline = createMetaClient(spark, tempBasePath).getActiveTimeline +val metadata = TimelineUtils.getCommitMetadata(activeTimeline.lastInstant().get(), activeTimeline) + .asInstanceOf[HoodieReplaceCommitMetadata] + assertTrue(metadata.getOperationType.equals(WriteOperationType.DELETE_PARTITION)) +// "2025/03" should not be in partitionToReplaceFileIds +assertEquals(Collections.singleton("2016/03/15"), metadata.getPartitionToReplaceFileIds.keySet()) + } + /** * Test case for non partition table with metatable support.
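A usage sketch based on the test in the commit above, assuming an existing Hudi table at `basePath` with a `2016/03/15` partition (other required write options such as the table name are omitted for brevity). With this change, the non-existent `2025/03` entry is filtered out instead of being recorded in the replace-commit metadata.
```scala
import org.apache.spark.sql.SaveMode

// Delete-partition write via the datasource; option keys follow the merged test.
spark.emptyDataFrame.write.format("hudi")
  .option("hoodie.datasource.write.operation", "delete_partition")
  .option("hoodie.datasource.write.partitions.to.delete", "2016/03/15,2025/03")
  .mode(SaveMode.Append)
  .save(basePath)
```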
Re: [PR] [HUDI-7781] Filter wrong partitions when using hoodie.datasource.write.partitions.to.delete [hudi]
danny0405 merged PR #11260: URL: https://github.com/apache/hudi/pull/11260 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
danny0405 commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1609596497 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -540,6 +547,51 @@ private HoodieFunctionalIndexDefinition getFunctionalIndexDefinition(String inde } } + private Set getSecondaryIndexPartitionsToInit() { +Set secondaryIndexPartitions = dataMetaClient.getFunctionalIndexMetadata().get().getIndexDefinitions().values().stream() +.map(HoodieFunctionalIndexDefinition::getIndexName) +.filter(indexName -> indexName.startsWith(HoodieTableMetadataUtil.PARTITION_NAME_SECONDARY_INDEX_PREFIX)) Review Comment: So secondary index belongs to functional index? ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -410,6 +413,14 @@ private boolean initializeFromFilesystem(String initializationTime, List secondaryIndexPartitionsToInit = getSecondaryIndexPartitionsToInit(); +if (secondaryIndexPartitionsToInit.isEmpty()) { + continue; +} +ValidationUtils.checkState(secondaryIndexPartitionsToInit.size() == 1, "Only one secondary index at a time is supported for now"); Review Comment: Not sure why we have the number limit. ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFunctionalIndexDefinition.java: ## @@ -52,7 +55,7 @@ public HoodieFunctionalIndexDefinition(String indexName, String indexType, Strin Map indexOptions) { this.indexName = indexName; this.indexType = indexType; -this.indexFunction = indexFunction; +this.indexFunction = nonEmpty(indexFunction) ? indexFunction : SPARK_IDENTITY; Review Comment: Can you elaborate the semantics for empty index function? ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java: ## @@ -783,4 +784,101 @@ public int getNumFileGroupsForPartition(MetadataPartitionType partition) { metadataFileSystemView, partition.getPartitionPath())); return partitionFileSliceMap.get(partition.getPartitionPath()).size(); } + + @Override + protected Map getSecondaryKeysForRecordKeys(List recordKeys, String partitionName) { +if (recordKeys.isEmpty()) { + return Collections.emptyMap(); +} + +// Load the file slices for the partition. Each file slice is a shard which saves a portion of the keys. +List partitionFileSlices = partitionFileSliceMap.computeIfAbsent(partitionName, +k -> HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices(metadataMetaClient, metadataFileSystemView, partitionName)); +final int numFileSlices = partitionFileSlices.size(); +ValidationUtils.checkState(numFileSlices > 0, "Number of file slices for partition " + partitionName + " should be > 0"); Review Comment: Can we just return empty instead of throwing? ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadata.java: ## @@ -111,6 +111,14 @@ static boolean isMetadataTable(String basePath) { return basePath.endsWith(HoodieTableMetaClient.METADATA_TABLE_FOLDER_PATH); } + static boolean isMetadataTableSecondaryIndexPartition(String basePath, Option partitionName) { +if (!isMetadataTable(basePath) || !partitionName.isPresent()) { + return false; +} + +return partitionName.get().startsWith("secondary_index_"); Review Comment: It would be great if we can avoid the hard-coded string.
## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -977,16 +1014,17 @@ public void updateFromWriteStatuses(HoodieCommitMetadata commitMetadata, HoodieD Map> partitionToRecordMap = HoodieTableMetadataUtil.convertMetadataToRecords( engineContext, dataWriteConfig, commitMetadata, instantTime, dataMetaClient, - enabledPartitionTypes, dataWriteConfig.getBloomFilterType(), - dataWriteConfig.getBloomIndexParallelism(), dataWriteConfig.isMetadataColumnStatsIndexEnabled(), - dataWriteConfig.getColumnStatsIndexParallelism(), dataWriteConfig.getColumnsEnabledForColumnStatsIndex(), dataWriteConfig.getMetadataConfig()); + enabledPartitionTypes, dataWriteConfig.getBloomFilterType(), + dataWriteConfig.getBloomIndexParallelism(), dataWriteConfig.isMetadataColumnStatsIndexEnabled(), + dataWriteConfig.getColumnStatsIndexParallelism(), dataWriteConfig.getColumnsEnabledForColumnStatsIndex(), dataWriteConfig.getMetadataConfig()); // Updates for record index are created by parsing the WriteStatus which is a hudi-client object. Hence, we cannot yet move this code // to the HoodieTableMetadataUtil class in
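The quoted javadoc in this thread ("each file slice is a shard which saves a portion of the keys") implies hash-based routing of keys to file slices. A small Scala sketch of that idea follows; the modulo scheme is an assumption for illustration, not Hudi's exact hash function.
```scala
// Route a key to one of the partition's file slices so a lookup only has to
// scan that slice; group lookup keys by shard to do one scan per touched slice.
def shardForKey(key: String, numFileSlices: Int): Int = {
  require(numFileSlices > 0, "partition has no file slices")
  math.abs(key.hashCode % numFileSlices)
}

def groupKeysByShard(keys: List[String], numFileSlices: Int): Map[Int, List[String]] =
  keys.groupBy(shardForKey(_, numFileSlices))
```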
Re: [I] [SUPPORT] Reliable ingestion from AWS S3 using Hudi is failing with software.amazon.awssdk.services.sqs.model.EmptyBatchRequestException [hudi]
SuneethaYamani commented on issue #11168: URL: https://github.com/apache/hudi/issues/11168#issuecomment-2124294672 @ad1happy2go I am using the 2.12-0.14.0 Hudi jar -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7713] Enforce ordering of fields during schema reconciliation [hudi]
codope commented on code in PR #11154: URL: https://github.com/apache/hudi/pull/11154#discussion_r1609536039 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSchemaUtils.scala: ## @@ -93,14 +93,14 @@ object HoodieSchemaUtils { // in the table's one we want to proceed aligning nullability constraints w/ the table's schema // Also, we promote types to the latest table schema if possible. val shouldCanonicalizeSchema = opts.getOrElse(CANONICALIZE_SCHEMA.key, CANONICALIZE_SCHEMA.defaultValue.toString).toBoolean +val shouldReconcileSchema = opts.getOrElse(DataSourceWriteOptions.RECONCILE_SCHEMA.key(), + DataSourceWriteOptions.RECONCILE_SCHEMA.defaultValue().toString).toBoolean val canonicalizedSourceSchema = if (shouldCanonicalizeSchema) { - canonicalizeSchema(sourceSchema, latestTableSchema, opts) + canonicalizeSchema(sourceSchema, latestTableSchema, opts, !shouldReconcileSchema) Review Comment: Reconciliation is not necessarily dependent on schema-on-read. I think the reason might have been to avoid conflicting with the schema reconciliation rules in case that is enabled. @jonvex to clarify. Whatever the reason, let's add a comment for reference. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
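For context on what "enforce ordering of fields" means in HUDI-7713, here is a small Scala sketch of the reordering rule under the assumption that reconciled fields keep the table schema's positions and genuinely new fields are appended at the end; this illustrates the rule and is not the actual HoodieSchemaUtils code.
```scala
import org.apache.spark.sql.types.StructType

// Fields present in the table schema keep the table's order; source-only
// fields are appended after them.
def reconcileFieldOrder(source: StructType, table: StructType): StructType = {
  val sourceByName = source.fields.map(f => f.name -> f).toMap
  val inTableOrder = table.fields.flatMap(f => sourceByName.get(f.name))
  val newFields = source.fields.filterNot(f => table.fieldNames.contains(f.name))
  StructType(inTableOrder ++ newFields)
}
```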
Re: [PR] [WIP] HoodieClusteringJob support purge pending clustering job if conflict [hudi]
Zouxxyy closed pull request #11218: [WIP] HoodieClusteringJob support purge pending clustering job if conflict URL: https://github.com/apache/hudi/pull/11218 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org