Re: [PR] [HUDI-5505] Fix counting of delta commits since last compaction in Sc… [hudi]
a-erofeev commented on PR #11251: URL: https://github.com/apache/hudi/pull/11251#issuecomment-2126173947

@hudi-bot run azure
Re: [PR] [MINOR] LSMTimeline needs to handle the case for tables which have not performed the first archiving yet [hudi]
bvaradar commented on code in PR #11271: URL: https://github.com/apache/hudi/pull/11271#discussion_r1610840996

## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/LSMTimeline.java:

@@ -158,13 +159,18 @@ public static int latestSnapshotVersion(HoodieTableMetaClient metaClient) throws
         LOG.warn("Error reading version file {}", versionFilePath, e);
       }
     }
+    return allSnapshotVersions(metaClient).stream().max(Integer::compareTo).orElse(-1);
   }

   /**
    * Returns all the valid snapshot versions.
    */
   public static List<Integer> allSnapshotVersions(HoodieTableMetaClient metaClient) throws IOException {
+    StoragePath archivedFolderPath = new StoragePath(metaClient.getArchivePath());

Review Comment:
Yes, found while testing a 0.14.x writer with a 1.x reader on a newly created 0.14.x table. LSMTimeline#latestSnapshotVersion was calling LSMTimeline#allSnapshotVersions due to the absence of the latest snapshot version file.
Re: [PR] [HUDI-7783] Fix connection leak in FileSystemBasedLockProvider [hudi]
xuzifu666 commented on code in PR #11267: URL: https://github.com/apache/hudi/pull/11267#discussion_r1610831406

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/FileSystemBasedLockProvider.java:

@@ -96,6 +96,7 @@ public void close() {
     synchronized (LOCK_FILE_NAME) {
       try {
         fs.delete(this.lockFile, true);
+        fs.close();

Review Comment:
close it first
Re: [PR] [HUDI-7783] Fix connection leak in FileSystemBasedLockProvider [hudi]
xuzifu666 closed pull request #11267: [HUDI-7783] Fix connection leak in FileSystemBasedLockProvider URL: https://github.com/apache/hudi/pull/11267
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
danny0405 commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1610825092

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

@@ -540,6 +547,51 @@ private HoodieFunctionalIndexDefinition getFunctionalIndexDefinition(String inde
     }
   }

+  private Set<String> getSecondaryIndexPartitionsToInit() {
+    Set<String> secondaryIndexPartitions = dataMetaClient.getFunctionalIndexMetadata().get().getIndexDefinitions().values().stream()
+        .map(HoodieFunctionalIndexDefinition::getIndexName)
+        .filter(indexName -> indexName.startsWith(HoodieTableMetadataUtil.PARTITION_NAME_SECONDARY_INDEX_PREFIX))

Review Comment:
+1 for `HoodieIndexDefinition`, which is more general.
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
danny0405 commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1610824721

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

@@ -437,9 +448,7 @@ private boolean initializeFromFilesystem(String initializationTime, List
       records = fileGroupCountAndRecordsPair.getValue();
       bulkCommit(commitTimeForPartition, partitionType, records, fileGroupCount);
       metadataMetaClient.reloadActiveTimeline();
-      String partitionPath = partitionType == FUNCTIONAL_INDEX
-          ? dataWriteConfig.getFunctionalIndexConfig().getIndexName()
-          : partitionType.getPartitionPath();
+      String partitionPath = (partitionType == FUNCTIONAL_INDEX || partitionType == SECONDARY_INDEX) ? dataWriteConfig.getFunctionalIndexConfig().getIndexName() : partitionType.getPartitionPath();

Review Comment:
Yeah, probably we can add some method like `MetadataPartitionType.getPartitionPath` to simplify the code.
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
danny0405 commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1610823837

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

@@ -410,6 +413,14 @@ private boolean initializeFromFilesystem(String initializationTime, List
+        Set<String> secondaryIndexPartitionsToInit = getSecondaryIndexPartitionsToInit();
+        if (secondaryIndexPartitionsToInit.isEmpty()) {
+          continue;
+        }
+        ValidationUtils.checkState(secondaryIndexPartitionsToInit.size() == 1, "Only one secondary index at a time is supported for now");

Review Comment:
> but can only create one at a time using sql due to limitation of sql grammar.

Reasonable, but it also looks like we only allow one index for MDT bootstrap; is this expected?
Re: [PR] [HUDI-7783] Fix connection leak in FileSystemBasedLockProvider [hudi]
danny0405 commented on code in PR #11267: URL: https://github.com/apache/hudi/pull/11267#discussion_r1610822698

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/FileSystemBasedLockProvider.java:

@@ -96,6 +96,7 @@ public void close() {
     synchronized (LOCK_FILE_NAME) {
       try {
         fs.delete(this.lockFile, true);
+        fs.close();

Review Comment:
Usually there is no need to close the filesystem, because Hadoop caches and reuses FileSystem instances; if we close this fs instance, it could cause exceptions for other invokers that are still using it.
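The caching behavior referenced here is Hadoop's process-wide FileSystem cache. A minimal sketch of the failure mode, assuming an HDFS URI and a hypothetical lock path (illustrative only, not the PR's code):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object FsCacheSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val uri  = URI.create("hdfs://namenode:8020/")

    // FileSystem.get hands back a process-wide cached instance (unless
    // fs.<scheme>.impl.disable.cache is set), so fs1 and fs2 are the SAME object.
    val fs1 = FileSystem.get(uri, conf)
    val fs2 = FileSystem.get(uri, conf)
    assert(fs1 eq fs2)

    // Closing the shared instance breaks every other holder of it:
    fs1.close()
    // fs2.delete(new Path("/locks/lock"), true)  // would now fail with "Filesystem closed"

    // A component that really must close its handle should own a private one:
    val privateFs = FileSystem.newInstance(uri, conf) // bypasses the cache
    privateFs.close()                                 // safe: nobody else shares it
  }
}
```

`FileSystem.newInstance` is the usual escape hatch when a component genuinely owns its handle and wants to close it.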
Re: [I] [DISCUSSION] Deltastreamer - Reading commit checkpoint from Kafka instead of latest Hoodie commit [hudi]
danny0405 commented on issue #11268: URL: https://github.com/apache/hudi/issues/11268#issuecomment-2125996412

> We reviewed the deltastreamer code and noticed that the deltastreamer can read commits from Kafka consumer groups

If the consumer `offset` is what the Hoodie checkpoint persists, it sounds more reasonable to always pull the offset from the Kafka server, where offsets can be managed in good shape.
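For reference, the broker-side committed offsets of a consumer group can be read directly with the standard `KafkaConsumer.committed` API; a sketch with hypothetical topic, group, and broker names:

```scala
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties()
props.put("bootstrap.servers", "kafka:9092")          // hypothetical broker
props.put("group.id", "deltastreamer-group")          // hypothetical group
props.put("key.deserializer", classOf[StringDeserializer].getName)
props.put("value.deserializer", classOf[StringDeserializer].getName)

val consumer = new KafkaConsumer[String, String](props)
try {
  val partitions = consumer.partitionsFor("input-topic").asScala
    .map(p => new TopicPartition(p.topic(), p.partition())).toSet

  // committed() returns the offsets the group last committed to the broker;
  // a partition with no commit maps to null.
  val committed = consumer.committed(partitions.asJava).asScala
  committed.foreach { case (tp, om) =>
    println(s"$tp -> ${Option(om).map(_.offset()).getOrElse("no committed offset")}")
  }
} finally {
  consumer.close()
}
```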
Re: [PR] [HUDI-7762] Optimizing Hudi Table Check with Delta Lake by Refining Class Name Checks In Spark3.5 [hudi]
danny0405 commented on code in PR #11224: URL: https://github.com/apache/hudi/pull/11224#discussion_r1610819252

## hudi-spark-datasource/hudi-spark3.5.x/src/main/scala/org/apache/spark/sql/adapter/Spark3_5Adapter.scala:

@@ -54,7 +54,7 @@ class Spark3_5Adapter extends BaseSpark3Adapter {
     case plan if !plan.resolved => None
     // NOTE: When resolving Hudi table we allow [[Filter]]s and [[Project]]s be applied
     // on top of it
-    case PhysicalOperation(_, _, DataSourceV2Relation(v2: V2TableWithV1Fallback, _, _, _, _)) if isHoodieTable(v2.v1Table) =>
+    case PhysicalOperation(_, _, DataSourceV2Relation(v2: V2TableWithV1Fallback, _, _, _, _)) if isHoodieTable(v2) =>

Review Comment:
@jonvex can you help with the review?
Re: [PR] [MINOR] LSMTimeline needs to handle the case for tables which have not performed the first archiving yet [hudi]
danny0405 commented on code in PR #11271: URL: https://github.com/apache/hudi/pull/11271#discussion_r1610818043

## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/LSMTimeline.java:

@@ -158,13 +159,18 @@ public static int latestSnapshotVersion(HoodieTableMetaClient metaClient) throws
         LOG.warn("Error reading version file {}", versionFilePath, e);
       }
     }
+    return allSnapshotVersions(metaClient).stream().max(Integer::compareTo).orElse(-1);
   }

   /**
    * Returns all the valid snapshot versions.
    */
   public static List<Integer> allSnapshotVersions(HoodieTableMetaClient metaClient) throws IOException {
+    StoragePath archivedFolderPath = new StoragePath(metaClient.getArchivePath());

Review Comment:
I searched around the invokers: one is the archive path `LSMTimelineWriter#clean` and another is `LSMTimeline#latestSnapshotVersion`; both would ensure the archiving already happened. Is this a case specific to backward compatibility?
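For readers following the thread, the shape of the guard being discussed looks roughly like this; the actual patch body is elided in the diff above, so the listing logic and file-name pattern below are assumptions:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object LsmGuardSketch {
  private val SnapshotVersionFile = raw"(\d+)".r // hypothetical snapshot-file name pattern

  def allSnapshotVersions(archivePath: String, conf: Configuration): List[Int] = {
    val folder = new Path(archivePath)
    val fs = folder.getFileSystem(conf)
    if (!fs.exists(folder)) {
      Nil // table has never archived: no folder yet, so no versions
    } else {
      fs.listStatus(folder).toList
        .map(_.getPath.getName)
        .collect { case SnapshotVersionFile(v) => v.toInt }
    }
  }
}
```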
Re: [PR] [MINOR] LSMTimeline needs to handle the case for tables which have not performed the first archiving yet [hudi]
hudi-bot commented on PR #11271: URL: https://github.com/apache/hudi/pull/11271#issuecomment-2125988423

## CI report:

* 818e29cf9483b5b3a17724a490e5c602f5b408c4 UNKNOWN
[PR] [MINOR] LSMTimeline needs to handle the case for tables which have not performed the first archiving yet [hudi]
bvaradar opened a new pull request, #11271: URL: https://github.com/apache/hudi/pull/11271

### Change Logs

Found during backwards compatibility testing. Sometimes, archiving will not yet have run for a Hudi table. In that case, LSMTimeline must handle this gracefully instead of throwing a FileNotFoundException.

### Impact

Backwards compatibility.

### Risk level (write none, low medium or high below)

None

### Documentation Update

None

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
Re: [I] Exception org.apache.hudi.exception.HoodieIOException: Could not read commit details [hudi]
Jason-liujc commented on issue #6143: URL: https://github.com/apache/hudi/issues/6143#issuecomment-2125944255

This happened for us, and the root cause was concurrently running Spark jobs writing to different partitions of the same insert_overwrite table. Our jobs didn't have proper concurrency-locking configuration. After adding DynamoDB locks for these insert_overwrite write tasks, we stopped seeing these issues.
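A hedged sketch of the lock configuration this refers to: the config keys are Hudi's documented DynamoDB lock settings, while the table name, partition key, region, and paths are placeholders, and `df` is assumed to be an existing DataFrame:

```scala
// Writer options enabling optimistic concurrency control with DynamoDB locks.
df.write.format("hudi")
  .option("hoodie.table.name", "my_table")                                   // placeholder
  .option("hoodie.datasource.write.operation", "insert_overwrite")
  .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
  .option("hoodie.cleaner.policy.failed.writes", "LAZY")
  .option("hoodie.write.lock.provider",
    "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider")
  .option("hoodie.write.lock.dynamodb.table", "hudi-lock-table")             // placeholder
  .option("hoodie.write.lock.dynamodb.partition_key", "my_table")            // one key per Hudi table
  .option("hoodie.write.lock.dynamodb.region", "us-east-1")                  // placeholder
  .mode("append")
  .save("s3://bucket/path/my_table")                                         // placeholder
```

With `hoodie.cleaner.policy.failed.writes=LAZY`, leftovers of failed writers are cleaned lazily rather than eagerly rolled back, which is the usual pairing with optimistic concurrency control.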
(hudi) branch branch-0.x updated: [HUDI-7784] Fix serde of HoodieHadoopConfiguration in Spark (#11270)
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch branch-0.x in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/branch-0.x by this push:
     new 2e39b41be07 [HUDI-7784] Fix serde of HoodieHadoopConfiguration in Spark (#11270)

2e39b41be07 is described below

commit 2e39b41be07d42c0d41fd2cf765732e592954466
Author: Y Ethan Guo
AuthorDate: Wed May 22 15:27:48 2024 -0700

    [HUDI-7784] Fix serde of HoodieHadoopConfiguration in Spark (#11270)
---
 .../apache/spark/HoodieSparkKryoRegistrar.scala    |  6 +-
 .../apache/spark/TestHoodieSparkKryoRegistrar.java | 86 ++
 2 files changed, 89 insertions(+), 3 deletions(-)

diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/HoodieSparkKryoRegistrar.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/HoodieSparkKryoRegistrar.scala
index a8650e5668a..eba3999ea57 100644
--- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/HoodieSparkKryoRegistrar.scala
+++ b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/HoodieSparkKryoRegistrar.scala
@@ -22,7 +22,7 @@ import org.apache.hudi.client.model.HoodieInternalRow
 import org.apache.hudi.common.model.{HoodieKey, HoodieSparkRecord}
 import org.apache.hudi.common.util.HoodieCommonKryoRegistrar
 import org.apache.hudi.config.HoodieWriteConfig
-import org.apache.hudi.storage.StorageConfiguration
+import org.apache.hudi.storage.hadoop.HadoopStorageConfiguration

 import com.esotericsoftware.kryo.io.{Input, Output}
 import com.esotericsoftware.kryo.serializers.JavaSerializer
@@ -64,8 +64,8 @@ class HoodieSparkKryoRegistrar extends HoodieCommonKryoRegistrar with KryoRegist
     // Hadoop's configuration is not a serializable object by itself, and hence
     // we're relying on [[SerializableConfiguration]] wrapper to work it around.
     // We cannot remove this entry; otherwise the ordering is changed.
-    // So we replace it with [[StorageConfiguration]].
-    kryo.register(classOf[StorageConfiguration[_]], new JavaSerializer())
+    // So we replace it with [[HadoopStorageConfiguration]] for Spark.
+    kryo.register(classOf[HadoopStorageConfiguration], new JavaSerializer())
   }

 /**
diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/spark/TestHoodieSparkKryoRegistrar.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/spark/TestHoodieSparkKryoRegistrar.java
new file mode 100644
index 000..4dd297a02b6
--- /dev/null
+++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/spark/TestHoodieSparkKryoRegistrar.java
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark;
+
+import org.apache.hudi.storage.hadoop.HadoopStorageConfiguration;
+
+import com.esotericsoftware.kryo.Kryo;
+import com.esotericsoftware.kryo.io.Input;
+import com.esotericsoftware.kryo.io.Output;
+import org.apache.hadoop.conf.Configuration;
+import org.junit.jupiter.api.Test;
+import org.objenesis.strategy.StdInstantiatorStrategy;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.util.HashMap;
+import java.util.Map;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+/**
+ * Tests {@link HoodieSparkKryoRegistrar}
+ */
+public class TestHoodieSparkKryoRegistrar {
+  @Test
+  public void testSerdeHoodieHadoopConfiguration() {
+    Kryo kryo = newKryo();
+
+    HadoopStorageConfiguration conf = new HadoopStorageConfiguration(new Configuration());
+
+    // Serialize
+    ByteArrayOutputStream baos = new ByteArrayOutputStream();
+    Output output = new Output(baos);
+    kryo.writeObject(output, conf);
+    output.close();
+
+    // Deserialize
+    ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
+    Input input = new Input(bais);
+    HadoopStorageConfiguration deserialized = kryo.readObject(input, HadoopStorageConfiguration.class);
+    input.close();
+
+    // Verify
+    assertEquals(getPropsInMap(conf), getPropsInMap(deserialized));
+  }
+
+  private Kryo newKryo() {
+    Kryo kryo = new
Re: [PR] [HUDI-7784][branch-0.x] Fix serde of HoodieHadoopConfiguration in Spark [hudi]
nsivabalan merged PR #11270: URL: https://github.com/apache/hudi/pull/11270
(hudi) branch master updated: [HUDI-7784] Fix serde of HoodieHadoopConfiguration in Spark (#11269)
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new e1aa1bcb4af [HUDI-7784] Fix serde of HoodieHadoopConfiguration in Spark (#11269)

e1aa1bcb4af is described below

commit e1aa1bcb4af5cbcad9e985d0333eb5275e128e5b
Author: Y Ethan Guo
AuthorDate: Wed May 22 15:27:03 2024 -0700

    [HUDI-7784] Fix serde of HoodieHadoopConfiguration in Spark (#11269)
---
 .../apache/spark/HoodieSparkKryoRegistrar.scala    |  6 +-
 .../apache/spark/TestHoodieSparkKryoRegistrar.java | 86 ++
 2 files changed, 89 insertions(+), 3 deletions(-)

diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/HoodieSparkKryoRegistrar.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/HoodieSparkKryoRegistrar.scala
index a8650e5668a..eba3999ea57 100644
--- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/HoodieSparkKryoRegistrar.scala
+++ b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/HoodieSparkKryoRegistrar.scala
@@ -22,7 +22,7 @@ import org.apache.hudi.client.model.HoodieInternalRow
 import org.apache.hudi.common.model.{HoodieKey, HoodieSparkRecord}
 import org.apache.hudi.common.util.HoodieCommonKryoRegistrar
 import org.apache.hudi.config.HoodieWriteConfig
-import org.apache.hudi.storage.StorageConfiguration
+import org.apache.hudi.storage.hadoop.HadoopStorageConfiguration

 import com.esotericsoftware.kryo.io.{Input, Output}
 import com.esotericsoftware.kryo.serializers.JavaSerializer
@@ -64,8 +64,8 @@ class HoodieSparkKryoRegistrar extends HoodieCommonKryoRegistrar with KryoRegist
     // Hadoop's configuration is not a serializable object by itself, and hence
     // we're relying on [[SerializableConfiguration]] wrapper to work it around.
     // We cannot remove this entry; otherwise the ordering is changed.
-    // So we replace it with [[StorageConfiguration]].
-    kryo.register(classOf[StorageConfiguration[_]], new JavaSerializer())
+    // So we replace it with [[HadoopStorageConfiguration]] for Spark.
+    kryo.register(classOf[HadoopStorageConfiguration], new JavaSerializer())
   }

 /**
diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/spark/TestHoodieSparkKryoRegistrar.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/spark/TestHoodieSparkKryoRegistrar.java
new file mode 100644
index 000..4dd297a02b6
--- /dev/null
+++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/spark/TestHoodieSparkKryoRegistrar.java
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark;
+
+import org.apache.hudi.storage.hadoop.HadoopStorageConfiguration;
+
+import com.esotericsoftware.kryo.Kryo;
+import com.esotericsoftware.kryo.io.Input;
+import com.esotericsoftware.kryo.io.Output;
+import org.apache.hadoop.conf.Configuration;
+import org.junit.jupiter.api.Test;
+import org.objenesis.strategy.StdInstantiatorStrategy;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.util.HashMap;
+import java.util.Map;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+/**
+ * Tests {@link HoodieSparkKryoRegistrar}
+ */
+public class TestHoodieSparkKryoRegistrar {
+  @Test
+  public void testSerdeHoodieHadoopConfiguration() {
+    Kryo kryo = newKryo();
+
+    HadoopStorageConfiguration conf = new HadoopStorageConfiguration(new Configuration());
+
+    // Serialize
+    ByteArrayOutputStream baos = new ByteArrayOutputStream();
+    Output output = new Output(baos);
+    kryo.writeObject(output, conf);
+    output.close();
+
+    // Deserialize
+    ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
+    Input input = new Input(bais);
+    HadoopStorageConfiguration deserialized = kryo.readObject(input, HadoopStorageConfiguration.class);
+    input.close();
+
+    // Verify
+    assertEquals(getPropsInMap(conf), getPropsInMap(deserialized));
+  }
+
+  private Kryo newKryo() {
+    Kryo kryo = new Kryo();
+
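For context, the registrar patched above only takes effect when Spark is configured to use Kryo with it; the typical wiring (standard Spark settings, shown for illustration, not part of the patch) is:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar")
```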
Re: [PR] [HUDI-7784] Fix serde of HoodieHadoopConfiguration in Spark [hudi]
nsivabalan merged PR #11269: URL: https://github.com/apache/hudi/pull/11269
Re: [I] [BUG] hudi cli command with Wrong FS error [hudi]
ehurheap commented on issue #9903: URL: https://github.com/apache/hudi/issues/9903#issuecomment-2125644288

I also ran into this problem when running: `compaction validate --instant 20240516172801913`

The validation itself appears to complete ok, because I see this output:
```
24/05/22 17:58:05 INFO DAGScheduler: Job 0 finished: collect at HoodieSparkEngineContext.java:103, took 297.868011 s
Result of Validation Operation :
24/05/22 17:58:05 INFO MultipartUploadOutputStream: close closed:false s3://bucketname-redacted/tmp/b0fcc58c-da3a-4a74-90f5-4a63bca5ada8.ser
24/05/22 17:58:05 INFO BlockManagerInfo: Removed broadcast_0_piece0 on ip-10-18-146-49.heap:44129 in memory (size: 598.3 KiB, free: 2.2 GiB)
24/05/22 17:58:05 INFO MultipartUploadOutputStream: close closed:true s3://bucketname-redacted/tmp/b0fcc58c-da3a-4a74-90f5-4a63bca5ada8.ser
24/05/22 17:58:05 INFO Javalin: Stopping Javalin ...
24/05/22 17:58:05 INFO Javalin: Javalin has stopped
24/05/22 17:58:05 INFO SparkUI: Stopped Spark web UI at http://ip-10-18-146-49.heap:4040
24/05/22 17:58:05 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
```
Then it continues shutting down Spark, but then:
```
24/05/22 17:58:05 INFO SparkContext: Successfully stopped SparkContext
24/05/22 17:58:05 INFO ShutdownHookManager: Shutdown hook called
24/05/22 17:58:05 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-ab2d5508-e1e0-4f35-af09-d75a368a84b6
24/05/22 17:58:05 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-51ff5e59-d464-4f7a-8ed5-dbd4906ea294
Command failed java.lang.IllegalArgumentException: Wrong FS: s3:/tmp/b0fcc58c-da3a-4a74-90f5-4a63bca5ada8.ser, expected: s3://bucketname-redacted
OperationResult{operation=CompactionOperation{baseInstantTime=.. etc the details of the compaction here}
```
hudi version 0.13.1
hudi-cli version 1.0
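The error itself comes from Hadoop's path check: `s3:/tmp/...` has a scheme but no bucket authority, so it cannot belong to a filesystem bound to `s3://bucketname-redacted`. A minimal sketch of the mismatch (assuming an S3 filesystem implementation is configured):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// A filesystem bound to the bucket...
val fs = FileSystem.get(URI.create("s3://bucketname-redacted/"), new Configuration())

// ...rejects a path whose URI carries no bucket authority (single slash):
fs.open(new Path("s3:/tmp/b0fcc58c-da3a-4a74-90f5-4a63bca5ada8.ser"))
// => java.lang.IllegalArgumentException: Wrong FS: s3:/tmp/..., expected: s3://bucketname-redacted
```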
Re: [PR] [HUDI-7784][branch-0.x] Fix serde of HoodieHadoopConfiguration in Spark [hudi]
hudi-bot commented on PR #11270: URL: https://github.com/apache/hudi/pull/11270#issuecomment-2125580506

## CI report:

* fd1bc98346a08d14b6f110a70827b27e888e77f3 UNKNOWN
Re: [PR] [HUDI-7784] Fix serde of HoodieHadoopConfiguration in Spark [hudi]
hudi-bot commented on PR #11269: URL: https://github.com/apache/hudi/pull/11269#issuecomment-2125492570

## CI report:

* 6ea8a900dbcd6c815270d07e51ed3683360462e5 UNKNOWN
[PR] [HUDI-7784][branch-0.x] Fix serde of HoodieHadoopConfiguration in Spark [hudi]
yihua opened a new pull request, #11270: URL: https://github.com/apache/hudi/pull/11270

### Change Logs

PR for master: https://github.com/apache/hudi/pull/11269. This PR targets `branch-0.x`.

This PR fixes the issue where `HoodieHadoopConfiguration` is not properly (de)serialized by Kryo in Spark. A new test is added to validate the (de)serialization. Before the fix, the test failed with a NullPointerException.

### Impact

Fixes the bug.

### Risk level

None

### Documentation Update

None

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[jira] [Updated] (HUDI-7784) Fix serde of HoodieHadoopConfiguration in Spark
[ https://issues.apache.org/jira/browse/HUDI-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7784:
---------------------------------
    Labels: hoodie-storage pull-request-available  (was: hoodie-storage)

> Fix serde of HoodieHadoopConfiguration in Spark
> ---
>
>                 Key: HUDI-7784
>                 URL: https://issues.apache.org/jira/browse/HUDI-7784
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
>              Labels: hoodie-storage, pull-request-available
>             Fix For: 0.15.0, 1.0.0
>
[PR] [HUDI-7784] Fix serde of HoodieHadoopConfiguration in Spark [hudi]
yihua opened a new pull request, #11269: URL: https://github.com/apache/hudi/pull/11269

### Change Logs

This PR fixes the issue where `HoodieHadoopConfiguration` is not properly (de)serialized by Kryo in Spark. A new test is added to validate the (de)serialization. Before the fix, the test failed with a NullPointerException.

### Impact

Fixes the bug.

### Risk level

None

### Documentation Update

None

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[jira] [Updated] (HUDI-7784) Fix serde of HoodieHadoopConfiguration in Spark
[ https://issues.apache.org/jira/browse/HUDI-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-7784:
----------------------------
    Labels: hoodie-storage  (was: )

> Fix serde of HoodieHadoopConfiguration in Spark
> ---
>
>                 Key: HUDI-7784
>                 URL: https://issues.apache.org/jira/browse/HUDI-7784
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
>              Labels: hoodie-storage
>             Fix For: 0.15.0, 1.0.0
>
[jira] [Updated] (HUDI-7784) Fix serde of HoodieHadoopConfiguration in Spark
[ https://issues.apache.org/jira/browse/HUDI-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-7784:
----------------------------
    Story Points: 2

> Fix serde of HoodieHadoopConfiguration in Spark
> ---
>
>                 Key: HUDI-7784
>                 URL: https://issues.apache.org/jira/browse/HUDI-7784
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Priority: Major
>             Fix For: 0.15.0
>
[jira] [Created] (HUDI-7784) Fix serde of HoodieHadoopConfiguration in Spark
Ethan Guo created HUDI-7784:
---------------------------

             Summary: Fix serde of HoodieHadoopConfiguration in Spark
                 Key: HUDI-7784
                 URL: https://issues.apache.org/jira/browse/HUDI-7784
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Ethan Guo
[jira] [Assigned] (HUDI-7784) Fix serde of HoodieHadoopConfiguration in Spark
[ https://issues.apache.org/jira/browse/HUDI-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo reassigned HUDI-7784:
-------------------------------
    Assignee: Ethan Guo

> Fix serde of HoodieHadoopConfiguration in Spark
> ---
>
>                 Key: HUDI-7784
>                 URL: https://issues.apache.org/jira/browse/HUDI-7784
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
>             Fix For: 0.15.0
>
[jira] [Updated] (HUDI-7784) Fix serde of HoodieHadoopConfiguration in Spark
[ https://issues.apache.org/jira/browse/HUDI-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-7784:
----------------------------
    Fix Version/s: 0.15.0

> Fix serde of HoodieHadoopConfiguration in Spark
> ---
>
>                 Key: HUDI-7784
>                 URL: https://issues.apache.org/jira/browse/HUDI-7784
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Priority: Major
>             Fix For: 0.15.0
>
[jira] [Updated] (HUDI-7784) Fix serde of HoodieHadoopConfiguration in Spark
[ https://issues.apache.org/jira/browse/HUDI-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-7784:
----------------------------
    Fix Version/s: 1.0.0

> Fix serde of HoodieHadoopConfiguration in Spark
> ---
>
>                 Key: HUDI-7784
>                 URL: https://issues.apache.org/jira/browse/HUDI-7784
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
>             Fix For: 0.15.0, 1.0.0
>
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
codope commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1610397009

## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:

@@ -783,4 +784,101 @@ public int getNumFileGroupsForPartition(MetadataPartitionType partition) {
     metadataFileSystemView, partition.getPartitionPath()));
     return partitionFileSliceMap.get(partition.getPartitionPath()).size();
   }
+
+  @Override
+  protected Map getSecondaryKeysForRecordKeys(List recordKeys, String partitionName) {
+    if (recordKeys.isEmpty()) {
+      return Collections.emptyMap();
+    }
+
+    // Load the file slices for the partition. Each file slice is a shard which saves a portion of the keys.
+    List partitionFileSlices = partitionFileSliceMap.computeIfAbsent(partitionName,
+        k -> HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices(metadataMetaClient, metadataFileSystemView, partitionName));
+    final int numFileSlices = partitionFileSlices.size();
+    ValidationUtils.checkState(numFileSlices > 0, "Number of file slices for partition " + partitionName + " should be > 0");
+
+    // Lookup keys from each file slice
+    // TODO: parallelize this loop
+    Map reverseSecondaryKeyMap = new HashMap<>();
+    for (FileSlice partition : partitionFileSlices) {
+      reverseLookupSecondaryKeys(partitionName, recordKeys, partition, reverseSecondaryKeyMap);
+    }
+
+    return reverseSecondaryKeyMap;
+  }
+
+  private void reverseLookupSecondaryKeys(String partitionName, List recordKeys, FileSlice fileSlice, Map recordKeyMap) {
+    Set keySet = new HashSet<>(recordKeys.size());
+    Map> logRecordsMap = new HashMap<>();

Review Comment:
The IDE suggests some duplicate blocks in previously written code, but none in the new code. However, I think you are talking about a few lines below, where we get the readers and return early if they are null. I will refactor and reuse as much as possible.
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
codope commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1610379136

## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java:

@@ -295,6 +311,11 @@ protected HoodieMetadataPayload(String key, int type,
     this.recordIndexMetadata = recordIndexMetadata;
   }

+  public HoodieMetadataPayload(String key, HoodieSecondaryIndexInfo secondaryIndexMetadata) {

Review Comment:
Should be private. Refactored as suggested.
Re: [I] [SUPPORT] throw "java.lang.NoSuchMethodError: org.apache.hadoop.hdfs.client.HdfsDataInputStream.getReadStatistics()" [hudi]
dinhphu2k1-gif commented on issue #1: URL: https://github.com/apache/hudi/issues/1#issuecomment-2125370032

@kon-si when I build the HBase jar, where should I put it for the Hudi build profile?
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
codope commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1610368740

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

@@ -540,6 +547,51 @@ private HoodieFunctionalIndexDefinition getFunctionalIndexDefinition(String inde
     }
   }

+  private Set<String> getSecondaryIndexPartitionsToInit() {
+    Set<String> secondaryIndexPartitions = dataMetaClient.getFunctionalIndexMetadata().get().getIndexDefinitions().values().stream()
+        .map(HoodieFunctionalIndexDefinition::getIndexName)
+        .filter(indexName -> indexName.startsWith(HoodieTableMetadataUtil.PARTITION_NAME_SECONDARY_INDEX_PREFIX))
+        .collect(Collectors.toSet());
+    Set<String> completedMetadataPartitions = dataMetaClient.getTableConfig().getMetadataPartitions();
+    secondaryIndexPartitions.removeAll(completedMetadataPartitions);
+    return secondaryIndexPartitions;
+  }
+
+  private Pair> initializeSecondaryIndexPartition(String indexName) throws IOException {
+    HoodieFunctionalIndexDefinition indexDefinition = getFunctionalIndexDefinition(indexName);
+    ValidationUtils.checkState(indexDefinition != null, "Secondary Index definition is not present for index " + indexName);
+    List> partitionFileSlicePairs = getPartitionFileSlicePairs();
+
+    // Reuse record index parallelism config to build secondary index
+    int parallelism = Math.min(partitionFileSlicePairs.size(), dataWriteConfig.getMetadataConfig().getRecordIndexMaxParallelism());
+    HoodieData records = readSecondaryKeysFromFileSlices(
+        engineContext,
+        partitionFileSlicePairs,
+        parallelism,
+        this.getClass().getSimpleName(),
+        dataMetaClient,
+        EngineType.SPARK,

Review Comment:
Makes sense. I have added abstract methods which are implemented by the engine-specific metadata writer.
Re: [I] [SUPPORT] throw "java.lang.NoSuchMethodError: org.apache.hadoop.hdfs.client.HdfsDataInputStream.getReadStatistics()" [hudi]
kon-si commented on issue #1: URL: https://github.com/apache/hudi/issues/1#issuecomment-2125323910

> @kon-si how to package Hudi again? Can you help me?

There are instructions on how to do it in the README of the hudi project.
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
codope commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1610329349

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

@@ -437,9 +448,7 @@ private boolean initializeFromFilesystem(String initializationTime, List
       records = fileGroupCountAndRecordsPair.getValue();
       bulkCommit(commitTimeForPartition, partitionType, records, fileGroupCount);
       metadataMetaClient.reloadActiveTimeline();
-      String partitionPath = partitionType == FUNCTIONAL_INDEX
-          ? dataWriteConfig.getFunctionalIndexConfig().getIndexName()
-          : partitionType.getPartitionPath();
+      String partitionPath = (partitionType == FUNCTIONAL_INDEX || partitionType == SECONDARY_INDEX) ? dataWriteConfig.getFunctionalIndexConfig().getIndexName() : partitionType.getPartitionPath();

Review Comment:
We will have to supply the indexing config to `MetadataPartitionType#getPartitionPath` to do so, because unlike the functional or secondary index, all other metadata indexes have the same name and partition path. Functional or secondary index names are defined by the user, and that is why we need to fetch them from the functional index config.
Re: [I] [SUPPORT] HiveSyncTool failure - Unable to create a `_ro` table when writing data [hudi]
shubhamn21 commented on issue #11254: URL: https://github.com/apache/hudi/issues/11254#issuecomment-2125294694

I think it may have something to do with AWS Glue compatibility. The [documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html#aws-glue-programming-etl-format-hudi-write) says that it only supports up to Hudi 0.12.1. As a workaround, I am using `.save` instead of `.saveAsTable`. I am not able to sync with Glue/Hive, but I am able to ingest data and query with spark-sql.
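A sketch of that workaround, with illustrative paths and options (`hoodie.datasource.hive_sync.enable=false` skips the Glue/Hive sync that catalog-based writes would attempt; `df` and `spark` are assumed to exist):

```scala
// Path-based write: no catalog entry is created or synced.
df.write.format("hudi")
  .option("hoodie.table.name", "my_table")                 // placeholder
  .option("hoodie.datasource.hive_sync.enable", "false")   // skip Glue/Hive sync
  .mode("append")
  .save("s3://bucket/warehouse/my_table")                  // placeholder

// Query it back by path with Spark SQL, no catalog needed:
spark.read.format("hudi")
  .load("s3://bucket/warehouse/my_table")
  .createOrReplaceTempView("my_table_v")
spark.sql("select count(*) from my_table_v").show()
```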
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
codope commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1610294310

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

@@ -540,6 +547,51 @@ private HoodieFunctionalIndexDefinition getFunctionalIndexDefinition(String inde
     }
   }

+  private Set<String> getSecondaryIndexPartitionsToInit() {
+    Set<String> secondaryIndexPartitions = dataMetaClient.getFunctionalIndexMetadata().get().getIndexDefinitions().values().stream()
+        .map(HoodieFunctionalIndexDefinition::getIndexName)
+        .filter(indexName -> indexName.startsWith(HoodieTableMetadataUtil.PARTITION_NAME_SECONDARY_INDEX_PREFIX))

Review Comment:
Actually, I should rename `HoodieFunctionalIndexDefinition` to `HoodieIndexDefinition`, as it will contain the definition for both the functional index and the secondary index, and any future record level index.
Re: [I] [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input [hudi]
soumilshah1995 closed issue #11258: [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input URL: https://github.com/apache/hudi/issues/11258
Re: [I] [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input [hudi]
soumilshah1995 commented on issue #11258: URL: https://github.com/apache/hudi/issues/11258#issuecomment-2125096672

Thanks man
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
codope commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1610214658

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

@@ -410,6 +413,14 @@ private boolean initializeFromFilesystem(String initializationTime, List
+        Set<String> secondaryIndexPartitionsToInit = getSecondaryIndexPartitionsToInit();
+        if (secondaryIndexPartitionsToInit.isEmpty()) {
+          continue;
+        }
+        ValidationUtils.checkState(secondaryIndexPartitionsToInit.size() == 1, "Only one secondary index at a time is supported for now");

Review Comment:
We can create multiple secondary indexes using configs, but can only create one at a time using SQL due to a limitation of the SQL grammar. To keep the user experience consistent, I have added this validation. This will be removed later, after we enhance the SQL grammar and parser to support creating multiple indexes at a time.
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
hudi-bot commented on PR #11043: URL: https://github.com/apache/hudi/pull/11043#issuecomment-2125080635

## CI report:

* 2915ad105601b17cf6915c299c13e9933032a10b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23489)
* 541b544049e68b3d22cdf0f5159fbd9b0005d345 UNKNOWN
* 0ca4a414be1438f009aa0b32e3f2b51904b47123 UNKNOWN
Re: [I] [SUPPORT] Hudi SQL Based Transformer Fails when trying to provide SQL File as input [hudi]
soumilshah1995 commented on issue #11258: URL: https://github.com/apache/hudi/issues/11258#issuecomment-2125075068

Really? Let me try.
Re: [I] issue with reading the data using hudi streamer [hudi]
Pavan792reddy commented on issue #11263: URL: https://github.com/apache/hudi/issues/11263#issuecomment-2125069736

The message id is being generated from the Pulsar topic, but it comes through as `__messageId`, not `_MessageId_`.
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
codope commented on PR #11043: URL: https://github.com/apache/hudi/pull/11043#issuecomment-2125053288

@KnightChess Thanks for your review. I have addressed your comments. Can you please take a look again?
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
codope commented on PR #11043: URL: https://github.com/apache/hudi/pull/11043#issuecomment-2125051546

> Do we need to use parallel stream to improve efficiency in `allFiles.map`?

I tried, but it seems like the usual `scala.collection.parallel` does not work with Scala 2.13 (which is used with Spark 3.5). To make it work for all Scala versions, I'll have to introduce a [new dependency](https://github.com/scala/scala-parallel-collections?tab=readme-ov-file#cross-building-dependency). I can do it in a follow-up PR, if necessary. However, I don't see other index types using parallel collections, plus I have also optimized the code as per your other suggestions (using pruned files instead of all files).
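For context, the cross-building issue boils down to `.par` having moved out of the standard library in Scala 2.13; a sketch, where `allFiles` and `lookupBloomFilter` are assumed stand-ins for the PR's code:

```scala
// Scala 2.13: this import comes from the separate scala-parallel-collections
// module, so it needs an extra dependency to compile on Spark 3.5.
import scala.collection.parallel.CollectionConverters._

val fileToBloomFilterMap = allFiles.par            // allFiles: assumed Seq[StoragePath]
  .map(file => file -> lookupBloomFilter(file))    // assumed helper hitting the metadata table
  .toMap.seq                                       // back to a sequential Map
```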
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
codope commented on code in PR #11043: URL: https://github.com/apache/hudi/pull/11043#discussion_r1610168033

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BloomFiltersIndexSupport.scala:

@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.RecordLevelIndexSupport.filterQueryWithRecordKey
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.hudi.metadata.HoodieTableMetadataUtil
+import org.apache.hudi.storage.StoragePath
+import org.apache.hudi.util.JFunction
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.expressions.Expression
+
+class BloomFiltersIndexSupport(spark: SparkSession,
+                               metadataConfig: HoodieMetadataConfig,
+                               metaClient: HoodieTableMetaClient) extends RecordLevelIndexSupport(spark, metadataConfig, metaClient) {
+
+  override def getCandidateFiles(allFiles: Seq[StoragePath], recordKeys: List[String]): Set[String] = {
+    val fileToBloomFilterMap = allFiles.map { file =>

Review Comment:
You're right. I followed the record index approach, but I have corrected it now.
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
codope commented on code in PR #11043: URL: https://github.com/apache/hudi/pull/11043#discussion_r1610167078

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BloomFiltersIndexSupport.scala:

@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.RecordLevelIndexSupport.filterQueryWithRecordKey
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.hudi.metadata.HoodieTableMetadataUtil
+import org.apache.hudi.storage.StoragePath
+import org.apache.hudi.util.JFunction
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.expressions.Expression
+
+class BloomFiltersIndexSupport(spark: SparkSession,
+                               metadataConfig: HoodieMetadataConfig,
+                               metaClient: HoodieTableMetaClient) extends RecordLevelIndexSupport(spark, metadataConfig, metaClient) {
+
+  override def getCandidateFiles(allFiles: Seq[StoragePath], recordKeys: List[String]): Set[String] = {
+    val fileToBloomFilterMap = allFiles.map { file =>
+      val relativePartitionPath = FSUtils.getRelativePartitionPath(metaClient.getBasePathV2, file)
+      val fileName = FSUtils.getFileName(file.getName, relativePartitionPath)
+      val bloomFilter = metadataTable.getBloomFilter(relativePartitionPath, fileName)
+      file -> bloomFilter
+    }.toMap
+
+    recordKeys.flatMap { recordKey =>
+      fileToBloomFilterMap.filter { case (_, bloomFilter) =>
+        // If bloom filter is empty, we assume conservatively that the file might contain the record key
+        bloomFilter.isEmpty || bloomFilter.get.mightContain(recordKey)
+      }.keys
+    }.map(_.getName).toSet
+  }
+
+  override def isIndexAvailable: Boolean = {
+    metadataConfig.isEnabled && metaClient.getTableConfig.getMetadataPartitions.contains(HoodieTableMetadataUtil.PARTITION_NAME_BLOOM_FILTERS)
+  }
+
+  override def filterQueriesWithRecordKey(queryFilters: Seq[Expression]): (List[Expression], List[String]) = {
+    var recordKeyQueries: List[Expression] = List.empty
+    var recordKeys: List[String] = List.empty
+    for (query <- queryFilters) {
+      val recordKeyOpt = getRecordKeyConfig

Review Comment:
Done.
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
codope commented on code in PR #11043: URL: https://github.com/apache/hudi/pull/11043#discussion_r1610166499 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BloomFiltersIndexSupport.scala: ## @@ -0,0 +1,87 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi + +import org.apache.hudi.RecordLevelIndexSupport.filterQueryWithRecordKey +import org.apache.hudi.common.config.HoodieMetadataConfig +import org.apache.hudi.common.fs.FSUtils +import org.apache.hudi.common.table.HoodieTableMetaClient +import org.apache.hudi.metadata.HoodieTableMetadataUtil +import org.apache.hudi.storage.StoragePath +import org.apache.hudi.util.JFunction +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.expressions.Expression + +class BloomFiltersIndexSupport(spark: SparkSession, + metadataConfig: HoodieMetadataConfig, + metaClient: HoodieTableMetaClient) extends RecordLevelIndexSupport(spark, metadataConfig, metaClient) { + + override def getCandidateFiles(allFiles: Seq[StoragePath], recordKeys: List[String]): Set[String] = { +val fileToBloomFilterMap = allFiles.map { file => + val relativePartitionPath = FSUtils.getRelativePartitionPath(metaClient.getBasePathV2, file) + val fileName = FSUtils.getFileName(file.getName, relativePartitionPath) + val bloomFilter = metadataTable.getBloomFilter(relativePartitionPath, fileName) + file -> bloomFilter +}.toMap + +recordKeys.flatMap { recordKey => + fileToBloomFilterMap.filter { case (_, bloomFilter) => Review Comment: great suggestion! I have changed the code accordingly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
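For readers following the review thread above, here is a minimal Scala sketch of the inverted loop the reviewer appears to be suggesting: each file's bloom filter is fetched and consulted once against all record keys, instead of filtering the whole map per key. Names such as `FSUtils`, `metaClient` and `metadataTable.getBloomFilter` follow the quoted snippet; the exact restructuring that was merged may differ.
```scala
// Sketch only: prune candidate files by consulting each bloom filter once.
override def getCandidateFiles(allFiles: Seq[StoragePath], recordKeys: List[String]): Set[String] = {
  allFiles.filter { file =>
    val relativePartitionPath = FSUtils.getRelativePartitionPath(metaClient.getBasePathV2, file)
    val fileName = FSUtils.getFileName(file.getName, relativePartitionPath)
    val bloomFilter = metadataTable.getBloomFilter(relativePartitionPath, fileName)
    // Conservative fallback: a missing bloom filter means the file may contain any key.
    bloomFilter.isEmpty || recordKeys.exists(key => bloomFilter.get.mightContain(key))
  }.map(_.getName).toSet
}
```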
Re: [I] [SUPPORT] throw "java.lang.NoSuchMethodError: org.apache.hadoop.hdfs.client.HdfsDataInputStream.getReadStatistics() [hudi]
dinhphu2k1-gif commented on issue #1: URL: https://github.com/apache/hudi/issues/1#issuecomment-2124992604 @kon-si how do I package Hudi again? Can you help me? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
hudi-bot commented on PR #11043: URL: https://github.com/apache/hudi/pull/11043#issuecomment-2124955598 ## CI report: * 2915ad105601b17cf6915c299c13e9933032a10b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23489) * 541b544049e68b3d22cdf0f5159fbd9b0005d345 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] issue with reading the data using hudi streamer [hudi]
Pavan792reddy commented on issue #11263: URL: https://github.com/apache/hudi/issues/11263#issuecomment-2124908753 @ad1happy2go I have made all the changes and it was working as expected. Now the script is failing with the error below -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [DISCUSSION] Deltastreamer - Reading commit checkpoint from Kafka instead of latest Hoodie commit [hudi]
KishanFairmatic opened a new issue, #11268: URL: https://github.com/apache/hudi/issues/11268 Our requirement is actually this: Supporting multiple deltastreamers writing to a single hudi table [https://github.com/apache/hudi/issues/6718](https://github.com/apache/hudi/issues/6718) [HUDI-5077](https://issues.apache.org/jira/browse/HUDI-5077) We reviewed the deltastreamer code and noticed that the deltastreamer can read commits from Kafka consumer groups if it doesn't find the last checkpoint from Hoodie. We're considering modifying the code to always read commits from Kafka, based on a new flag, by making changes in the Hudi deltastreamer. Do you foresee any nuances or issues with this approach? **Environment Description** * Hudi version : 0.13.0 * Spark version : 3.3.2 * Storage (HDFS/S3/GCS..) : S3 * Running on Docker? (yes/no) : no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
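For illustration, a minimal Scala sketch of the checkpoint-resolution order the issue proposes. The property name and both helper parameters below are hypothetical and do not exist in Hudi; they only make the proposed flag-gated behavior concrete.
```scala
import java.util.Properties

// Hypothetical flag and helpers -- illustrative only, not actual Hudi code.
def resolveCheckpoint(props: Properties,
                      hoodieCommitCheckpoint: Option[String],
                      kafkaGroupOffsets: () => Option[String]): Option[String] = {
  val forceKafka =
    props.getProperty("hoodie.deltastreamer.source.kafka.checkpoint.force", "false").toBoolean
  if (forceKafka) {
    kafkaGroupOffsets() // always trust the consumer group's committed offsets
  } else {
    hoodieCommitCheckpoint.orElse(kafkaGroupOffsets()) // current behavior: Hoodie commit first
  }
}
```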
Re: [PR] [HUDI-7783] Fix connection leak in FileSystemBasedLockProvider [hudi]
hudi-bot commented on PR #11267: URL: https://github.com/apache/hudi/pull/11267#issuecomment-2124564766 ## CI report: * 7f32187c42483782e078d1094a639af1fdcc0055 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7783) Fix connection leak in FileSystemBasedLockProvider
[ https://issues.apache.org/jira/browse/HUDI-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7783: - Labels: pull-request-available (was: ) > Fix connection leak in FileSystemBasedLockProvider > -- > > Key: HUDI-7783 > URL: https://issues.apache.org/jira/browse/HUDI-7783 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: xy >Assignee: xy >Priority: Major > Labels: pull-request-available > > Fix connection leak in FileSystemBasedLockProvider -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7783] Fix connection leak in FileSystemBasedLockProvider [hudi]
xuzifu666 opened a new pull request, #11267: URL: https://github.com/apache/hudi/pull/11267 ### Change Logs If fs.hdfs.impl.disable.cache is true, FileSystemBasedLockProvider's memory usage keeps growing until OOM because the FileSystem cache becomes too large, so we should close the fs in the close method to avoid the memory leak. ### Impact none ### Risk level (write none, low medium or high below) none ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none"._ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
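A minimal sketch of the fix described above, written in Scala for brevity (the real class is Java) and assuming the `fs` and `lockFile` fields of FileSystemBasedLockProvider. Since the PR was ultimately closed, treat this as an illustration of the idea rather than the merged change.
```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// With fs.hdfs.impl.disable.cache=true each provider holds its own FileSystem
// instance, so close() must release it along with the lock file to avoid
// accumulating instances until OOM.
class LockProviderSketch(fs: FileSystem, lockFile: Path) extends AutoCloseable {
  override def close(): Unit = {
    try {
      fs.delete(lockFile, true) // remove the lock marker first
    } finally {
      fs.close() // then release the uncached FileSystem handle
    }
  }
}
```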
[jira] [Created] (HUDI-7783) Fix connection leak in FileSystemBasedLockProvider
xy created HUDI-7783: Summary: Fix connection leak in FileSystemBasedLockProvider Key: HUDI-7783 URL: https://issues.apache.org/jira/browse/HUDI-7783 Project: Apache Hudi Issue Type: Improvement Components: core Reporter: xy Assignee: xy Fix connection leak in FileSystemBasedLockProvider -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-7762] Optimizing Hudi Table Check with Delta Lake by Refining Class Name Checks In Spark3.5 [hudi]
leesf commented on PR #11224: URL: https://github.com/apache/hudi/pull/11224#issuecomment-2124474355 > > When executed on a Delta table, this may result in an error. > > What action are we executing here? like `INSERT OVERWRITE delta.`/tmp/delta-table` SELECT col1 as id FROM VALUES 5,6,7,8,9;` in https://docs.delta.io/latest/quick-start.html We internally use HoodieCatalog to handle Delta tables as well as other table types, but Hudi (hudi-spark-datasource/hudi-spark3.5.x/src/main/scala/org/apache/spark/sql/adapter/Spark3_5Adapter.scala) calls v1Table even when the table is a Delta table, and Delta then throws an exception; v1Table should not be called when the table is not a Hudi table. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
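A small Scala sketch of the refinement being described: gate the v1Table call on a provider check so non-Hudi tables (e.g. Delta) are left to their own catalog resolution. The helper below is an assumption for illustration, not the actual Spark3_5Adapter code.
```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Only treat a table as Hudi when its provider says so; callers would
// delegate resolution whenever this returns false, never calling v1Table.
def isHoodieTable(table: CatalogTable): Boolean =
  table.provider.exists(_.equalsIgnoreCase("hudi"))
```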
[jira] [Closed] (HUDI-7781) Filter wrong partitions when using hoodie.datasource.write.partitions.to.delete
[ https://issues.apache.org/jira/browse/HUDI-7781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-7781. Resolution: Fixed Fixed via master branch:c7d2fc05fd7f285abd36c561217bf67de4e0479f > Filter wrong partitions when using > hoodie.datasource.write.partitions.to.delete > --- > > Key: HUDI-7781 > URL: https://issues.apache.org/jira/browse/HUDI-7781 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Xinyu Zou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7781) Filter wrong partitions when using hoodie.datasource.write.partitions.to.delete
[ https://issues.apache.org/jira/browse/HUDI-7781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7781: - Fix Version/s: 1.0.0 > Filter wrong partitions when using > hoodie.datasource.write.partitions.to.delete > --- > > Key: HUDI-7781 > URL: https://issues.apache.org/jira/browse/HUDI-7781 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Xinyu Zou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
(hudi) branch master updated: [HUDI-7781] Filter wrong partitions when using hoodie.datasource.write.partitions.to.delete (#11260)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new c7d2fc05fd7 [HUDI-7781] Filter wrong partitions when using hoodie.datasource.write.partitions.to.delete (#11260) c7d2fc05fd7 is described below commit c7d2fc05fd7f285abd36c561217bf67de4e0479f Author: Zouxxyy AuthorDate: Wed May 22 17:51:34 2024 +0800 [HUDI-7781] Filter wrong partitions when using hoodie.datasource.write.partitions.to.delete (#11260) --- .../org/apache/hudi/HoodieSparkSqlWriter.scala | 10 ++--- .../org/apache/hudi/TestHoodieSparkSqlWriter.scala | 24 +- 2 files changed, 30 insertions(+), 4 deletions(-) diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala index 3c28b1a2e0a..87418764dea 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala +++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala @@ -573,10 +573,14 @@ class HoodieSparkSqlWriterInternal { //note:spark-sql may url-encode special characters (* -> %2A) var (wildcardPartitions, fullPartitions) = partitions.partition(partition => partition.matches(".*(\\*|%2A).*")) +val allPartitions = FSUtils.getAllPartitionPaths(new HoodieSparkEngineContext(jsc): HoodieEngineContext, + HoodieMetadataConfig.newBuilder().fromProperties(cfg.getProps).build(), basePath) + +if (fullPartitions.nonEmpty) { + fullPartitions = fullPartitions.filter(partition => allPartitions.contains(partition)) +} + if (wildcardPartitions.nonEmpty) { - //get list of all partitions - val allPartitions = FSUtils.getAllPartitionPaths(new HoodieSparkEngineContext(jsc): HoodieEngineContext, - HoodieMetadataConfig.newBuilder().fromProperties(cfg.getProps).build(), basePath) //go through list of partitions with wildcards and add all partitions that match to val fullPartitions wildcardPartitions.foreach(partition => { //turn wildcard into regex diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala index 911351f4013..d8a6c9379a3 100644 --- a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala +++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala @@ -18,8 +18,9 @@ package org.apache.hudi import org.apache.hudi.client.SparkRDDWriteClient -import org.apache.hudi.common.model.{HoodieFileFormat, HoodieRecord, HoodieRecordPayload, HoodieTableType, WriteOperationType} +import org.apache.hudi.common.model.{HoodieFileFormat, HoodieRecord, HoodieRecordPayload, HoodieReplaceCommitMetadata, HoodieTableType, WriteOperationType} import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient, TableSchemaResolver} +import org.apache.hudi.common.table.timeline.TimelineUtils import org.apache.hudi.common.testutils.HoodieTestDataGenerator import org.apache.hudi.config.{HoodieBootstrapConfig, HoodieIndexConfig, HoodieWriteConfig} import org.apache.hudi.exception.{HoodieException, SchemaCompatibilityException} @@ -871,6 +872,27 @@ def testBulkInsertForDropPartitionColumn(): Unit = { }).count()) } + @Test + def testDeletePartitionsWithWrongPartition(): Unit = { +var (_, fooTableModifier) = deletePartitionSetup() +fooTableModifier = fooTableModifier + .updated(DataSourceWriteOptions.PARTITIONS_TO_DELETE.key(), "2016/03/15" + "," + "2025/03") + .updated(DataSourceWriteOptions.OPERATION.key(), WriteOperationType.DELETE_PARTITION.name()) +HoodieSparkSqlWriter.write(sqlContext, SaveMode.Append, fooTableModifier, spark.emptyDataFrame) +val snapshotDF3 = spark.read.format("org.apache.hudi").load(tempBasePath) +assertEquals(0, snapshotDF3.filter(entry => { + val partitionPath = entry.getString(3) + Seq("2015/03/16", "2015/03/17").count(p => partitionPath.equals(p)) != 1 +}).count()) + +val activeTimeline = createMetaClient(spark, tempBasePath).getActiveTimeline +val metadata = TimelineUtils.getCommitMetadata(activeTimeline.lastInstant().get(), activeTimeline) + .asInstanceOf[HoodieReplaceCommitMetadata] + assertTrue(metadata.getOperationType.equals(WriteOperationType.DELETE_PARTITION)) +// "2025/03" should not be in partitionToReplaceFileIds +assertEquals(Collections.singleton("2016/03/15"), metadata.getPartitionToReplaceFileIds.keySet()) + } + /** * Test case for non partition table with metatable support.
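A usage sketch based on the test in the commit above, assuming an existing Hudi table at `basePath` with a `2016/03/15` partition (other required write options such as the table name are omitted for brevity). With this change, the non-existent `2025/03` entry is filtered out instead of being recorded in the replace-commit metadata.
```scala
import org.apache.spark.sql.SaveMode

// Delete-partition write via the datasource; option keys follow the merged test.
spark.emptyDataFrame.write.format("hudi")
  .option("hoodie.datasource.write.operation", "delete_partition")
  .option("hoodie.datasource.write.partitions.to.delete", "2016/03/15,2025/03")
  .mode(SaveMode.Append)
  .save(basePath)
```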
Re: [PR] [HUDI-7781] Filter wrong partitions when using hoodie.datasource.write.partitions.to.delete [hudi]
danny0405 merged PR #11260: URL: https://github.com/apache/hudi/pull/11260 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
danny0405 commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1609596497 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -540,6 +547,51 @@ private HoodieFunctionalIndexDefinition getFunctionalIndexDefinition(String inde } } + private Set getSecondaryIndexPartitionsToInit() { +Set secondaryIndexPartitions = dataMetaClient.getFunctionalIndexMetadata().get().getIndexDefinitions().values().stream() +.map(HoodieFunctionalIndexDefinition::getIndexName) +.filter(indexName -> indexName.startsWith(HoodieTableMetadataUtil.PARTITION_NAME_SECONDARY_INDEX_PREFIX)) Review Comment: So secondary index belongs to functional index? ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -410,6 +413,14 @@ private boolean initializeFromFilesystem(String initializationTime, List secondaryIndexPartitionsToInit = getSecondaryIndexPartitionsToInit(); +if (secondaryIndexPartitionsToInit.isEmpty()) { + continue; +} +ValidationUtils.checkState(secondaryIndexPartitionsToInit.size() == 1, "Only one secondary index at a time is supported for now"); Review Comment: Not sure why we have the number limit. ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFunctionalIndexDefinition.java: ## @@ -52,7 +55,7 @@ public HoodieFunctionalIndexDefinition(String indexName, String indexType, Strin Map indexOptions) { this.indexName = indexName; this.indexType = indexType; -this.indexFunction = indexFunction; +this.indexFunction = nonEmpty(indexFunction) ? indexFunction : SPARK_IDENTITY; Review Comment: Can you elaborate the semantics for empty index function? ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java: ## @@ -783,4 +784,101 @@ public int getNumFileGroupsForPartition(MetadataPartitionType partition) { metadataFileSystemView, partition.getPartitionPath())); return partitionFileSliceMap.get(partition.getPartitionPath()).size(); } + + @Override + protected Map getSecondaryKeysForRecordKeys(List recordKeys, String partitionName) { +if (recordKeys.isEmpty()) { + return Collections.emptyMap(); +} + +// Load the file slices for the partition. Each file slice is a shard which saves a portion of the keys. +List partitionFileSlices = partitionFileSliceMap.computeIfAbsent(partitionName, +k -> HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices(metadataMetaClient, metadataFileSystemView, partitionName)); +final int numFileSlices = partitionFileSlices.size(); +ValidationUtils.checkState(numFileSlices > 0, "Number of file slices for partition " + partitionName + " should be > 0"); Review Comment: Can we just return empty instead of throwing? ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadata.java: ## @@ -111,6 +111,14 @@ static boolean isMetadataTable(String basePath) { return basePath.endsWith(HoodieTableMetaClient.METADATA_TABLE_FOLDER_PATH); } + static boolean isMetadataTableSecondaryIndexPartition(String basePath, Option partitionName) { +if (!isMetadataTable(basePath) || !partitionName.isPresent()) { + return false; +} + +return partitionName.get().startsWith("secondary_index_"); Review Comment: It would be great if we can avoid the hard-coded string.
## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -977,16 +1014,17 @@ public void updateFromWriteStatuses(HoodieCommitMetadata commitMetadata, HoodieD Map> partitionToRecordMap = HoodieTableMetadataUtil.convertMetadataToRecords( engineContext, dataWriteConfig, commitMetadata, instantTime, dataMetaClient, - enabledPartitionTypes, dataWriteConfig.getBloomFilterType(), - dataWriteConfig.getBloomIndexParallelism(), dataWriteConfig.isMetadataColumnStatsIndexEnabled(), - dataWriteConfig.getColumnStatsIndexParallelism(), dataWriteConfig.getColumnsEnabledForColumnStatsIndex(), dataWriteConfig.getMetadataConfig()); + enabledPartitionTypes, dataWriteConfig.getBloomFilterType(), + dataWriteConfig.getBloomIndexParallelism(), dataWriteConfig.isMetadataColumnStatsIndexEnabled(), + dataWriteConfig.getColumnStatsIndexParallelism(), dataWriteConfig.getColumnsEnabledForColumnStatsIndex(), dataWriteConfig.getMetadataConfig()); // Updates for record index are created by parsing the WriteStatus which is a hudi-client object. Hence, we cannot yet move this code // to the HoodieTableMetadataUtil class in
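The quoted javadoc in this thread ("each file slice is a shard which saves a portion of the keys") implies hash-based routing of keys to file slices. A small Scala sketch of that idea follows; the modulo scheme is an assumption for illustration, not Hudi's exact hash function.
```scala
// Route a key to one of the partition's file slices so a lookup only has to
// scan that slice; group lookup keys by shard to do one scan per touched slice.
def shardForKey(key: String, numFileSlices: Int): Int = {
  require(numFileSlices > 0, "partition has no file slices")
  math.abs(key.hashCode % numFileSlices)
}

def groupKeysByShard(keys: List[String], numFileSlices: Int): Map[Int, List[String]] =
  keys.groupBy(shardForKey(_, numFileSlices))
```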
Re: [I] [SUPPORT] Reliable ingestion from AWS S3 using Hudi is failing with software.amazon.awssdk.services.sqs.model.EmptyBatchRequestException [hudi]
SuneethaYamani commented on issue #11168: URL: https://github.com/apache/hudi/issues/11168#issuecomment-2124294672 @ad1happy2go I am using the 2.12-0.14.0 Hudi jar -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7713] Enforce ordering of fields during schema reconciliation [hudi]
codope commented on code in PR #11154: URL: https://github.com/apache/hudi/pull/11154#discussion_r1609536039 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSchemaUtils.scala: ## @@ -93,14 +93,14 @@ object HoodieSchemaUtils { // in the table's one we want to proceed aligning nullability constraints w/ the table's schema // Also, we promote types to the latest table schema if possible. val shouldCanonicalizeSchema = opts.getOrElse(CANONICALIZE_SCHEMA.key, CANONICALIZE_SCHEMA.defaultValue.toString).toBoolean +val shouldReconcileSchema = opts.getOrElse(DataSourceWriteOptions.RECONCILE_SCHEMA.key(), + DataSourceWriteOptions.RECONCILE_SCHEMA.defaultValue().toString).toBoolean val canonicalizedSourceSchema = if (shouldCanonicalizeSchema) { - canonicalizeSchema(sourceSchema, latestTableSchema, opts) + canonicalizeSchema(sourceSchema, latestTableSchema, opts, !shouldReconcileSchema) Review Comment: Reconciliation is not necessarily dependent on schema-on-read. I think the reason might have been to avoid conflicting with the schema reconciliation rules in case that is enabled. @jonvex to clarify. Whatever the reason, let's add a comment for reference. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
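For context on what "enforce ordering of fields" means in HUDI-7713, here is a small Scala sketch of the reordering rule under the assumption that reconciled fields keep the table schema's positions and genuinely new fields are appended at the end; this illustrates the rule and is not the actual HoodieSchemaUtils code.
```scala
import org.apache.spark.sql.types.StructType

// Fields present in the table schema keep the table's order; source-only
// fields are appended after them.
def reconcileFieldOrder(source: StructType, table: StructType): StructType = {
  val sourceByName = source.fields.map(f => f.name -> f).toMap
  val inTableOrder = table.fields.flatMap(f => sourceByName.get(f.name))
  val newFields = source.fields.filterNot(f => table.fieldNames.contains(f.name))
  StructType(inTableOrder ++ newFields)
}
```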
Re: [PR] [WIP] HoodieClusteringJob support purge pending clustering job if conflict [hudi]
Zouxxyy closed pull request #11218: [WIP] HoodieClusteringJob support purge pending clustering job if conflict URL: https://github.com/apache/hudi/pull/11218 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org