[jira] [Updated] (HUDI-7361) Fix a concurrency issue caused by rollbackFailedWrites
[ https://issues.apache.org/jira/browse/HUDI-7361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

eric updated HUDI-7361:
---
Description:
{quote}CREATE TABLE tbl (
  ..
) WITH (
  'connector' = 'hudi',
  'path' = '/tblpath',
  'table.type' = 'COPY_ON_WRITE',
  'write.bucket_assign.tasks'='5',
  'write.operation'='insert',
  'write.tasks'='5',
  'clustering.schedule.enabled'='true',
  'clustering.async.enabled'='true',
  'clustering.delta_commits'='3',
  'clustering.tasks'='5',
  'hoodie.cleaner.policy.failed.writes'='LAZY'
);
{quote}
*The table parameters are as above.*

*From the jobmanager and taskmanager logs, the sequence that triggers the failure can be summarized as follows:*
Before the write client completes commit 20240126154725671, the clean table service starts to work. As part of the clean process, it must check for failed writes and roll them back.
This check verifies whether the heartbeat of each inflight instant has timed out and rolls back the instants whose heartbeats have timed out. At the same time, the write client completes commit 20240126154725671 and deletes the heartbeat file of this instant.
The clean table service client therefore reads a last heartbeat of 0, treats the instant as timed out, and rolls it back.

> Fix a concurrency issue caused by rollbackFailedWrites
> --
>
> Key: HUDI-7361
> URL: https://issues.apache.org/jira/browse/HUDI-7361
> Project: Apache Hudi
> Issue Type: Bug
> Components: writer-core
> Reporter: eric
> Priority: Major
> Attachments: jobmanager_log.txt, taskmanager_log.txt

--
This message was sent by Atlassian Jira (v8.20.10#820010)
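The interleaving reported in HUDI-7361 can be reproduced with a toy model. This is only a sketch and not Hudi code: `HeartbeatStore`, `Timeline`, `is_expired`, `rollback_failed_writes`, and `simulate` are hypothetical names, the timeout value is arbitrary, and the `recheck` guard (look at the timeline again before rolling back) is an assumption about one possible mitigation, not necessarily the fix adopted for this issue.

```python
class HeartbeatStore:
    """Toy stand-in for heartbeat files: instant -> last heartbeat (ms)."""

    def __init__(self):
        self._beats = {}

    def beat(self, instant, now_ms):
        self._beats[instant] = now_ms

    def delete(self, instant):
        self._beats.pop(instant, None)

    def last_heartbeat(self, instant):
        # A missing heartbeat file is reported as 0, as described in the issue.
        return self._beats.get(instant, 0)


class Timeline:
    """Toy timeline that only tracks inflight and completed instants."""

    def __init__(self):
        self.inflight = set()
        self.completed = set()

    def complete(self, instant):
        self.inflight.discard(instant)
        self.completed.add(instant)


def is_expired(last_heartbeat_ms, now_ms, timeout_ms):
    return now_ms - last_heartbeat_ms > timeout_ms


def rollback_failed_writes(snapshot, timeline, heartbeats, now_ms, timeout_ms, recheck):
    """Return the instants the cleaner would roll back."""
    rolled_back = []
    for instant in sorted(snapshot):
        if not is_expired(heartbeats.last_heartbeat(instant), now_ms, timeout_ms):
            continue
        # Hypothetical guard: skip instants that completed after the snapshot.
        if recheck and instant in timeline.completed:
            continue
        rolled_back.append(instant)
    return rolled_back


def simulate(recheck):
    now_ms = 1_000_000
    timeline, heartbeats = Timeline(), HeartbeatStore()
    instant = "20240126154725671"
    timeline.inflight.add(instant)
    heartbeats.beat(instant, now_ms)

    # 1. The cleaner lists the inflight instants.
    snapshot = set(timeline.inflight)
    # 2. The writer completes the commit and deletes its heartbeat file.
    timeline.complete(instant)
    heartbeats.delete(instant)
    # 3. The cleaner checks heartbeats against its now stale snapshot.
    return rollback_failed_writes(snapshot, timeline, heartbeats, now_ms + 10, 60_000, recheck)
```

Calling `simulate(False)` rolls back the instant that has just been committed, which matches the reported behavior; `simulate(True)` leaves it alone.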
[jira] [Updated] (HUDI-7361) Fix a concurrency issue caused by rollbackFailedWrites
[ https://issues.apache.org/jira/browse/HUDI-7361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

eric updated HUDI-7361:
---
Attachment: jobmanager_log.txt
            taskmanager_log.txt

> Fix a concurrency issue caused by rollbackFailedWrites
> --
>
> Key: HUDI-7361
> URL: https://issues.apache.org/jira/browse/HUDI-7361
> Project: Apache Hudi
> Issue Type: Bug
> Components: writer-core
> Reporter: eric
> Priority: Major
> Attachments: jobmanager_log.txt, taskmanager_log.txt
[jira] [Created] (HUDI-7361) Fix a concurrency issue caused by rollbackFailedWrites
eric created HUDI-7361:
--
Summary: Fix a concurrency issue caused by rollbackFailedWrites
Key: HUDI-7361
URL: https://issues.apache.org/jira/browse/HUDI-7361
Project: Apache Hudi
Issue Type: Bug
Components: writer-core
Reporter: eric
[jira] [Updated] (HUDI-7360) Incremental CDC Query after 0.14.1 upgrade giving Jackson class incompatibility exception
[ https://issues.apache.org/jira/browse/HUDI-7360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aditya Goenka updated HUDI-7360:
Description:
Github Issue - [https://github.com/apache/hudi/issues/10590]

Reproducible code

```
from typing import Any

from pyspark import Row
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder \
    .appName("Hudi Basics") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.1") \
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
    .getOrCreate()

sc = spark.sparkContext

table_name = "hudi_trips_cdc"
base_path = "/tmp/test_issue_10590_4"  # Replace for whatever path
quickstart_utils = sc._jvm.org.apache.hudi.QuickstartUtils
dataGen = quickstart_utils.DataGenerator()

inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))


def create_df():
    df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
    return df


def write_data():
    df = create_df()

    hudi_options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": "uuid",
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # This can be either MoR or CoW and the error will still happen
        "hoodie.datasource.write.partitionpath.field": "partitionpath",
        "hoodie.datasource.write.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.table.cdc.enabled": "true",  # This can be left enabled, and won't affect anything unless actually queried as CDC
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.upsert.shuffle.parallelism": 2,
        "hoodie.insert.shuffle.parallelism": 2
    }

    df.write.format("hudi") \
        .options(**hudi_options) \
        .mode("overwrite") \
        .save(base_path)


def update_data():
    updates = quickstart_utils.convertToStringList(dataGen.generateUpdates(10))
    df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
    df.write \
        .format("hudi") \
        .mode("append") \
        .save(base_path)


def incremental_query():
    ordered_rows: list[Row] = spark.read \
        .format("hudi") \
        .load(base_path) \
        .select(col("_hoodie_commit_time").alias("commit_time")) \
        .orderBy(col("commit_time")) \
        .collect()

    commits: list[Any] = list(map(lambda row: row[0], ordered_rows))
    begin_time = commits[0]

    incremental_read_options = {
        'hoodie.datasource.query.incremental.format': "cdc",  # Uncomment this line to Query as CDC, crashes in 0.14.1
        'hoodie.datasource.query.type': 'incremental',
        'hoodie.datasource.read.begin.instanttime': begin_time,
    }

    trips_incremental_df = spark.read \
        .format("hudi") \
        .options(**incremental_read_options) \
        .load(base_path)

    # Error also occurs when using the "from_hudi_table_changes" in 0.14.1
    # sql_query = f""" SELECT * FROM hudi_table_changes ('{base_path}', 'cdc', 'earliest')"""
    # trips_incremental_df = spark.sql(sql_query)

    trips_incremental_df.show()
    trips_incremental_df.printSchema()


if __name__ == "__main__":
    write_data()
    update_data()
    incremental_query()
```

was: Github Issue - [https://github.com/apache/hudi/issues/10590]

> Incremental CDC Query after 0.14.1 upgrade giving Jackson class incompatibility exception
> -
>
> Key: HUDI-7360
> URL: https://issues.apache.org/jira/browse/HUDI-7360
> Project: Apache Hudi
> Issue Type: Bug
> Components: reader-core
> Reporter: Aditya Goenka
> Priority: Critical
> Fix For: 1.1.0
>
> Github Issue - [https://github.com/apache/hudi/issues/10590]
[jira] [Updated] (HUDI-7360) Incremental CDC Query after 0.14.1 upgrade giving Jackson class incompatibility exception
[ https://issues.apache.org/jira/browse/HUDI-7360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aditya Goenka updated HUDI-7360:
Summary: Incremental CDC Query after 0.14.1 upgrade giving Jackson class incompatibility exception (was: Incremental CDC Query after 0.14.X upgrade giving Jackson class incompatibility exception)

> Incremental CDC Query after 0.14.1 upgrade giving Jackson class incompatibility exception
> -
>
> Key: HUDI-7360
> URL: https://issues.apache.org/jira/browse/HUDI-7360
> Project: Apache Hudi
> Issue Type: Bug
> Components: reader-core
> Reporter: Aditya Goenka
> Priority: Critical
> Fix For: 1.1.0
>
> Github Issue - [https://github.com/apache/hudi/issues/10590]
[jira] [Created] (HUDI-7360) Incremental CDC Query after 0.14.X upgrade giving Jackson class incompatibility exception
Aditya Goenka created HUDI-7360:
---
Summary: Incremental CDC Query after 0.14.X upgrade giving Jackson class incompatibility exception
Key: HUDI-7360
URL: https://issues.apache.org/jira/browse/HUDI-7360
Project: Apache Hudi
Issue Type: Bug
Components: reader-core
Reporter: Aditya Goenka
Fix For: 1.1.0

Github Issue - [https://github.com/apache/hudi/issues/10590]
[jira] [Commented] (HUDI-7320) hive-sync unexpectedly loads archived timeline
[ https://issues.apache.org/jira/browse/HUDI-7320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17812499#comment-17812499 ]

sivabalan narayanan commented on HUDI-7320:
---
We already fixed something along these lines. Can you check whether it is reproducible with 0.14.0 as well?

> hive-sync unexpectedly loads archived timeline
> --
>
> Key: HUDI-7320
> URL: https://issues.apache.org/jira/browse/HUDI-7320
> Project: Apache Hudi
> Issue Type: Bug
> Components: meta-sync
> Affects Versions: 0.13.1
> Reporter: Raymond Xu
> Priority: Critical
> Attachments: Screenshot 2024-01-16 at 5.49.25 PM.png, Screenshot 2024-01-16 at 5.49.30 PM.png
>
> Investigation shows that the hive-sync step loaded the archived timeline and caused a long delay in the overall write process, and that a full scan for changes in all partitions is not used. Need to dig further.
Re: [PR] added new videos for hudi oss site [hudi]
nfarah86 commented on PR #10563: URL: https://github.com/apache/hudi/pull/10563#issuecomment-1917696609 @bhasudha made the aws -> amazon changes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7357] Introduce generic StorageConfiguration [hudi]
hudi-bot commented on PR #10586:
URL: https://github.com/apache/hudi/pull/10586#issuecomment-1917684020

## CI report:

* e6a99b7319648fce943abc73b460239350ff18d3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2)

Bot commands
@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7347][Stacked on HUDI-7335] Introduce SeekableDataInputStream for random access [hudi]
hudi-bot commented on PR #10575:
URL: https://github.com/apache/hudi/pull/10575#issuecomment-1917510979

## CI report:

* 24d06d5c92ebb9ef98c4689365eabd1e197c7197 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1)
* 806bd78e4b5f1bc0de9950daeee59dceccba9941 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8)

Bot commands
@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7357] Introduce generic StorageConfiguration [hudi]
hudi-bot commented on PR #10586:
URL: https://github.com/apache/hudi/pull/10586#issuecomment-1917495130

## CI report:

* e6a99b7319648fce943abc73b460239350ff18d3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2)

Bot commands
@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7347][Stacked on HUDI-7335] Introduce SeekableDataInputStream for random access [hudi]
hudi-bot commented on PR #10575:
URL: https://github.com/apache/hudi/pull/10575#issuecomment-1917494985

## CI report:

* 24d06d5c92ebb9ef98c4689365eabd1e197c7197 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1)
* 806bd78e4b5f1bc0de9950daeee59dceccba9941 UNKNOWN

Bot commands
@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7357] Introduce generic StorageConfiguration [hudi]
hudi-bot commented on PR #10586:
URL: https://github.com/apache/hudi/pull/10586#issuecomment-1917480260

## CI report:

* e6a99b7319648fce943abc73b460239350ff18d3 UNKNOWN

Bot commands
@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7347][Stacked on HUDI-7335] Introduce SeekableDataInputStream for random access [hudi]
yihua commented on code in PR #10575:
URL: https://github.com/apache/hudi/pull/10575#discussion_r1471603752

##
hudi-hadoop-common/src/main/java/org/apache/hudi/hadoop/fs/HadoopSeekableDataInputStream.java:
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.hadoop.fs;
+
+import org.apache.hudi.io.SeekableDataInputStream;
+
+import org.apache.hadoop.fs.FSDataInputStream;
+
+import java.io.IOException;
+
+/**
+ * An implementation of {@link SeekableDataInputStream} based on Hadoop's {@link FSDataInputStream}
+ */
+public class HadoopSeekableDataInputStream extends SeekableDataInputStream {
+  private final FSDataInputStream stream;
+
+  public HadoopSeekableDataInputStream(FSDataInputStream stream) {
+    super(stream);
+    this.stream = stream;
+  }
+
+  @Override
+  public long getPosition() throws IOException {

Review Comment:
Fixed now.
(hudi) branch master updated: [HUDI-7343] Replace Path.SEPARATOR with HoodieLocation.SEPARATOR (#10570)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new a078242b19d [HUDI-7343] Replace Path.SEPARATOR with HoodieLocation.SEPARATOR (#10570)
a078242b19d is described below

commit a078242b19dc3f8b46d08e197d8b77fa34f1808a
Author: Y Ethan Guo
AuthorDate: Tue Jan 30 08:47:30 2024 -0800

    [HUDI-7343] Replace Path.SEPARATOR with HoodieLocation.SEPARATOR (#10570)
---
 .../apache/hudi/cli/commands/ExportCommand.java | 5 +--
 .../cli/commands/TestHoodieLogFileCommand.java | 3 +-
 .../apache/hudi/cli/commands/TestTableCommand.java | 5 +--
 .../hudi/cli/integ/ITTestBootstrapCommand.java | 9 +++---
 .../cli/integ/ITTestHDFSParquetImportCommand.java | 5 +--
 .../hudi/cli/integ/ITTestMarkersCommand.java | 5 +--
 .../hudi/cli/integ/ITTestSavepointsCommand.java | 3 +-
 .../apache/hudi/cli/integ/ITTestTableCommand.java | 12
 .../hudi/client/heartbeat/HeartbeatUtils.java | 3 +-
 .../client/heartbeat/HoodieHeartbeatClient.java | 6 ++--
 .../lock/FileSystemBasedLockProvider.java | 7 +++--
 .../BaseHoodieFunctionalIndexClient.java | 5 ++-
 .../timeline/TestCompletionTimeQueryView.java | 6 ++--
 .../utils/TestLegacyArchivedMetaEntryReader.java | 5 +--
 .../hudi/client/TestJavaHoodieBackedMetadata.java | 9 +++---
 .../client/utils/SparkMetadataWriterUtils.java | 3 +-
 .../hudi/client/TestHoodieClientMultiWriter.java | 3 +-
 .../functional/TestHoodieBackedMetadata.java | 19 ++--
 .../DirectMarkerBasedDetectionStrategy.java | 3 +-
 .../hudi/common/fs/inline/InLineFSUtils.java | 12 +---
 .../common/heartbeat/HoodieHeartbeatUtils.java | 4 ++-
 .../hudi/common/table/HoodieTableMetaClient.java | 36 --
 .../hudi/metadata/AbstractHoodieTableMetadata.java | 9 +++---
 .../hudi/metadata/HoodieMetadataPayload.java | 3 +-
 .../apache/hudi/metadata/HoodieTableMetadata.java | 11 ---
 .../hudi/metadata/HoodieTableMetadataUtil.java | 3 +-
 .../common/fs/TestHoodieWrapperFileSystem.java | 3 +-
 .../org/apache/hudi/sink/meta/CkpMetadata.java | 4 ++-
 .../java/org/apache/hudi/source/FileIndex.java | 3 +-
 .../hudi/table/catalog/TableOptionProperties.java | 3 +-
 .../apache/hudi/table/format/FilePathUtils.java | 5 +--
 .../main/java/org/apache/hudi/util/ClientIds.java | 3 +-
 .../apache/hudi/util/ViewStorageProperties.java | 3 +-
 .../apache/hudi/sink/ITTestDataStreamWrite.java | 3 +-
 .../hudi/sink/bucket/ITTestBucketStreamWrite.java | 3 +-
 .../org/apache/hudi/sink/utils/TestWriteBase.java | 4 ++-
 .../test/java/org/apache/hudi/utils/TestUtils.java | 3 +-
 .../hudi/hadoop/utils/HoodieInputFormatUtils.java | 3 +-
 .../apache/hudi/hadoop/TestInputPathHandler.java | 13
 .../procedures/ExportInstantsProcedure.scala | 16 +-
 .../apache/hudi/testutils/DataSourceTestUtils.java | 3 +-
 .../org/apache/hudi/TestHoodieFileIndex.scala | 19 +++-
 .../hudi/procedure/TestBootstrapProcedure.scala | 25 +++
 .../procedure/TestHdfsParquetImportProcedure.scala | 5 +--
 .../hudi/analysis/HoodieSpark32PlusAnalysis.scala | 18 +--
 .../hudi/hive/testutils/HiveTestService.java | 4 +--
 .../MarkerBasedEarlyConflictDetectionRunnable.java | 3 +-
 .../utilities/streamer/SparkSampleWritesUtils.java | 3 +-
 48 files changed, 197 insertions(+), 146 deletions(-)

diff --git a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ExportCommand.java b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ExportCommand.java
index 40e7154b5f9..b196c62d0fb 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ExportCommand.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ExportCommand.java
@@ -44,6 +44,7 @@ import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
 import org.apache.hudi.common.util.collection.ClosableIterator;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.hadoop.fs.HadoopFSUtils;
+import org.apache.hudi.storage.HoodieLocation;
 
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -168,7 +169,7 @@ public class ExportCommand {
           LOG.error("Could not load metadata for action " + action + " at instant time " + instantTime);
           continue;
         }
-        final String outPath = localFolder + Path.SEPARATOR + instantTime + "." + action;
+        final String outPath = localFolder + HoodieLocation.SEPARATOR + instantTime + "." + action;
         writeToFile(outPath, HoodieAvroUtils.avroToJson(metadata, true));
       }
     }
@@ -190,7 +191,7 @@ public class ExportCommand {
     final HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient();
     final HoodieActiveTimeline t
Re: [PR] [HUDI-7343] Replace Path.SEPARATOR with HoodieLocation.SEPARATOR [hudi]
yihua merged PR #10570:
URL: https://github.com/apache/hudi/pull/10570
Re: [PR] [HUDI-7343] Replace Path.SEPARATOR with HoodieLocation.SEPARATOR [hudi]
yihua commented on PR #10570:
URL: https://github.com/apache/hudi/pull/10570#issuecomment-1917304625

> I didn't check all the usages of the `Path.SEPARATOR`, the change looks straight-forward so I approved it.

Yes, `Path.SEPARATOR` usages are all replaced.
Re: [PR] [WIP] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive [hudi]
jonvex commented on code in PR #10422:
URL: https://github.com/apache/hudi/pull/10422#discussion_r1471439782

##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/ObjectInspectorCache.java:
##
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.hadoop.utils;
+
+import com.github.benmanes.caffeine.cache.Cache;
+import com.github.benmanes.caffeine.cache.Caffeine;
+import org.apache.avro.Schema;
+import org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector;
+import org.apache.hadoop.hive.serde.serdeConstants;
+import org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.mapred.JobConf;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+/**
+ * To read value from an ArrayWritable, an ObjectInspector is needed.
+ * Object inspectors are cached here or created using the column type map.
+ */
+public class ObjectInspectorCache {
+  private final Map columnTypeMap = new HashMap<>();
+  private final Cache
+      objectInspectorCache = Caffeine.newBuilder().maximumSize(1000).build();
+
+  public Map getColumnTypeMap() {
+    return columnTypeMap;
+  }
+
+  public ObjectInspectorCache(Schema tableSchema, JobConf jobConf) {
+    //From AbstractRealtimeRecordReader#prepareHiveAvroSerializer
+    // hive will append virtual columns at the end of column list. we should remove those columns.
+    // eg: current table is col1, col2, col3; jobConf.get(serdeConstants.LIST_COLUMNS): col1, col2, col3 ,BLOCK__OFFSET__INSIDE__FILE ...
+    Set writerSchemaColNames = tableSchema.getFields().stream().map(f -> f.name().toLowerCase(Locale.ROOT)).collect(Collectors.toSet());
+    List columnNameList = Arrays.stream(jobConf.get(serdeConstants.LIST_COLUMNS).split(",")).collect(Collectors.toList());
+    List columnTypeList = TypeInfoUtils.getTypeInfosFromTypeString(jobConf.get(serdeConstants.LIST_COLUMN_TYPES));
+
+    int columnNameListLen = columnNameList.size() - 1;
+    for (int i = columnNameListLen; i >= 0; i--) {
+      String lastColName = columnNameList.get(columnNameList.size() - 1);
+      // virtual columns will only append at the end of column list. it will be ok to break the loop.
+      if (writerSchemaColNames.contains(lastColName)) {
+        break;
+      }
+      columnNameList.remove(columnNameList.size() - 1);
+      columnTypeList.remove(columnTypeList.size() - 1);
+    }
+
+    //Use columnNameList.size() instead of columnTypeList because the type list is longer for some reason
+    IntStream.range(0, columnNameList.size()).boxed().forEach(i -> columnTypeMap.put(columnNameList.get(i),
+        TypeInfoUtils.getTypeInfosFromTypeString(columnTypeList.get(i).getQualifiedName()).get(0)));
+
+    StructTypeInfo rowTypeInfo = (StructTypeInfo) TypeInfoFactory.getStructTypeInfo(columnNameList, columnTypeList);
+    ArrayWritableObjectInspector objectInspector = new ArrayWritableObjectInspector(rowTypeInfo);

Review Comment:
FYI this is pretty much a copy of https://github.com/apache/hudi/blob/e9389ffde53fa2b28feba248b7e8f17fd565e458/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/AbstractRealtimeRecordReader.java#L111
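The trailing-column trimming in the hunk above can be tried out in isolation. The following is a minimal Python sketch, not the PR's code: plain lists stand in for the `jobConf` values and type-name strings stand in for Hive's `TypeInfo` objects, and `trim_virtual_columns` is a hypothetical name. It relies on the same assumption the Java comment states, namely that virtual columns are only ever appended at the end of the column list.

```python
def trim_virtual_columns(schema_field_names, column_names, column_types):
    """Drop trailing columns that are not part of the table schema.

    Returns the trimmed name list, the trimmed type list, and a
    name -> type map built by indexing over the trimmed names.
    """
    # The schema side is lower-cased; the column names from the job
    # configuration are compared as given, as in the quoted hunk.
    writer_cols = {name.lower() for name in schema_field_names}
    names = list(column_names)
    types = list(column_types)

    # Virtual columns only appear at the end, so stop at the first
    # trailing column that belongs to the schema.
    while names and names[-1] not in writer_cols:
        names.pop()
        if types:
            types.pop()

    # Index by the name list, since the type list may be longer.
    type_map = {name: types[i] for i, name in enumerate(names)}
    return names, types, type_map


names, types, type_map = trim_virtual_columns(
    ["col1", "col2", "col3"],
    "col1,col2,col3,BLOCK__OFFSET__INSIDE__FILE,INPUT__FILE__NAME".split(","),
    ["int", "string", "double", "bigint", "string"],
)
```

With the sample input, the two trailing virtual columns are removed and `type_map` maps the three table columns to their types.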
Re: [PR] [WIP] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive [hudi]
jonvex commented on code in PR #10422:
URL: https://github.com/apache/hudi/pull/10422#discussion_r1471420556

##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java:
##
@@ -91,9 +94,42 @@ private void initAvroInputFormat() {
     }
   }
 
+  private static boolean checkIfHudiTable(final InputSplit split, final JobConf job) {
+    try {
+      Option tablePathOpt = TablePathUtils.getTablePath(((FileSplit) split).getPath(), job);
+      if (!tablePathOpt.isPresent()) {
+        return false;
+      }
+      return tablePathOpt.get().getFileSystem(job).exists(new Path(tablePathOpt.get(), HoodieTableMetaClient.METAFOLDER_NAME));
+    } catch (IOException e) {
+      return false;
+    }
+  }
+
   @Override
   public RecordReader getRecordReader(final InputSplit split, final JobConf job, final Reporter reporter) throws IOException {
+
+    if (HoodieFileGroupReaderRecordReader.useFilegroupReader(job)) {
+      try {

Review Comment:
https://github.com/apache/hudi/blob/2c38ef740d3d34e9eb05b59fa147c55623b81a90/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieFileGroupReaderRecordReader.java#L249
I remove the partition fields from the read columns if the parquet file doesn't contain them. Does that help?
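The idea behind `checkIfHudiTable` (resolve the table path for a split, then test for the metadata folder) can be sketched on a local filesystem. This is a simplified Python illustration under stated assumptions: the metadata folder name is assumed to be `.hoodie`, the table path is found by simply walking up the parent directories (the real `TablePathUtils.getTablePath` may use different logic), and `find_table_path`, `is_hudi_table`, and `demo` are hypothetical names.

```python
import tempfile
from pathlib import Path

# Assumed value of HoodieTableMetaClient.METAFOLDER_NAME.
META_FOLDER_NAME = ".hoodie"


def find_table_path(file_path):
    """Walk up from a data file and return the first ancestor directory
    that contains the metadata folder, or None if there is none."""
    for candidate in Path(file_path).parents:
        if (candidate / META_FOLDER_NAME).is_dir():
            return candidate
    return None


def is_hudi_table(file_path):
    """Mirror the hunk's behavior: any I/O error counts as 'not a Hudi table'."""
    try:
        return find_table_path(file_path) is not None
    except OSError:
        return False


def demo():
    with tempfile.TemporaryDirectory() as root:
        # A table directory with a metadata folder and one partitioned data file.
        table = Path(root) / "tbl"
        (table / META_FOLDER_NAME).mkdir(parents=True)
        partition = table / "2024" / "01"
        partition.mkdir(parents=True)
        data_file = partition / "f1.parquet"
        data_file.touch()

        # A plain directory without a metadata folder.
        plain_file = Path(root) / "plain" / "f2.parquet"
        plain_file.parent.mkdir()
        plain_file.touch()

        return is_hudi_table(data_file), is_hudi_table(plain_file)
```

`demo()` reports the partitioned file as belonging to a table and the plain file as not, provided no ancestor of the temporary directory happens to contain a `.hoodie` folder.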
Re: [PR] [HUDI-7045] Create parquet readers inside the reader context and implement schema.on.read in the filegroup reader in spark [hudi]
jonvex commented on PR #10278:
URL: https://github.com/apache/hudi/pull/10278#issuecomment-1917113112

Azure CI all passing @yihua
Re: [PR] added new videos for hudi oss site [hudi]
bhasudha commented on code in PR #10563:
URL: https://github.com/apache/hudi/pull/10563#discussion_r1471267966

##
website/videoBlog/2023-10-14-Accelerating-Data-Processing-Leveraging-Apache-Hudi-with-DynamoDB-for-Faster-Commit-Time-Retrieval.md:
##
@@ -8,7 +8,7 @@ image: /assets/images/video_blogs/2023-10-14-Accelerating-Data-Processing-Levera
 navigate: "https://www.youtube.com/watch?v=YF8zq_nuSHE";
 tags:
 - guide
-- dyanmodb
+- aws dyanmodb

Review Comment:
similar as above
Re: [PR] added new videos for hudi oss site [hudi]
bhasudha commented on code in PR #10563:
URL: https://github.com/apache/hudi/pull/10563#discussion_r1471267410

##
website/videoBlog/2023-08-06-Easy_Step_by_Step_Guide_for_Beginner_Setup_AWS_Transfer_Family_SFTP_with_S3.md:
##
@@ -11,7 +11,7 @@ tags:
 - third-party data
 - sftp
 - aws transfer family
-- amazon s3
+- aws s3

Review Comment:
similar as above
Re: [PR] added new videos for hudi oss site [hudi]
bhasudha commented on code in PR #10563:
URL: https://github.com/apache/hudi/pull/10563#discussion_r1471266750

##
website/videoBlog/2023-08-03-Powering_EventDriven_Workloads_with_Hudi_Read_Stream_AWS_Glue_Streaming_JOBS.md:
##
@@ -14,6 +14,6 @@ tags:
 - streaming
 - near real-time analytics
 - event bus
-- amazon sqs
+- aws sqs

Review Comment:
Let's leave it as amazon instead of aws. I chose this tag based on what the SQS documentation calls the service. For example, https://aws.amazon.com/sqs/ names it `Amazon SQS`, not `aws sqs`.
Re: [PR] [WIP] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive [hudi]
xiarixiaoyao commented on code in PR #10422: URL: https://github.com/apache/hudi/pull/10422#discussion_r1471121556

## hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/ObjectInspectorCache.java: ##

@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.hadoop.utils;
+
+import com.github.benmanes.caffeine.cache.Cache;
+import com.github.benmanes.caffeine.cache.Caffeine;
+import org.apache.avro.Schema;
+import org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector;
+import org.apache.hadoop.hive.serde.serdeConstants;
+import org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.mapred.JobConf;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+/**
+ * To read value from an ArrayWritable, an ObjectInspector is needed.
+ * Object inspectors are cached here or created using the column type map.
+ */
+public class ObjectInspectorCache {
+  private final Map columnTypeMap = new HashMap<>();
+  private final Cache
+      objectInspectorCache = Caffeine.newBuilder().maximumSize(1000).build();
+
+  public Map getColumnTypeMap() {
+    return columnTypeMap;
+  }
+
+  public ObjectInspectorCache(Schema tableSchema, JobConf jobConf) {
+    //From AbstractRealtimeRecordReader#prepareHiveAvroSerializer
+    // hive will append virtual columns at the end of column list. we should remove those columns.
+    // eg: current table is col1, col2, col3; jobConf.get(serdeConstants.LIST_COLUMNS): col1, col2, col3 ,BLOCK__OFFSET__INSIDE__FILE ...
+    Set writerSchemaColNames = tableSchema.getFields().stream().map(f -> f.name().toLowerCase(Locale.ROOT)).collect(Collectors.toSet());
+    List columnNameList = Arrays.stream(jobConf.get(serdeConstants.LIST_COLUMNS).split(",")).collect(Collectors.toList());
+    List columnTypeList = TypeInfoUtils.getTypeInfosFromTypeString(jobConf.get(serdeConstants.LIST_COLUMN_TYPES));
+
+    int columnNameListLen = columnNameList.size() - 1;
+    for (int i = columnNameListLen; i >= 0; i--) {
+      String lastColName = columnNameList.get(columnNameList.size() - 1);
+      // virtual columns will only append at the end of column list. it will be ok to break the loop.
+      if (writerSchemaColNames.contains(lastColName)) {
+        break;
+      }
+      columnNameList.remove(columnNameList.size() - 1);
+      columnTypeList.remove(columnTypeList.size() - 1);
+    }
+
+    //Use columnNameList.size() instead of columnTypeList because the type list is longer for some reason
+    IntStream.range(0, columnNameList.size()).boxed().forEach(i -> columnTypeMap.put(columnNameList.get(i),
+        TypeInfoUtils.getTypeInfosFromTypeString(columnTypeList.get(i).getQualifiedName()).get(0)));
+
+    StructTypeInfo rowTypeInfo = (StructTypeInfo) TypeInfoFactory.getStructTypeInfo(columnNameList, columnTypeList);
+    ArrayWritableObjectInspector objectInspector = new ArrayWritableObjectInspector(rowTypeInfo);

Review Comment:

> There may be compatibility issues between hive2 and hive3. DATE, TIMESTAMP

I think hive will handle this itself.
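The constructor quoted in this review drops the virtual columns that Hive appends to the end of the column list. The following is a minimal, dependency-free sketch of that trimming step only. The class `TrailingColumnTrimmer` and the method `trimTrailing` are hypothetical names, not part of the PR; column types are modelled as plain strings instead of Hive `TypeInfo` objects, and the sketch assumes both lists start with the same length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; it is not part of Hudi or Hive.
class TrailingColumnTrimmer {

  /**
   * Removes entries from the end of both lists until the last column name
   * is one that the table schema knows about. Columns in the middle of the
   * list are never touched, which mirrors the "break" in the quoted loop.
   */
  static void trimTrailing(List<String> columnNames, List<String> columnTypes, Set<String> schemaColumns) {
    while (!columnNames.isEmpty()
        && !schemaColumns.contains(columnNames.get(columnNames.size() - 1))) {
      columnNames.remove(columnNames.size() - 1);
      columnTypes.remove(columnTypes.size() - 1);
    }
  }

  public static void main(String[] args) {
    // Assumed example data: three table columns followed by two virtual columns.
    Set<String> schemaColumns = new HashSet<>(Arrays.asList("col1", "col2", "col3"));
    List<String> names = new ArrayList<>(Arrays.asList(
        "col1", "col2", "col3", "BLOCK__OFFSET__INSIDE__FILE", "INPUT__FILE__NAME"));
    List<String> types = new ArrayList<>(Arrays.asList(
        "int", "string", "double", "bigint", "string"));

    trimTrailing(names, types, schemaColumns);

    System.out.println(names); // [col1, col2, col3]
    System.out.println(types); // [int, string, double]
  }
}
```

A `while` loop on the last element expresses the same intent as the counted `for` loop in the diff, without the unused index variable.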
[I] [hudi bucket prune] [hudi]
lookingUpAtTheSky opened a new issue, #10589: URL: https://github.com/apache/hudi/issues/10589

**Describe the problem you faced**

Hello, I have two questions. First, is there a clear plan for Spark to support bucket pruning? Second, when we calculate the bucketId of a field value, which method is the right one to format the value: HoodieAvroUtils.convertValueForSpecificDataTypes, ExpressionUtils.getKeyFromLiteral, or something else? Thanks.
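The second question matters because a hash-based bucket id depends on the exact string form of the key value. The sketch below illustrates that point under a stated assumption: the bucket is taken as the non-negative hash of the formatted key modulo the bucket count. `BucketIdSketch` and `bucketId` are hypothetical names, and nothing here is Hudi's actual API or confirms which formatting method Hudi uses.

```java
import java.time.LocalDate;

// Hypothetical illustration only; not Hudi's bucket index implementation.
class BucketIdSketch {

  /**
   * Assumed scheme: clear the sign bit of the string hash so the result
   * of the modulo is always in the range [0, numBuckets).
   */
  static int bucketId(String formattedKey, int numBuckets) {
    return (formattedKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  public static void main(String[] args) {
    LocalDate date = LocalDate.of(2024, 1, 26);

    // The same logical value rendered in two different ways.
    String asIsoString = date.toString();                    // ISO-8601 date text
    String asEpochDays = String.valueOf(date.toEpochDay());  // days since 1970-01-01

    // The two renderings are different strings, so they may land in different buckets.
    System.out.println(asIsoString + " -> bucket " + bucketId(asIsoString, 8));
    System.out.println(asEpochDays + " -> bucket " + bucketId(asEpochDays, 8));
  }
}
```

If the writer and the pruning logic format a value differently, a query could look in the wrong bucket, so both sides need to agree on one formatting method.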
Re: [I] [SUPPORT] Querying Hudi tables with Spark+Velox(C++), ObjectSizeCalculator.getObjectSize hangs causing about a 50-second delay in queries [hudi]
majian1998 commented on issue #10580: URL: https://github.com/apache/hudi/issues/10580#issuecomment-1916596628 @ad1happy2go I understand that the issue started when the PR [HUDI-4687] introduced the use of jol to estimate object size.
Re: [I] [SUPPORT] Querying Hudi tables with Spark+Velox(C++), ObjectSizeCalculator.getObjectSize hangs causing about a 50-second delay in queries [hudi]
ad1happy2go commented on issue #10580: URL: https://github.com/apache/hudi/issues/10580#issuecomment-1916559418 @majian1998 Did this issue start after the 0.14.0 upgrade, or was it happening with older Hudi versions too?
Re: [I] [SUPPORT] Can't delete key (row) for all commits in HUDI Table (history)? [hudi]
ad1happy2go commented on issue #10581: URL: https://github.com/apache/hudi/issues/10581#issuecomment-1916557526 @jens4doc I don't think there is a way to achieve that.
Re: [PR] Hudi 6868 - Support extracting passwords from credential store for Hive Sync [hudi]
hudi-bot commented on PR #10577: URL: https://github.com/apache/hudi/pull/10577#issuecomment-1916344140

## CI report:

* 40cbc324442334d3e1313f995c8ae9feed7d0db7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [I] [Support] An error occurred while calling o1748.load.\n: java.io.FileNotFoundException [hudi]
gsudhanshu commented on issue #10503: URL: https://github.com/apache/hudi/issues/10503#issuecomment-1916285606 @ad1happy2go Thanks for your inputs. I changed the path and removed the unnecessary keys, but I am still facing the same FileNotFoundException. It seems I will have to downgrade to 0.13.1 and standalone mode.