Re: [I] [SUPPORT] Poor parallelism in BLOOM indexing stage with Hudi 0.12.3 [hudi]
ad1happy2go commented on issue #10115:
URL: https://github.com/apache/hudi/issues/10115#issuecomment-1813949629

@ChiehFu We should derive the partition count from the input data size. Too many partitions create extra tasks and add overhead time. Are those 2 tsv files gzipped? If yes, that is why the job has only 3 tasks: gzip is an unsplittable format.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
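One way to answer the "are those files gzipped" question without trusting file extensions is to look at the first two bytes of each input. The sketch below is only an illustration for files reachable through the local filesystem; `is_gzipped` and `count_unsplittable` are hypothetical helper names, and inputs on an object store would need that store's own client to read the leading bytes.

```python
# gzip streams always start with the two magic bytes 0x1f 0x8b (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def count_unsplittable(paths):
    """Count the inputs that are gzip streams, which cannot be split across tasks."""
    return sum(1 for p in paths if is_gzipped(p))
```

If every input turns out to be gzipped, the read stage cannot split any single file, so an explicit repartition after the read is the usual way to get more parallelism downstream.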
Re: [I] [SUPPORT] Poor parallelism in BLOOM indexing stage with Hudi 0.12.3 [hudi]
ChiehFu commented on issue #10115:
URL: https://github.com/apache/hudi/issues/10115#issuecomment-1813946952

@ad1happy2go I see. In this case there were only 2 tsv files with a total size of 115.7 MiB.

https://github.com/apache/hudi/assets/11819388/da21e2b8-4061-455d-bdd2-d9a33ebba051

Is repartitioning by 1 something we could apply universally to all our tables regardless of input data size? Or would it be better to derive the value from some factor such as input size? And would over-repartitioning harm upsert performance?
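As a rough illustration of deriving the value from input size, the sketch below divides the total input bytes by a target partition size and clamps the result. The function name `suggest_partitions`, the 128 MiB default target, and the 2000 upper bound are all assumptions made for this example, not Hudi or Spark recommendations; the right target depends on the cluster and on whether the input is compressed.

```python
import math

MIB = 1024 * 1024


def suggest_partitions(total_bytes, target_bytes=128 * MIB, min_parts=1, max_parts=2000):
    """Pick a partition count so each partition holds roughly `target_bytes`.

    The result is clamped to [min_parts, max_parts] so that tiny inputs still
    get one partition and huge inputs do not create an excessive task count.
    """
    if total_bytes < 0 or target_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and target_bytes must be > 0")
    wanted = math.ceil(total_bytes / target_bytes)
    return max(min_parts, min(max_parts, wanted))
```

For the 115.7 MiB input mentioned in this thread, `suggest_partitions(int(115.7 * MIB))` gives 1 with the assumed 128 MiB target, and `suggest_partitions(int(115.7 * MIB), target_bytes=16 * MIB)` gives 8. The clamp is the part that guards against the over-repartitioning concern.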
Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]
hudi-bot commented on PR #10101:
URL: https://github.com/apache/hudi/pull/10101#issuecomment-1813945486

## CI report:

* 178ef4eadac6ab6d009d86ab86d35babe952 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20942)

Bot commands
@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7071] Throw exceptions when clustering/index job fail [hudi]
hudi-bot commented on PR #10050:
URL: https://github.com/apache/hudi/pull/10050#issuecomment-1813945205

## CI report:

* 40caf2cf77aa03c17ee84077b6c2d4752c542d48 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20815)
* a46978c942649269675db590f2f65186b636e70a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20945)
Re: [PR] [HUDI-7090]Set the maxParallelism for singleton operator [hudi]
danny0405 commented on code in PR #10090:
URL: https://github.com/apache/hudi/pull/10090#discussion_r1395276883

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java:
##
@@ -410,10 +410,11 @@ public static DataStream hoodieStreamWrite(Configuration conf, DataStrea
    * @return the compaction pipeline
    */
   public static DataStreamSink compact(Configuration conf, DataStream dataStream) {
-    return dataStream.transform("compact_plan_generate",
+    DataStreamSink compactionCommitEventDataStream = dataStream.transform("compact_plan_generate",
             TypeInformation.of(CompactionPlanEvent.class),
             new CompactionPlanOperator(conf))
         .setParallelism(1) // plan generate must be singleton
+        .setMaxParallelism(1)

Review Comment:
Is this line compatible with Flink releases before 1.18?

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java:
##
@@ -207,6 +207,7 @@ public DataStream produceDataStream(StreamExecutionEnvironment execEnv)
     SingleOutputStreamOperator source = execEnv.addSource(monitoringFunction, getSourceOperatorName("split_monitor"))
         .uid(Pipelines.opUID("split_monitor", conf))
         .setParallelism(1)
+        .setMaxParallelism(1)

Review Comment:
Is this line compatible with Flink releases before 1.18?
Re: [PR] [MINOR] Modified description to include missing trigger strategy [hudi]
voonhous commented on code in PR #10114:
URL: https://github.com/apache/hudi/pull/10114#discussion_r1395275760

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##
@@ -41,6 +41,7 @@ import org.apache.hudi.keygen.constant.KeyGeneratorType;
 import org.apache.hudi.sink.overwrite.PartitionOverwriteMode;
 import org.apache.hudi.table.action.cluster.ClusteringPlanPartitionFilterMode;
+import org.apache.hudi.table.action.compact.CompactionTriggerStrategy;
 import org.apache.hudi.util.ClientIds;

Review Comment:
Done
Re: [PR] [HUDI-7105] support filesystem view configuable [hudi]
danny0405 commented on code in PR #10116:
URL: https://github.com/apache/hudi/pull/10116#discussion_r1395275031

## hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java:
##
@@ -279,7 +278,13 @@ public static FileSystemViewManager createViewManager(final HoodieEngineContext
           throw new IllegalArgumentException("Secondary Storage type can only be in-memory or spillable. Was :" + viewConfig.getSecondaryStorageType());
         }
-        return new PriorityBasedFileSystemView(remoteFileSystemView, secondaryView);
+        if (config.isRemoteViewFirst()) {
+          LOG.info("Creating remote table view first");
+          return new PriorityBasedFileSystemView(remoteFileSystemView, secondaryView);
+        } else {
+          LOG.info("Creating secondary table view first");

Review Comment:
cc @zhedoubushishi, who has also encountered OOM during async cleaning.
Re: [PR] [MINOR] Modified description to include missing trigger strategy [hudi]
voonhous commented on code in PR #10114:
URL: https://github.com/apache/hudi/pull/10114#discussion_r1395274432

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##
@@ -41,6 +41,7 @@ import org.apache.hudi.keygen.constant.KeyGeneratorType;
 import org.apache.hudi.sink.overwrite.PartitionOverwriteMode;
 import org.apache.hudi.table.action.cluster.ClusteringPlanPartitionFilterMode;
+import org.apache.hudi.table.action.compact.CompactionTriggerStrategy;
 import org.apache.hudi.util.ClientIds;

Review Comment:
My bad, I was doing some debugging and forgot to remove it. Will remove it.
Re: [PR] [MINOR] Modified description to include missing trigger strategy [hudi]
danny0405 commented on code in PR #10114:
URL: https://github.com/apache/hudi/pull/10114#discussion_r1395272032

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##
@@ -41,6 +41,7 @@ import org.apache.hudi.keygen.constant.KeyGeneratorType;
 import org.apache.hudi.sink.overwrite.PartitionOverwriteMode;
 import org.apache.hudi.table.action.cluster.ClusteringPlanPartitionFilterMode;
+import org.apache.hudi.table.action.compact.CompactionTriggerStrategy;
 import org.apache.hudi.util.ClientIds;

Review Comment:
Why is the import of this class needed?
Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]
danny0405 commented on code in PR #10101:
URL: https://github.com/apache/hudi/pull/10101#discussion_r1395270648

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/HoodieTimelineArchiver.java:
##
@@ -117,6 +122,10 @@ public boolean archiveIfRequired(HoodieEngineContext context, boolean acquireLoc
       } else {
         LOG.info("No Instants to archive");
       }
+      if (success && timerContext != null) {
+        long durationMs = metrics.getDurationInMs(timerContext.stop());

Review Comment:
Can we move the metrics handling to the write client or the service client? The cleaning and rollback already follow this pattern.
Re: [I] [SUPPORT] NotSerializableException using SparkRDDWriteClient with OCC and DynamoDBBasedLockProvider [hudi]
chym1303 commented on issue #9807:
URL: https://github.com/apache/hudi/issues/9807#issuecomment-1813925321

Hi @ad1happy2go, DynamoDBBasedLockProvider and HiveMetastoreBasedLockProvider have the same issue as https://issues.apache.org/jira/browse/HUDI-3638: task not serializable in the clean action.
Re: [I] [SUPPORT] Poor parallelism in BLOOM indexing stage with Hudi 0.12.3 [hudi]
ad1happy2go commented on issue #10115:
URL: https://github.com/apache/hudi/issues/10115#issuecomment-1813912870

@ChiehFu The number of tasks in the tagging step depends on how many partitions there are in the input DataFrame. By any chance, are you getting large zipped files in the source? You can repartition before writing to Hudi: `df.repartition(1).write.format`
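A minimal sketch of the repartition-before-write suggestion, assuming a PySpark DataFrame. The helper name `write_with_parallelism` and its parameters are hypothetical; the calls it makes (`df.rdd.getNumPartitions()`, `df.repartition(n)`, and the `write.format(...).options(...).mode(...).save(...)` chain) are standard PySpark DataFrame APIs.

```python
def write_with_parallelism(df, num_partitions, hudi_options, path):
    """Repartition `df` when needed, then append it to the Hudi table at `path`.

    Returns the partition count the DataFrame had before any repartitioning,
    which is useful to log when diagnosing a low task count.
    """
    current = df.rdd.getNumPartitions()
    if current != num_partitions:
        # repartition() triggers a full shuffle, so skip it when not needed.
        df = df.repartition(num_partitions)
    (df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save(path))
    return current
```

Logging the returned value before and after a change makes it easy to confirm whether the input really arrived with only a handful of partitions.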
[jira] [Updated] (HUDI-7105) Add FileSystemViewManager configuable
[ https://issues.apache.org/jira/browse/HUDI-7105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7105:
---------------------------------
    Labels: clean pull-request-available  (was: clean)

> Add FileSystemViewManager configuable
> -------------------------------------
>
>                 Key: HUDI-7105
>                 URL: https://issues.apache.org/jira/browse/HUDI-7105
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: kwang
>            Priority: Major
>              Labels: clean, pull-request-available
>
> If there exists many partitions and files When generating the clean plan,
> it's easy to throw oom exception. Using secondaryFileSystemView first is more
> stable than remoteFileSystemView.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7105] support filesystem view configuable [hudi]
ksmou opened a new pull request, #10116:
URL: https://github.com/apache/hudi/pull/10116

### Change Logs

When there are many partitions and files, generating the clean plan can easily throw an OOM exception. By default the remote table view is used first, and it cannot fall back to the secondary table view if the remote view throws an OOM exception. Using the secondary view first is more stable than using remoteFileSystemView first.

### Impact

N/A

### Risk level (write none, low medium or high below)

none

### Documentation Update

N/A

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
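The idea of choosing which view is consulted first can be sketched as follows. This is an illustration only and not Hudi's implementation: views are modelled as plain callables, and `PriorityView`, `create_view`, `latest_files` and `remote_view_first` are hypothetical names. The assumption that the non-default ordering simply swaps the two views is also a guess, since the corresponding branch of the change is not shown in this thread.

```python
class PriorityView:
    """Query `preferred` first; after its first failure, use `fallback` only."""

    def __init__(self, preferred, fallback):
        self.preferred = preferred
        self.fallback = fallback
        self.preferred_failed = False

    def latest_files(self, partition):
        if not self.preferred_failed:
            try:
                return self.preferred(partition)
            except Exception:
                # Remember the failure so later calls go straight to the fallback.
                self.preferred_failed = True
        return self.fallback(partition)


def create_view(remote, secondary, remote_view_first=True):
    """Build a view whose lookup order is controlled by a single flag."""
    if remote_view_first:
        return PriorityView(remote, secondary)
    return PriorityView(secondary, remote)
```

With `remote_view_first=False` the remote callable is never invoked as long as the secondary view succeeds, which is the behaviour the description argues is more stable for very large clean plans.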
Re: [PR] [HUDI-7071] Throw exceptions when clustering/index job fail [hudi]
hudi-bot commented on PR #10050:
URL: https://github.com/apache/hudi/pull/10050#issuecomment-1813899879

## CI report:

* 40caf2cf77aa03c17ee84077b6c2d4752c542d48 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20815)
* a46978c942649269675db590f2f65186b636e70a UNKNOWN
Re: [PR] [MINOR] Modified description to include missing trigger strategy [hudi]
hudi-bot commented on PR #10114:
URL: https://github.com/apache/hudi/pull/10114#issuecomment-1813884877

## CI report:

* 5152ea66bd6f4a3c3f506bfe051ef4122973e908 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20941)
Re: [PR] [HUDI-7090]Set the maxParallelism for singleton operator [hudi]
hudi-bot commented on PR #10090:
URL: https://github.com/apache/hudi/pull/10090#issuecomment-1813884740

## CI report:

* 35219c2180342faea6e09987e69271508a3f0096 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20912)
* 36d5e48d7b41740a2f94be92dd0fb45cbe4806de Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20944)
Re: [PR] [HUDI-7090]Set the maxParallelism for singleton operator [hudi]
hudi-bot commented on PR #10090:
URL: https://github.com/apache/hudi/pull/10090#issuecomment-1813877522

## CI report:

* 35219c2180342faea6e09987e69271508a3f0096 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20912)
* 36d5e48d7b41740a2f94be92dd0fb45cbe4806de UNKNOWN
Re: [PR] [HUDI-6658] inject filters for incremental query [hudi]
hudi-bot commented on PR #10063:
URL: https://github.com/apache/hudi/pull/10063#issuecomment-1813877422

## CI report:

* edb9997799c672e69a5a81271f32504e270846d2 UNKNOWN
* 34efaac278dde7fd73515e6d54418a6ff8815326 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20939)
Re: [I] [SUPPORT] Query failure due to replacecommit being archived [hudi]
ad1happy2go commented on issue #10107:
URL: https://github.com/apache/hudi/issues/10107#issuecomment-1813869857

@haoxie-aws I tried to reproduce this with the OSS version but was not able to. Can you try with a later version? Below is the code I used.

Writer
```
import uuid
from datetime import datetime

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType

# get_spark_session, TABLE_NAME and PATH are defined elsewhere (not part of the posted snippet)
spark = get_spark_session(spark_version="3.2", hudi_version="0.11.0")

def generateDataFrame():
    # Define the schema for the DataFrame
    schema = StructType([
        StructField("uuid", StringType(), True),
        StructField("index", StringType(), True),
        StructField("timestamp", StringType(), True)
    ])
    # Create a list of Row objects
    data = [Row(str(uuid.uuid4()), str(i), str(datetime.now())) for i in range(11)]
    # Parallelize the data using SparkContext and create an RDD
    rdd = spark.sparkContext.parallelize(data)
    # Create a DataFrame from the RDD and schema
    df = spark.createDataFrame(rdd, schema)
    return df

def loop():
    # Hudi write options
    hudi_options = {
        "hoodie.table.name": TABLE_NAME,
        "hoodie.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.recordkey.field": "uuid",
        "hoodie.datasource.write.precombine.field": "timestamp",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.parquet.max.file.size" : "20971520",  # 20MB
        "hoodie.parquet.small.file.limit" : "0",
        "hoodie.keep.max.commits" : "12",
        "hoodie.keep.min.commits" : "11",
        "hoodie.bulkinsert.sort.mode" : "NONE",
        "hoodie.clustering.inline" : "true",
        "hoodie.clustering.inline.max.commits" : "2",
        "hoodie.clustering.plan.strategy.small.file.limit" : "20971520",  # 20MB
        # key is reproduced as posted, without a "hoodie." prefix
        "clustering.plan.strategy.target.file.max.bytes" : "31457280",  # 30 MB
        "hoodie.metadata.enable" : "true"
    }
    # Write DataFrame to Hudi
    generateDataFrame().write.options(**hudi_options).format("org.apache.hudi") \
        .option("hoodie.datasource.write.hive_style_partitioning", "true") \
        .mode("append") \
        .save(PATH)

if __name__ == "__main__":
    for _ in range(1001):
        loop()
```

READER
```
spark = get_spark_session(spark_version="3.2", hudi_version="0.11.0")

def loop():
    print(spark.read.format("hudi").load(PATH).count())
    spark.read.format("hudi").load(PATH).show()

if __name__ == "__main__":
    for _ in range(1001):
        loop()
```
Re: [I] [SUPPORT] Poor parallelism in BLOOM indexing stage with Hudi 0.12.3 [hudi]
ChiehFu commented on issue #10115:
URL: https://github.com/apache/hudi/issues/10115#issuecomment-1813862633

@ad1happy2go I am not sure why it only had 3 tasks. This particular upsert job upserted 328,550 records.

https://github.com/apache/hudi/assets/11819388/4826e4b8-9a90-4973-b040-6decb711fda2
[jira] [Updated] (HUDI-7105) Add FileSystemViewManager configuable
[ https://issues.apache.org/jira/browse/HUDI-7105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kwang updated HUDI-7105:
------------------------
    Description: If there exists many partitions and files When generating the clean plan, it's easy to throw oom exception. Using secondaryFileSystemView first is more stable than remoteFileSystemView.  (was: If there exists mang partitions and files When generating the clean plan, it's easy to throw oom exception. Using secondaryFileSystemView first is more stable than remoteFileSystemView.)

> Add FileSystemViewManager configuable
> -------------------------------------
>
>                 Key: HUDI-7105
>                 URL: https://issues.apache.org/jira/browse/HUDI-7105
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: kwang
>            Priority: Major
>              Labels: clean
>
> If there exists many partitions and files When generating the clean plan,
> it's easy to throw oom exception. Using secondaryFileSystemView first is more
> stable than remoteFileSystemView.
[jira] [Updated] (HUDI-7105) Add FileSystemViewManager configuable
[ https://issues.apache.org/jira/browse/HUDI-7105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kwang updated HUDI-7105:
------------------------
    Description: If there exists mang partitions and files When generating the clean plan, it's easy to throw oom exception. Using secondaryFileSystemView first is more stable than remoteFileSystemView.  (was: If there exists mang partitions and files When generating the clean plan, it's easy to throw oom exception. Using secondaryFileSystemView is more stable than remoteFileSystemView.)

> Add FileSystemViewManager configuable
> -------------------------------------
>
>                 Key: HUDI-7105
>                 URL: https://issues.apache.org/jira/browse/HUDI-7105
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: kwang
>            Priority: Major
>              Labels: clean
>
> If there exists mang partitions and files When generating the clean plan,
> it's easy to throw oom exception. Using secondaryFileSystemView first is more
> stable than remoteFileSystemView.
Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]
hudi-bot commented on PR #10101:
URL: https://github.com/apache/hudi/pull/10101#issuecomment-1813839813

## CI report:

* 2f97634b8b59e9f61dc05b649e78f9fe747c5ee5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20922)
* 178ef4eadac6ab6d009d86ab86d35babe952 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20942)
Re: [PR] [HUDI-6887] Add test for Record Index and MIT queries [hudi]
lokeshj1703 commented on PR #9760:
URL: https://github.com/apache/hudi/pull/9760#issuecomment-1813833900

The test added here is passing locally but failing in the CI. I have to debug and fix the CI failure.
Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]
hudi-bot commented on PR #10101:
URL: https://github.com/apache/hudi/pull/10101#issuecomment-1813833815

## CI report:

* 2f97634b8b59e9f61dc05b649e78f9fe747c5ee5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20922)
* 178ef4eadac6ab6d009d86ab86d35babe952 UNKNOWN
(hudi) branch master updated (35af64db466 -> 874b5dec5e9)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

    from 35af64db466 [Minor] Throw exceptions when cleaner/compactor fail (#10108)
     add 874b5dec5e9 [HUDI-6806] Support Spark 3.5.0 (#9717)

No new revisions were added by this update.

Summary of changes:
 .github/workflows/bot.yml | 13 +++
 .../scala/org/apache/hudi/HoodieSparkUtils.scala | 2 +
 .../org/apache/hudi/SparkAdapterSupport.scala | 4 +-
 .../scala/org/apache/spark/sql/DataFrameUtil.scala | 6 +-
 .../spark/sql/HoodieCatalystExpressionUtils.scala | 16 ++--
 .../org/apache/spark/sql/HoodieSchemaUtils.scala | 9 +++
 .../org/apache/spark/sql/HoodieUnsafeUtils.scala | 13 +--
 .../HoodieSparkPartitionedFileUtils.scala | 20 +++--
 .../org/apache/spark/sql/hudi/SparkAdapter.scala | 5 +-
 .../org/apache/hudi/avro/TestHoodieAvroUtils.java | 4 +-
 .../hudi/common/util/TestClusteringUtils.java | 2 +
 .../dag/nodes/BaseValidateDatasetNode.java | 13 +--
 .../scala/org/apache/hudi/HoodieBaseRelation.scala | 4 +-
 .../scala/org/apache/hudi/HoodieCDCFileIndex.scala | 2 +-
 .../scala/org/apache/hudi/HoodieFileIndex.scala | 9 ++-
 .../apache/hudi/HoodieIncrementalFileIndex.scala | 9 ++-
 .../datasources/HoodieInMemoryFileIndex.scala | 5 +-
 .../hudi/testutils/SparkDatasetTestUtils.java | 19 ++---
 hudi-spark-datasource/hudi-spark/pom.xml | 30 +++
 .../spark/sql/hudi/analysis/HoodieAnalysis.scala | 19 -
 .../hudi/command/CallProcedureHoodieCommand.scala | 6 +-
 .../hudi/command/CompactionHoodiePathCommand.scala | 5 +-
 .../command/CompactionHoodieTableCommand.scala | 5 +-
 .../command/CompactionShowHoodiePathCommand.scala | 5 +-
 .../command/CompactionShowHoodieTableCommand.scala | 5 +-
 .../command/InsertIntoHoodieTableCommand.scala | 10 ++-
 .../TestBulkInsertInternalPartitionerForRows.java | 0
 .../TestHoodieDatasetBulkInsertHelper.java | 19 ++---
 .../row/TestHoodieInternalRowParquetWriter.java | 0
 .../io/storage/row/TestHoodieRowCreateHandle.java | 14 +++-
 .../hudi/testutils/KeyGeneratorTestUtilities.java | 20 ++---
 .../org/apache/hudi/TestAvroConversionUtils.scala | 2 +-
 .../read/TestHoodieFileGroupReaderOnSpark.scala | 9 ++-
 .../apache/spark/sql/hudi/TestInsertTable.scala | 22 +-
 hudi-spark-datasource/hudi-spark2/pom.xml | 8 ++
 .../sql/HoodieSpark2CatalystExpressionUtils.scala | 7 +-
 .../apache/spark/sql/HoodieSpark2SchemaUtils.scala | 6 ++
 .../apache/spark/sql/adapter/Spark2Adapter.scala | 7 +-
 .../HoodieSpark2PartitionedFileUtils.scala | 12 ++-
 .../HoodieBulkInsertInternalWriterTestBase.java | 0
 .../apache/hudi/spark3/internal/ReflectUtil.java | 8 +-
 .../spark/sql/adapter/BaseSpark3Adapter.scala | 6 +-
 hudi-spark-datasource/hudi-spark3.0.x/pom.xml | 15
 .../sql/HoodieSpark30CatalystExpressionUtils.scala | 7 +-
 .../spark/sql/HoodieSpark30SchemaUtils.scala | 6 ++
 .../HoodieSpark30PartitionedFileUtils.scala | 12 ++-
 .../HoodieBulkInsertInternalWriterTestBase.java | 0
 .../TestHoodieBulkInsertDataInternalWriter.java | 0
 .../TestHoodieDataSourceInternalBatchWrite.java | 0
 hudi-spark-datasource/hudi-spark3.1.x/pom.xml | 15
 .../sql/HoodieSpark31CatalystExpressionUtils.scala | 8 +-
 .../spark/sql/HoodieSpark31SchemaUtils.scala | 6 ++
 .../HoodieSpark31PartitionedFileUtils.scala | 12 ++-
 .../HoodieBulkInsertInternalWriterTestBase.java | 0
 .../TestHoodieBulkInsertDataInternalWriter.java | 0
 .../TestHoodieDataSourceInternalBatchWrite.java | 0
 hudi-spark-datasource/hudi-spark3.2.x/pom.xml | 8 +-
 .../sql/HoodieSpark32CatalystExpressionUtils.scala | 7 +-
 .../spark/sql/HoodieSpark32SchemaUtils.scala | 6 ++
 .../HoodieSpark32PartitionedFileUtils.scala | 12 ++-
 .../parquet/Spark32DataSourceUtils.scala} | 2 +-
 .../Spark32LegacyHoodieParquetFileFormat.scala | 10 +--
 .../sql/hudi/analysis/HoodieSpark32Analysis.scala | 66
 .../HoodieBulkInsertInternalWriterTestBase.java | 0
 .../TestHoodieBulkInsertDataInternalWriter.java | 0
 .../TestHoodieDataSourceInternalBatchWrite.java | 0
 .../hudi/analysis/HoodieSpark32PlusAnalysis.scala | 28 ---
 .../sql/HoodieSpark33CatalystExpressionUtils.scala | 9 ++-
 .../spark/sql/HoodieSpark33SchemaUtils.scala | 6 ++
 .../HoodieSpark33PartitionedFileUtils.scala | 12 ++-
 .../parquet/Spark33DataSourceUtils.scala} | 2 +-
 .../Spark33LegacyHoodieParquetFileFormat.scala | 10 +--
 .../sql/hudi/analysis/HoodieSpark33Analysis.scala | 66
 .../HoodieBulkInsertInternalWriterTestBase.java | 0
 .../hudi/spark3/internal/TestReflectUtil.java | 3 +-
 .../sql/HoodieSpark34CatalystExpressionUtils.scala | 7 +-
 .../spark/sql/Hood
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
yihua merged PR #9717:
URL: https://github.com/apache/hudi/pull/9717
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
yihua commented on PR #9717:
URL: https://github.com/apache/hudi/pull/9717#issuecomment-1813833441

Azure CI on master also fails on the fourth task. Merging this PR.

https://github.com/apache/hudi/assets/2497195/fdcd6011-d90a-4861-a9a4-c21ed62414ce
Re: [PR] [MINOR][DNM] Add logs to test runs [hudi]
hudi-bot commented on PR #10111:
URL: https://github.com/apache/hudi/pull/10111#issuecomment-1813828330

## CI report:

* 65c56d302e05ac18639929442f9b533d11f38ed5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20938)
Re: [PR] [HUDI-6658] inject filters for incremental query [hudi]
hudi-bot commented on PR #10063: URL: https://github.com/apache/hudi/pull/10063#issuecomment-1813828231 ## CI report: * edb9997799c672e69a5a81271f32504e270846d2 UNKNOWN * 2c51a6c39ee41fac34110a41f943a3f1dee93f0f Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20936) * 34efaac278dde7fd73515e6d54418a6ff8815326 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20939)
Re: [I] [SUPPORT]When hudi integrates hive, an error is reported when the hive external table is queried [hudi]
Jackkaabe commented on issue #10084: URL: https://github.com/apache/hudi/issues/10084#issuecomment-1813823608

> @Jackkaabe This happens due to a conflict with the parquet dependency. You can try shading the parquet jars and rebuilding by adding the following configuration to the Flink-bundle pom.xml.
>
> ```
> <relocation>
>   <pattern>org.apache.parquet</pattern>
>   <shadedPattern>${flink.bundle.shade.prefix}org.apache.parquet</shadedPattern>
> </relocation>
> ```
>
> cc @danny0405

I did it, but still got the same error.
Re: [I] [SUPPORT] Poor parallelism in BLOOM indexing stage with Hudi 0.12.3 [hudi]
ad1happy2go commented on issue #10115: URL: https://github.com/apache/hudi/issues/10115#issuecomment-1813816561 @ChiehFu Do you know of any particular reason why it's taking only 3 tasks? Can you paste the full UI for one of the jobs? We need to check how many tasks it creates for the tagging stage.
[jira] [Assigned] (HUDI-7104) Cleaner could miss to clean up some files w/ savepoint interplay
[ https://issues.apache.org/jira/browse/HUDI-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan reassigned HUDI-7104:

Assignee: sivabalan narayanan

> Cleaner could miss to clean up some files w/ savepoint interplay
>
> Key: HUDI-7104
> URL: https://issues.apache.org/jira/browse/HUDI-7104
> Project: Apache Hudi
> Issue Type: Improvement
> Components: cleaning
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Major
>
> Let's say partitioning is day based and is based on created date. So, older partitions generally do not get any new data after a few days.
>
> Let's say we have savepoints added to a day and later removed.
> day1: cleaned up.
> day2: savepoint added, and so the cleaner ignored it.
> day3: cleaned up.
> day4: earliest commit to retain based on cleaner configs.
>
> So, w/ this table/timeline state, if we remove the savepointed commit, data pertaining to day2 will never be cleaned by the cleaner, since it is earlier than the earliest commit to retain.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
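The scenario described in HUDI-7104 can be replayed with a small toy model. This is only a sketch under stated assumptions: `ToyCleaner` is a hypothetical class, not Hudi's cleaner, and it assumes an incremental cleaner that only examines commits between the previous "earliest commit to retain" and the current one. Days are modelled as integers.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCleaner:
    """Hypothetical incremental cleaner; NOT Hudi's implementation."""
    last_retained: int = 1                      # earliest commit retained by the previous clean
    cleaned: set = field(default_factory=set)   # days whose old files were cleaned

    def clean(self, earliest_to_retain: int, savepoints: set) -> None:
        # Assumption: only days in [last_retained, earliest_to_retain) are examined.
        for day in range(self.last_retained, earliest_to_retain):
            if day not in savepoints:           # savepointed days are skipped
                self.cleaned.add(day)
        self.last_retained = earliest_to_retain


cleaner = ToyCleaner()

# First clean: day 4 is the earliest commit to retain, day 2 carries a savepoint.
cleaner.clean(earliest_to_retain=4, savepoints={2})
after_first = sorted(cleaner.cleaned)

# The savepoint on day 2 is removed, then a later clean runs.
cleaner.clean(earliest_to_retain=5, savepoints=set())
after_second = sorted(cleaner.cleaned)

print(after_first)   # [1, 3]
print(after_second)  # [1, 3, 4]
```

In this model day 2 is never cleaned: by the time its savepoint is gone, it already lies before the window the cleaner examines.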
Re: [I] [SUPPORT] hudi sql task hang java.lang.System.exit block [hudi]
zyclove commented on issue #10112: URL: https://github.com/apache/hudi/issues/10112#issuecomment-1813780480 I will check it and retry. Thanks
Re: [I] [SUPPORT] RFC 63 Functional Index Hudi 0.1.0-beta [hudi]
codope commented on issue #10110: URL: https://github.com/apache/hudi/issues/10110#issuecomment-1813780193 Hi @soumilshah1995, thanks for giving it a try! Currently, the `FUNCTION` keyword is not integrated. I need to update the RFC with the exact syntax, which can be found in the SQL DDL docs: https://hudi.apache.org/docs/next/sql_ddl#create-index-experimental We are tracking the issue to simplify the syntax. Ideally, we want users to be able to just say `CREATE INDEX func_index_abc on xyz_hudi_table USING column_stats(hour(ts))` without using the `FUNCTION` keyword or providing extra options to specify the function. We will have it in 1.0 GA. Feel free to reach out to me directly on Hudi Slack if you're interested in this feature.
[I] [SUPPORT] Poor parallelism in BLOOM indexing stage with Hudi 0.12.3 [hudi]
ChiehFu opened a new issue, #10115: URL: https://github.com/apache/hudi/issues/10115

**Describe the problem you faced**

Hello,

Recently we migrated our datasets from Hudi 0.8 to Hudi 0.12.3 and started experiencing slowness in the indexing stage of some tables during upserts.

After looking into the Spark steps, we found one particular stage where Hudi set a very low parallelism for an indexing stage (stage 32 in the screenshot). This caused a long duration and shuffle spill, which slowed the stage down further.

We set `hoodie.bloom.index.parallelism=2000`; however, it doesn't seem to affect the parallelism of that particular stage. In Hudi 0.8, Hudi used the value we set in `hoodie.upsert.shuffle.parallelism` as the parallelism for this stage, but in Hudi 0.12 the parallelism seems to be calculated dynamically.

Can you please help us understand if there is any Hudi configuration we should use to increase the parallelism for the stage?

We also tried setting `hoodie.copyonwrite.record.size.estimate` to a very small value, as it seems to help force Hudi to use a larger parallelism for indexing initially, but the effect is very inconsistent: we still see small values being set for the stage across upsert jobs.

**Environment Description**

* Hudi version : 0.12.3
* Spark version : 3.1.3
* Hive version : 3.1.3
* Hadoop version : 3.3.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
* EMR: 6.10.0

**Additional context**

Hudi configs

```
hoodie.metadata.enable: true
hoodie.metadata.validate: true
hoodie.cleaner.commits.retained: 72
hoodie.keep.min.commits: 100
hoodie.keep.max.commits: 150
hoodie.datasource.write.payload.class: org.apache.hudi.common.model.DefaultHoodieRecordPayload
hoodie.index.type: BLOOM
hoodie.bloom.index.parallelism: 2000
hoodie.copyonwrite.record.size.estimate: 1
hoodie.metadata.enable: true
hoodie.datasource.write.table.type: COPY_ON_WRITE
hoodie.insert.shuffle.parallelism: 1500
hoodie.datasource.write.operation: upsert
hoodie.datasource.hive_sync.partition_extractor_class: org.apache.hudi.hive.MultiPartKeysValueExtractor
hoodie.datasource.write.keygenerator.class: org.apache.hudi.keygen.ComplexKeyGenerator
```

https://github.com/apache/hudi/assets/11819388/e0381e62-0690-4bce-8fe3-15f6590870bb

https://github.com/apache/hudi/assets/11819388/3a481bb8-c6ff-456d-baca-b60b22788abf
Re: [PR] [MINOR] Modified description to include missing trigger strategy [hudi]
hudi-bot commented on PR #10114: URL: https://github.com/apache/hudi/pull/10114#issuecomment-1813755365 ## CI report: * 5152ea66bd6f4a3c3f506bfe051ef4122973e908 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20941)
Re: [PR] [MINOR] Modified description to include missing trigger strategy [hudi]
hudi-bot commented on PR #10114: URL: https://github.com/apache/hudi/pull/10114#issuecomment-1813750459 ## CI report: * 5152ea66bd6f4a3c3f506bfe051ef4122973e908 UNKNOWN
Re: [PR] [HUDI-7090]Set the maxParallelism for singleton operator [hudi]
hudi-bot commented on PR #10090: URL: https://github.com/apache/hudi/pull/10090#issuecomment-1813750327 ## CI report: * 35219c2180342faea6e09987e69271508a3f0096 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20912)
Re: [PR] [MINOR][DNM] Full test runtime 2 [hudi]
hudi-bot commented on PR #10113: URL: https://github.com/apache/hudi/pull/10113#issuecomment-1813750425 ## CI report: * 272f308766e7bfeaf03d7d5bfc9b15cd4bf92a15 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20940)
Re: [PR] [HUDI-6658] inject filters for incremental query [hudi]
hudi-bot commented on PR #10063: URL: https://github.com/apache/hudi/pull/10063#issuecomment-1813750267 ## CI report: * edb9997799c672e69a5a81271f32504e270846d2 UNKNOWN * d22fcb976c5c468cb129abf9c4ee200eb249fb73 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20934) * 2c51a6c39ee41fac34110a41f943a3f1dee93f0f Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20936) * 34efaac278dde7fd73515e6d54418a6ff8815326 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20939)
[jira] [Created] (HUDI-7105) Add FileSystemViewManager configuable
kwang created HUDI-7105:

Summary: Add FileSystemViewManager configuable
Key: HUDI-7105
URL: https://issues.apache.org/jira/browse/HUDI-7105
Project: Apache Hudi
Issue Type: Improvement
Reporter: kwang

If there are many partitions and files, generating the clean plan can easily throw an OOM exception. Using secondaryFileSystemView is more stable than remoteFileSystemView.
Re: [PR] [MINOR][DNM] Full test runtime 2 [hudi]
hudi-bot commented on PR #10113: URL: https://github.com/apache/hudi/pull/10113#issuecomment-1813745670 ## CI report: * 272f308766e7bfeaf03d7d5bfc9b15cd4bf92a15 UNKNOWN
Re: [PR] [HUDI-7103] Support time travel queies for COW tables [hudi]
hudi-bot commented on PR #10109: URL: https://github.com/apache/hudi/pull/10109#issuecomment-1813745641 ## CI report: * 01cd726aff602316f444f98e6e61bf2433fa3e95 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20931)
Re: [PR] [HUDI-7102] Fix a bug for time travel queries on MOR tables [hudi]
hudi-bot commented on PR #10102: URL: https://github.com/apache/hudi/pull/10102#issuecomment-1813745621 ## CI report: * c3ff2511a30564e5a5ff0cb407326ff6ef0584e3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20930)
Re: [PR] [HUDI-7090]Set the maxParallelism for singleton operator [hudi]
hudi-bot commented on PR #10090: URL: https://github.com/apache/hudi/pull/10090#issuecomment-1813745573 ## CI report: * 35219c2180342faea6e09987e69271508a3f0096 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20912)
Re: [PR] [HUDI-6658] inject filters for incremental query [hudi]
hudi-bot commented on PR #10063: URL: https://github.com/apache/hudi/pull/10063#issuecomment-1813745503 ## CI report: * edb9997799c672e69a5a81271f32504e270846d2 UNKNOWN * d22fcb976c5c468cb129abf9c4ee200eb249fb73 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20934) * 2c51a6c39ee41fac34110a41f943a3f1dee93f0f Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20936) * 34efaac278dde7fd73515e6d54418a6ff8815326 UNKNOWN
[PR] [MINOR] Modified description to include missing trigger strategy [hudi]
voonhous opened a new pull request, #10114: URL: https://github.com/apache/hudi/pull/10114

### Change Logs

In https://github.com/apache/hudi/pull/6144, a new compaction trigger strategy named `NUM_COMMITS_AFTER_LAST_REQUEST` was added to `org.apache.hudi.table.action.compact.CompactionTriggerStrategy`. However, the FlinkOptions description was never updated to include this new trigger strategy. This change adds it so that the configs page on the doc site reflects this trigger strategy, for completeness.

TODO: Might need to do some refactoring to centralise these common configs so that we do not have to worry about this kind of de-sync in the future. It might also make testing easier.

### Impact

None

### Risk level (write none, low medium or high below)

None

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
Re: [PR] [HUDI-7071] Throw exception when clustering/compactin job fail [hudi]
ksmou commented on PR #10050: URL: https://github.com/apache/hudi/pull/10050#issuecomment-1813740241

> Is it fixed via: #10108 ?

It's good. All services that call `UtilHelpers.retry` have similar problems. I fix the clustering/index job like this.
Re: [PR] [HUDI-7090]Set the maxParallelism for singleton operator [hudi]
hehuiyuan commented on code in PR #10090: URL: https://github.com/apache/hudi/pull/10090#discussion_r1395108190

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java: ##

@@ -207,6 +207,7 @@ public DataStream produceDataStream(StreamExecutionEnvironment execEnv)
 SingleOutputStreamOperator source = execEnv.addSource(monitoringFunction, getSourceOperatorName("split_monitor"))
 .uid(Pipelines.opUID("split_monitor", conf))
 .setParallelism(1)
+ .setMaxParallelism(1)

Review Comment: single operator. https://github.com/apache/flink/blob/012704d9884f92274495fbf6fdb7234373944212/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/table/stream/StreamingSink.java#L124
Re: [PR] [HUDI-7090]Set the maxParallelism for singleton operator [hudi]
hehuiyuan commented on PR #10090: URL: https://github.com/apache/hudi/pull/10090#issuecomment-1813732678 @hudi-bot run azure
Re: [PR] [MINOR][DNM] Add logs to test runs [hudi]
hudi-bot commented on PR #10111: URL: https://github.com/apache/hudi/pull/10111#issuecomment-1813718687 ## CI report: * 65c56d302e05ac18639929442f9b533d11f38ed5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20938)
[PR] [MINOR][DNM] Full test runtime 2 [hudi]
yihua opened a new pull request, #10113: URL: https://github.com/apache/hudi/pull/10113

### Change Logs

As above. This reverts #9260 to fix CI.

### Impact

Testing only.

### Risk level (write none, low medium or high below)

none

### Documentation Update

N/A

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
Re: [PR] [MINOR][DNM] Add logs to test runs [hudi]
hudi-bot commented on PR #10111: URL: https://github.com/apache/hudi/pull/10111#issuecomment-1813712301 ## CI report: * 65c56d302e05ac18639929442f9b533d11f38ed5 UNKNOWN
Re: [I] [SUPPORT] Can hudi support updating only specific columns ? (not rewrite base columns) [hudi]
danny0405 commented on issue #10086: URL: https://github.com/apache/hudi/issues/10086#issuecomment-1813711254 The release 1.0 doc is not released yet.
Re: [PR] [HUDI-7090]Set the maxParallelism for singleton operator [hudi]
danny0405 commented on code in PR #10090: URL: https://github.com/apache/hudi/pull/10090#discussion_r1395092204

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java: ##

@@ -207,6 +207,7 @@ public DataStream produceDataStream(StreamExecutionEnvironment execEnv)
 SingleOutputStreamOperator source = execEnv.addSource(monitoringFunction, getSourceOperatorName("split_monitor"))
 .uid(Pipelines.opUID("split_monitor", conf))
 .setParallelism(1)
+ .setMaxParallelism(1)

Review Comment: Does `setMaxParallelism` take effect at per-operator scope or global scope?
Re: [PR] [HUDI-7071] Throw exception when clustering/compactin job fail [hudi]
danny0405 commented on PR #10050: URL: https://github.com/apache/hudi/pull/10050#issuecomment-1813708197 Is it fixed via: https://github.com/apache/hudi/pull/10108 ?
Re: [PR] [HUDI-6658] inject filters for incremental query [hudi]
hudi-bot commented on PR #10063: URL: https://github.com/apache/hudi/pull/10063#issuecomment-1813707163 ## CI report: * edb9997799c672e69a5a81271f32504e270846d2 UNKNOWN * d22fcb976c5c468cb129abf9c4ee200eb249fb73 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20934) * 411f1e09cc33590a4a1f7cc93c65db083494633b Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20935) * 2c51a6c39ee41fac34110a41f943a3f1dee93f0f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20936)
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
hudi-bot commented on PR #9717: URL: https://github.com/apache/hudi/pull/9717#issuecomment-1813706840 ## CI report: * 9b8fdd2d1b69da528069e364790b53af1d6150af UNKNOWN * afe70daf89229ab3ac4153d69b511121b8a31d9e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20933)
Re: [I] [SUPPORT] Query failure due to replacecommit being archived [hudi]
danny0405 commented on issue #10107: URL: https://github.com/apache/hudi/issues/10107#issuecomment-1813704292 Should be fixed in recent releases, cc @ad1happy2go for double check.
Re: [I] [SUPPORT] hudi sql task hang java.lang.System.exit block [hudi]
danny0405 commented on issue #10112: URL: https://github.com/apache/hudi/issues/10112#issuecomment-1813702995 Not sure whether this fix is related to your issue: https://github.com/apache/hudi/pull/10108
Re: [PR] [HUDI-7102] Fix a bug for time travel queries on MOR tables [hudi]
danny0405 commented on code in PR #10102: URL: https://github.com/apache/hudi/pull/10102#discussion_r1395085791

## hudi-common/src/main/java/org/apache/hudi/common/table/log/BaseHoodieLogRecordReader.java: ##

@@ -260,7 +260,7 @@ private void scanInternalV1(Option keySpecOpt) {
 && !HoodieTimeline.compareTimestamps(logBlock.getLogBlockHeader().get(INSTANT_TIME), HoodieTimeline.LESSER_THAN_OR_EQUALS, this.latestInstantTime
 )) {
 // hit a block with instant time greater than should be processed, stop processing further
- break;
+ continue;
 }

Review Comment: The reader's upper threshold on consumption was introduced to avoid unnecessary reading of log blocks. Should we drop it? I don't think so; maybe you should just fix the threshold itself.
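The behavioural difference between `break` and `continue` in the loop under review can be illustrated with a toy scan. This is a sketch only: `scan_blocks` and the block list are hypothetical and do not reproduce `BaseHoodieLogRecordReader`. Instant times are modelled as equal-length strings compared lexicographically, and the out-of-order block list is an assumption made purely for illustration.

```python
def scan_blocks(block_instants, latest_instant, stop_at_first_newer):
    """Return the instants of the blocks a reader would process.

    stop_at_first_newer=True  mimics `break`: stop at the first block newer than the threshold.
    stop_at_first_newer=False mimics `continue`: skip newer blocks but keep scanning.
    """
    processed = []
    for instant in block_instants:
        if instant > latest_instant:
            if stop_at_first_newer:
                break
            continue
        processed.append(instant)
    return processed


# Hypothetical log file whose blocks are not sorted by instant time.
blocks = ["001", "003", "002"]

with_break = scan_blocks(blocks, latest_instant="002", stop_at_first_newer=True)
with_continue = scan_blocks(blocks, latest_instant="002", stop_at_first_newer=False)

print(with_break)     # ['001']
print(with_continue)  # ['001', '002']
```

With `break`, the scan ends early and never looks at later blocks, which saves reads but can miss an older block that appears after a newer one. With `continue`, every block is examined, which is the cost the review comment is concerned about.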
Re: [I] [SUPPORT] hudi sql task hang java.lang.System.exit block [hudi]
zyclove commented on issue #10112: URL: https://github.com/apache/hudi/issues/10112#issuecomment-1813701762

Thread 8953: (state = IN_NATIVE_TRANS)
- org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(int, org.apache.hadoop.net.unix.DomainSocketWatcher$FdSet) @bci=0 (Interpreted frame)
- org.apache.hadoop.net.unix.DomainSocketWatcher.access$900(int, org.apache.hadoop.net.unix.DomainSocketWatcher$FdSet) @bci=2, line=52 (Interpreted frame)
- org.apache.hadoop.net.unix.DomainSocketWatcher$2.run() @bci=763, line=503 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=750 (Interpreted frame)
[I] [SUPPORT] hudi sql task hang java.lang.System.exit block [hudi]
zyclove opened a new issue, #10112:
URL: https://github.com/apache/hudi/issues/10112

**Describe the problem you faced**

The SQL task finishes, but the driver sometimes cannot exit. If the same task is run many times, there is a small chance that it will exit abnormally. Tens of thousands of tasks are executed every day, and this problem has never occurred for non-Hudi Spark tasks. It has occasionally appeared several times before for Hudi tasks.

![WeCom screenshot_28aec49f-d1c0-45d0-b9e0-dc1f31b21ee0](https://github.com/apache/hudi/assets/15028279/16c6762f-afde-47ee-ac12-bb2d2c590f45)
![WeCom screenshot_4dcf7b0c-1c6b-44b7-99e7-d8d5134434a8](https://github.com/apache/hudi/assets/15028279/6b156a0f-43c4-4fa4-adaa-b75458f16a3f)
![WeCom screenshot_b8884f5e-ff16-4115-a826-f5a50b281df9](https://github.com/apache/hudi/assets/15028279/2b79c34c-694b-49ca-839b-65c7cd2c4769)

**To Reproduce**

Steps to reproduce the behavior:

1. /usr/lib/spark/bin/spark-sql --name 63130__VOLCANO_JOB_1699949768615_004319 -f /tmp/VOLCANO_JOB_1699949768615_004319.sql --master yarn --queue hadoop --driver-memory 8g --executor-memory 4G --executor-cores 2 --num-executors 8 --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.14.0 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension --conf spark.sql.autoBroadcastJoinThreshold=2G --conf spark.sql.broadcastTimeout=6 --conf spark.memory.storageFraction=0.7 --conf spark.yarn.priority=5 --conf spark.sql.adaptive.enabled=true

**Expected behavior**

A clear and concise description of what you expected to happen.

**Environment Description**

* Hudi version : 0.14.0
* Spark version : 3.2.1
* Hive version : 3.1.3
* Hadoop version : 3.2.2
* Storage (HDFS/S3/GCS..) : s3
* Running on Docker? (yes/no) : no

**Additional context**

Add any other context about the problem here.

**Stacktrace**

```
Attaching to process ID 8854, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.382-b05
Deadlock Detection:

No deadlocks found.

Thread 23860: (state = BLOCKED)
 - java.lang.Thread.sleep(long) @bci=0 (Compiled frame; information may be imprecise)
 - io.netty.util.concurrent.SingleThreadEventExecutor.confirmShutdown() @bci=153, line=787 (Interpreted frame)
 - io.netty.channel.nio.NioEventLoop.run() @bci=406, line=530 (Interpreted frame)
 - io.netty.util.concurrent.SingleThreadEventExecutor$4.run() @bci=44, line=986 (Interpreted frame)
 - io.netty.util.internal.ThreadExecutorMap$2.run() @bci=11, line=74 (Interpreted frame)
 - io.netty.util.concurrent.FastThreadLocalRunnable.run() @bci=4, line=30 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=750 (Compiled frame)

Thread 23859: (state = BLOCKED)
 - java.lang.Thread.sleep(long) @bci=0 (Compiled frame; information may be imprecise)
 - io.netty.util.concurrent.SingleThreadEventExecutor.confirmShutdown() @bci=153, line=787 (Interpreted frame)
 - io.netty.channel.nio.NioEventLoop.run() @bci=406, line=530 (Interpreted frame)
 - io.netty.util.concurrent.SingleThreadEventExecutor$4.run() @bci=44, line=986 (Interpreted frame)
 - io.netty.util.internal.ThreadExecutorMap$2.run() @bci=11, line=74 (Interpreted frame)
 - io.netty.util.concurrent.FastThreadLocalRunnable.run() @bci=4, line=30 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=750 (Compiled frame)

Thread 23858: (state = BLOCKED)
 - java.lang.Thread.sleep(long) @bci=0 (Compiled frame; information may be imprecise)
 - io.netty.util.concurrent.SingleThreadEventExecutor.confirmShutdown() @bci=153, line=787 (Interpreted frame)
 - io.netty.channel.nio.NioEventLoop.run() @bci=406, line=530 (Interpreted frame)
 - io.netty.util.concurrent.SingleThreadEventExecutor$4.run() @bci=44, line=986 (Interpreted frame)
 - io.netty.util.internal.ThreadExecutorMap$2.run() @bci=11, line=74 (Interpreted frame)
 - io.netty.util.concurrent.FastThreadLocalRunnable.run() @bci=4, line=30 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=750 (Compiled frame)

Thread 23857: (state = BLOCKED)
 - java.lang.Thread.sleep(long) @bci=0 (Compiled frame; information may be imprecise)
 - io.netty.util.concurrent.SingleThreadEventExecutor.confirmShutdown() @bci=153, line=787 (Interpreted frame)
 - io.netty.channel.nio.NioEventLoop.run() @bci=406, line=530 (Interpreted frame)
 - io.netty.util.concurrent.SingleThreadEventExecutor$4.run() @bci=44, line=986 (Interpreted frame)
 - io.netty.util.internal.ThreadExecutorMap$2.run() @bci=11, line=74 (Interpreted frame)
 - io.netty.util.concurrent.FastThreadLocalRunnable.run() @bci=4, line=30 (Interpreted frame)
 - java.lang
```
Re: [PR] [MINOR] CLAZZ_CACHE get should be synchonized avoid thread safe problem [hudi]
danny0405 closed pull request #9788: [MINOR] CLAZZ_CACHE get should be synchonized avoid thread safe problem URL: https://github.com/apache/hudi/pull/9788
Re: [PR] [MINOR] CLAZZ_CACHE get should be synchonized avoid thread safe problem [hudi]
danny0405 commented on PR #9788: URL: https://github.com/apache/hudi/pull/9788#issuecomment-1813699169 Closing because it has been fixed via https://github.com/apache/hudi/pull/9786.
[PR] [MINOR][DNM] Add logs to test runs [hudi]
yihua opened a new pull request, #10111: URL: https://github.com/apache/hudi/pull/10111 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
(hudi) branch master updated: [Minor] Throw exceptions when cleaner/compactor fail (#10108)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 35af64db466 [Minor] Throw exceptions when cleaner/compactor fail (#10108)
35af64db466 is described below

commit 35af64db46668115dc7c9cd9b05844819cb1157e
Author: Shawn Chang <42792772+c...@users.noreply.github.com>
AuthorDate: Wed Nov 15 18:36:42 2023 -0800

    [Minor] Throw exceptions when cleaner/compactor fail (#10108)

    Co-authored-by: Shawn Chang
---
 .../main/java/org/apache/hudi/utilities/HoodieCleaner.java | 13 +++--
 .../java/org/apache/hudi/utilities/HoodieCompactor.java | 13 -
 2 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCleaner.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCleaner.java
index 53b80e55b25..49aed0b 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCleaner.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCleaner.java
@@ -26,6 +26,7 @@ import org.apache.hudi.config.HoodieWriteConfig;
 import com.beust.jcommander.JCommander;
 import com.beust.jcommander.Parameter;
 import org.apache.hadoop.fs.Path;
+import org.apache.hudi.exception.HoodieException;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -103,28 +104,20 @@ public class HoodieCleaner {
     JCommander cmd = new JCommander(cfg, null, args);
     if (cfg.help || args.length == 0) {
       cmd.usage();
-      System.exit(1);
+      throw new HoodieException("Failed to run cleaning for " + cfg.basePath);
     }
 
     String dirName = new Path(cfg.basePath).getName();
     JavaSparkContext jssc = UtilHelpers.buildSparkContext("hoodie-cleaner-" + dirName, cfg.sparkMaster);
-    boolean success = true;
 
     try {
       new HoodieCleaner(cfg, jssc).run();
     } catch (Throwable throwable) {
-      success = false;
-      LOG.error("Failed to run cleaning for " + cfg.basePath, throwable);
+      throw new HoodieException("Failed to run cleaning for " + cfg.basePath, throwable);
     } finally {
       jssc.stop();
     }
 
-    if (!success) {
-      // Return a non-zero exit code to properly notify any resource manager
-      // that cleaning was not successful
-      System.exit(1);
-    }
-
     LOG.info("Cleaner ran successfully");
   }
 }
diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java
index 9b03cb7a724..c8bdf0da3a0 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java
@@ -29,6 +29,7 @@ import org.apache.hudi.common.table.timeline.HoodieInstant;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.config.HoodieCleanConfig;
+import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.table.action.HoodieWriteMetadata;
 import org.apache.hudi.table.action.compact.strategy.LogFileSizeBasedCompactionStrategy;
 
@@ -168,18 +169,20 @@ public class HoodieCompactor {
     JCommander cmd = new JCommander(cfg, null, args);
     if (cfg.help || args.length == 0) {
       cmd.usage();
-      System.exit(1);
+      throw new HoodieException("Fail to run compaction for " + cfg.tableName + ", return code: " + 1);
     }
     final JavaSparkContext jsc = UtilHelpers.buildSparkContext("compactor-" + cfg.tableName, cfg.sparkMaster, cfg.sparkMemory);
     int ret = 0;
     try {
-      HoodieCompactor compactor = new HoodieCompactor(jsc, cfg);
-      ret = compactor.compact(cfg.retry);
+      ret = new HoodieCompactor(jsc, cfg).compact(cfg.retry);
     } catch (Throwable throwable) {
-      LOG.error("Fail to run compaction for " + cfg.tableName, throwable);
+      throw new HoodieException("Fail to run compaction for " + cfg.tableName + ", return code: " + ret, throwable);
     } finally {
       jsc.stop();
-      System.exit(ret);
+    }
+
+    if (ret != 0) {
+      throw new HoodieException("Fail to run compaction for " + cfg.tableName + ", return code: " + ret);
     }
   }
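The pattern in this commit can be read in isolation as "run the job, always release resources, and signal failure by throwing rather than by calling System.exit". The sketch below is a minimal illustration of that shape only. `ToolFailedException`, `ToolRunner`, `Job`, and `runOrThrow` are hypothetical names introduced here, standing in for `HoodieException` and the utilities' `main` methods; they are not part of Hudi.

```java
// Illustrative sketch only: every name in this block is hypothetical.
class ToolFailedException extends RuntimeException {
  ToolFailedException(String message) {
    super(message);
  }

  ToolFailedException(String message, Throwable cause) {
    super(message, cause);
  }
}

class ToolRunner {
  /** A job that reports success with a zero return code. */
  interface Job {
    int run() throws Exception;
  }

  /**
   * Runs the job, always runs the cleanup, and throws on failure
   * instead of terminating the JVM from inside the method.
   */
  static void runOrThrow(String name, Job job, Runnable cleanup) {
    int ret = 0;
    try {
      ret = job.run();
    } catch (Throwable throwable) {
      // Wrap the original failure so the caller still sees the cause.
      throw new ToolFailedException("Fail to run " + name + ", return code: " + ret, throwable);
    } finally {
      // Executes on both the success path and the failure path.
      cleanup.run();
    }
    if (ret != 0) {
      throw new ToolFailedException("Fail to run " + name + ", return code: " + ret);
    }
  }
}
```

Because the failure travels as an exception, the caller (or the JVM's default handler for an uncaught exception in `main`) decides how the process ends, and the cleanup step is not skipped by an early exit.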
Re: [PR] [MINOR] Throw exceptions when cleaner/compactor fail [hudi]
danny0405 merged PR #10108: URL: https://github.com/apache/hudi/pull/10108
Re: [PR] [MINOR] Throw exceptions when cleaner/compactor fail [hudi]
danny0405 commented on PR #10108: URL: https://github.com/apache/hudi/pull/10108#issuecomment-1813697919 The failure is not relevant: https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=20928&view=logs&j=dcedfe73-9485-5cc5-817a-73b61fc5dcb0&t=746585d8-b50a-55c3-26c5-517d93af9934&l=14572
Re: [PR] [HUDI-6658] inject filters for incremental query [hudi]
hudi-bot commented on PR #10063: URL: https://github.com/apache/hudi/pull/10063#issuecomment-1813667907 ## CI report: * edb9997799c672e69a5a81271f32504e270846d2 UNKNOWN * 97424b66af6de869a7feba00c6e8c24f80eb90a4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20927) * d22fcb976c5c468cb129abf9c4ee200eb249fb73 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20934) * 411f1e09cc33590a4a1f7cc93c65db083494633b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20935) * 2c51a6c39ee41fac34110a41f943a3f1dee93f0f UNKNOWN
Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]
stream2000 commented on code in PR #10101:
URL: https://github.com/apache/hudi/pull/10101#discussion_r1395063596

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java: ##
@@ -255,48 +277,57 @@ private void updateCommitTimingMetrics(long commitEpochTimeInMs, long durationIn
       Pair<Option<Long>, Option<Long>> eventTimePairMinMax = metadata.getMinAndMaxEventTime();
       if (eventTimePairMinMax.getLeft().isPresent()) {
         long commitLatencyInMs = commitEpochTimeInMs + durationInMs - eventTimePairMinMax.getLeft().get();
-        metrics.registerGauge(getMetricsName(actionType, "commitLatencyInMs"), commitLatencyInMs);
+        metrics.registerGauge(getMetricsName(actionType, COMMIT_LATENCY_STR), commitLatencyInMs);
       }
       if (eventTimePairMinMax.getRight().isPresent()) {
         long commitFreshnessInMs = commitEpochTimeInMs + durationInMs - eventTimePairMinMax.getRight().get();
-        metrics.registerGauge(getMetricsName(actionType, "commitFreshnessInMs"), commitFreshnessInMs);
+        metrics.registerGauge(getMetricsName(actionType, COMMIT_FRESHNESS_STR), commitFreshnessInMs);
       }
-      metrics.registerGauge(getMetricsName(actionType, "commitTime"), commitEpochTimeInMs);
-      metrics.registerGauge(getMetricsName(actionType, "duration"), durationInMs);
+      metrics.registerGauge(getMetricsName(actionType, COMMIT_TIME_STR), commitEpochTimeInMs);
+      metrics.registerGauge(getMetricsName(actionType, DURATION_STR), durationInMs);
     }
   }

   public void updateRollbackMetrics(long durationInMs, long numFilesDeleted) {
     if (config.isMetricsOn()) {
       LOG.info(
           String.format("Sending rollback metrics (duration=%d, numFilesDeleted=%d)", durationInMs, numFilesDeleted));
-      metrics.registerGauge(getMetricsName("rollback", "duration"), durationInMs);
-      metrics.registerGauge(getMetricsName("rollback", "numFilesDeleted"), numFilesDeleted);
+      metrics.registerGauge(getMetricsName(HoodieTimeline.ROLLBACK_ACTION, DURATION_STR), durationInMs);
+      metrics.registerGauge(getMetricsName(HoodieTimeline.ROLLBACK_ACTION, DELETE_FILES_NUM_STR), numFilesDeleted);
     }
   }

   public void updateCleanMetrics(long durationInMs, int numFilesDeleted) {
     if (config.isMetricsOn()) {
       LOG.info(
           String.format("Sending clean metrics (duration=%d, numFilesDeleted=%d)", durationInMs, numFilesDeleted));
-      metrics.registerGauge(getMetricsName("clean", "duration"), durationInMs);
-      metrics.registerGauge(getMetricsName("clean", "numFilesDeleted"), numFilesDeleted);
+      metrics.registerGauge(getMetricsName(HoodieTimeline.CLEAN_ACTION, DURATION_STR), durationInMs);
+      metrics.registerGauge(getMetricsName(HoodieTimeline.CLEAN_ACTION, DELETE_FILES_NUM_STR), numFilesDeleted);
+    }
+  }
+
+  public void updateArchiveMetrics(long durationInMs, int numFilesDeleted) {
+    if (config.isMetricsOn()) {
+      LOG.info(
+          String.format("Sending archive metrics (duration=%d, numFilesDeleted=%d)", durationInMs, numFilesDeleted));
+      metrics.registerGauge(getMetricsName(ARCHIVE_ACTION, DURATION_STR), durationInMs);
+      metrics.registerGauge(getMetricsName(ARCHIVE_ACTION, DELETE_FILES_NUM_STR), numFilesDeleted);
     }
   }

   public void updateFinalizeWriteMetrics(long durationInMs, long numFilesFinalized) {
     if (config.isMetricsOn()) {
       LOG.info(String.format("Sending finalize write metrics (duration=%d, numFilesFinalized=%d)", durationInMs, numFilesFinalized));
-      metrics.registerGauge(getMetricsName("finalize", "duration"), durationInMs);
-      metrics.registerGauge(getMetricsName("finalize", "numFilesFinalized"), numFilesFinalized);
+      metrics.registerGauge(getMetricsName(FINALIZE_ACTION, DURATION_STR), durationInMs);
+      metrics.registerGauge(getMetricsName(FINALIZE_ACTION, FINALIZED_FILES_NUM_STR), numFilesFinalized);
     }
   }

   public void updateIndexMetrics(final String action, final long durationInMs) {
     if (config.isMetricsOn()) {
       LOG.info(String.format("Sending index metrics (%s.duration, %d)", action, durationInMs));

Review Comment:
We can also update the string literal in the log here

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java: ##
@@ -92,20 +106,21 @@ public HoodieMetrics(HoodieWriteConfig config) {
     this.tableName = config.getTableName();
     if (config.isMetricsOn()) {
       metrics = Metrics.getInstance(config);
-      this.rollbackTimerName = getMetricsName("timer", HoodieTimeline.ROLLBACK_ACTION);
-      this.cleanTimerName = getMetricsName("timer", HoodieTimeline.CLEAN_ACTION);
-      this.commitTimerName = getMetricsName("timer", HoodieTimeline.COMMIT_ACTION);
-      this.deltaCommitTimerName = getMetricsName("timer", HoodieTimeline.DELTA_COMMIT_ACTION);
-      this.replaceCommitTimerName = getMetricsName("tim
Re: [PR] [HUDI-6658] inject filters for incremental query [hudi]
hudi-bot commented on PR #10063: URL: https://github.com/apache/hudi/pull/10063#issuecomment-1813653979 ## CI report: * edb9997799c672e69a5a81271f32504e270846d2 UNKNOWN * 97424b66af6de869a7feba00c6e8c24f80eb90a4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20927) * d22fcb976c5c468cb129abf9c4ee200eb249fb73 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20934) * 411f1e09cc33590a4a1f7cc93c65db083494633b UNKNOWN
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
hudi-bot commented on PR #9717: URL: https://github.com/apache/hudi/pull/9717#issuecomment-1813653007 ## CI report: * 9b8fdd2d1b69da528069e364790b53af1d6150af UNKNOWN * 017a37588ccb55c0df8a98a48a251146256d9406 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20929) * afe70daf89229ab3ac4153d69b511121b8a31d9e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20933)
Re: [PR] [HUDI-6658] inject filters for incremental query [hudi]
hudi-bot commented on PR #10063: URL: https://github.com/apache/hudi/pull/10063#issuecomment-1813638155 ## CI report: * edb9997799c672e69a5a81271f32504e270846d2 UNKNOWN * 97424b66af6de869a7feba00c6e8c24f80eb90a4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20927) * d22fcb976c5c468cb129abf9c4ee200eb249fb73 UNKNOWN
[jira] [Created] (HUDI-7104) Cleaner could miss to clean up some files w/ savepoint interplay
sivabalan narayanan created HUDI-7104:
-----------------------------------------

             Summary: Cleaner could miss to clean up some files w/ savepoint interplay
                 Key: HUDI-7104
                 URL: https://issues.apache.org/jira/browse/HUDI-7104
             Project: Apache Hudi
          Issue Type: Improvement
          Components: cleaning
            Reporter: sivabalan narayanan

Let's say partitioning is day-based and keyed on the created date, so older partitions generally do not get any new data after a few days.

Let's say a savepoint was added to one day and later removed:

day1: cleaned up.
day2: savepoint added, so the cleaner ignored it.
day3: cleaned up.
day4: earliest commit to retain, based on cleaner configs.

With this table/timeline state, if we remove the savepointed commit, data pertaining to day2 will never be cleaned by the cleaner, since it is earlier than the earliest commit to retain.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
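The day1 to day4 scenario in the ticket can be replayed with a toy model. The sketch below assumes, as one reading of the ticket and not as a description of Hudi's actual clean planner, that each clean run only visits commits between the previous earliest-commit-to-retain and the new one, and skips savepointed commits. `SavepointCleanSketch` and all of its members are hypothetical names, and instants are modelled as fixed-width strings that sort lexicographically.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: not Hudi code. It mimics a cleaner that advances a watermark.
class SavepointCleanSketch {
  private final Map<String, String> commitByPartition; // partition -> commit that wrote it
  private final Set<String> savepointedCommits = new HashSet<>();
  private final Set<String> cleanedPartitions = new TreeSet<>();
  private String lastEarliestToRetain; // watermark left behind by the previous clean

  SavepointCleanSketch(Map<String, String> commitByPartition, String startInstant) {
    this.commitByPartition = commitByPartition;
    this.lastEarliestToRetain = startInstant;
  }

  void addSavepoint(String commit) {
    savepointedCommits.add(commit);
  }

  void removeSavepoint(String commit) {
    savepointedCommits.remove(commit);
  }

  /** Visits only commits in [previous watermark, earliestCommitToRetain). */
  void clean(String earliestCommitToRetain) {
    for (Map.Entry<String, String> entry : commitByPartition.entrySet()) {
      String commit = entry.getValue();
      boolean inWindow = commit.compareTo(lastEarliestToRetain) >= 0
          && commit.compareTo(earliestCommitToRetain) < 0;
      if (inWindow && !savepointedCommits.contains(commit)) {
        cleanedPartitions.add(entry.getKey());
      }
    }
    // The watermark moves forward even if a savepointed commit was skipped.
    lastEarliestToRetain = earliestCommitToRetain;
  }

  Set<String> cleanedPartitions() {
    return cleanedPartitions;
  }
}
```

Under these assumptions, running a clean with the savepoint on the day2 commit, removing the savepoint, and cleaning again leaves day2 out of the cleaned set: the window has already moved past it, which is the gap the ticket describes.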
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
hudi-bot commented on PR #9717: URL: https://github.com/apache/hudi/pull/9717#issuecomment-1813637324 ## CI report: * 9b8fdd2d1b69da528069e364790b53af1d6150af UNKNOWN * 017a37588ccb55c0df8a98a48a251146256d9406 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20929) * afe70daf89229ab3ac4153d69b511121b8a31d9e UNKNOWN
Re: [PR] [HUDI-6658] inject filters for incremental query [hudi]
hudi-bot commented on PR #10063: URL: https://github.com/apache/hudi/pull/10063#issuecomment-1813623184 ## CI report: * edb9997799c672e69a5a81271f32504e270846d2 UNKNOWN * 97424b66af6de869a7feba00c6e8c24f80eb90a4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20927)
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
hudi-bot commented on PR #9717: URL: https://github.com/apache/hudi/pull/9717#issuecomment-1813622358 ## CI report: * 9b8fdd2d1b69da528069e364790b53af1d6150af UNKNOWN * 017a37588ccb55c0df8a98a48a251146256d9406 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20929)
Re: [PR] [HUDI-7102] Fix a bug for time travel queries on MOR tables [hudi]
hudi-bot commented on PR #10102: URL: https://github.com/apache/hudi/pull/10102#issuecomment-1813546968 ## CI report: * c3ff2511a30564e5a5ff0cb407326ff6ef0584e3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20930)
Re: [PR] [HUDI-7103] Support time travel queies for COW tables [hudi]
hudi-bot commented on PR #10109: URL: https://github.com/apache/hudi/pull/10109#issuecomment-1813547074 ## CI report: * 01cd726aff602316f444f98e6e61bf2433fa3e95 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20931)
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
hudi-bot commented on PR #9717: URL: https://github.com/apache/hudi/pull/9717#issuecomment-1813545833 ## CI report: * 9b8fdd2d1b69da528069e364790b53af1d6150af UNKNOWN * af280647acca3e0cbf9f52c7bbe189f326cd8df6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20926) * 017a37588ccb55c0df8a98a48a251146256d9406 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20929)
Re: [PR] [HUDI-7103] Support time travel queies for COW tables [hudi]
hudi-bot commented on PR #10109: URL: https://github.com/apache/hudi/pull/10109#issuecomment-1813536258 ## CI report: * 01cd726aff602316f444f98e6e61bf2433fa3e95 UNKNOWN
[jira] [Updated] (HUDI-7102) A bug for the time travel queries for MOR tables
[ https://issues.apache.org/jira/browse/HUDI-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7102:
---------------------------------
    Labels: pull-request-available  (was: )

> A bug for the time travel queries for MOR tables
> ------------------------------------------------
>
>                 Key: HUDI-7102
>                 URL: https://issues.apache.org/jira/browse/HUDI-7102
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Lin Liu
>            Assignee: Lin Liu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
> Issue:
> # Based on the provided TIMESTAMP_AS_OF, a list of file slices is returned. However, these file slices are selected by their base file timestamp, so they may contain log files whose timestamps are higher than the provided timestamp.
> # As a result, when we try to merge the logs in reverse order, we may see these unqualified log files first, which triggers the "break" operation, and no merging is done.
>
> Solution:
> # The first solution is to filter the log files as well as the base files for the file slices.
> # The second solution is to skip these unqualified log files and keep merging.
>
> Risk:
> * Not sure if new bugs would be introduced by changing the current behavior.
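The difference between the failing behavior and the second proposed solution comes down to `break` versus `continue` in a reverse scan. The sketch below is a simplified illustration under stated assumptions: log files are represented only by their instant strings, the instants are fixed-width so they compare lexicographically, and "merging" is modelled as collecting the instants that would be merged. `TimeTravelMergeSketch` and its methods are hypothetical names, not Hudi's file group reader.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration: not Hudi code.
class TimeTravelMergeSketch {
  /** Scans newest-first and stops at the first log file newer than asOf. */
  static List<String> mergeWithBreak(List<String> logInstants, String asOf) {
    List<String> merged = new ArrayList<>();
    for (int i = logInstants.size() - 1; i >= 0; i--) {
      String instant = logInstants.get(i);
      if (instant.compareTo(asOf) > 0) {
        break; // an unqualified log file seen first ends the scan entirely
      }
      merged.add(instant);
    }
    return merged;
  }

  /** Scans newest-first and skips log files newer than asOf, then keeps going. */
  static List<String> mergeWithSkip(List<String> logInstants, String asOf) {
    List<String> merged = new ArrayList<>();
    for (int i = logInstants.size() - 1; i >= 0; i--) {
      String instant = logInstants.get(i);
      if (instant.compareTo(asOf) > 0) {
        continue; // ignore the unqualified log file and keep merging older ones
      }
      merged.add(instant);
    }
    return merged;
  }
}
```

With log instants `001`, `002`, `003` and an as-of instant of `002`, the first method merges nothing because `003` is seen first, while the second merges `002` and then `001`. Filtering the list up front (the first proposed solution) would give the second method's result as well.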
Re: [PR] [HUDI-7102] Fix a bug for time travel queries on MOR tables [hudi]
hudi-bot commented on PR #10102: URL: https://github.com/apache/hudi/pull/10102#issuecomment-1813536200 ## CI report: * c3ff2511a30564e5a5ff0cb407326ff6ef0584e3 UNKNOWN
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
hudi-bot commented on PR #9717: URL: https://github.com/apache/hudi/pull/9717#issuecomment-1813535629 ## CI report: * 9b8fdd2d1b69da528069e364790b53af1d6150af UNKNOWN * af280647acca3e0cbf9f52c7bbe189f326cd8df6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20926) * 017a37588ccb55c0df8a98a48a251146256d9406 UNKNOWN
Re: [PR] [HUDI-5936] Fix serialization problem when FileStatus is not serializable [hudi]
yihua merged PR #10065: URL: https://github.com/apache/hudi/pull/10065
(hudi) branch master updated (dcd5a8182a1 -> bada5d91a8d)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

    from dcd5a8182a1 [HUDI-7069] Optimize metaclient construction and include table config options (#10048)
     add bada5d91a8d [HUDI-5936] Fix serialization problem when FileStatus is not serializable (#10065)

No new revisions were added by this update.

Summary of changes:
 .../hudi/common/fs/NonSerializableFileSystem.java  | 115
 .../fs/TestHoodieSerializableFileStatus.java       |  86
 .../common/fs/HoodieSerializableFileStatus.java    | 144 +
 .../metadata/FileSystemBackedTableMetadata.java    |  28 ++--
 4 files changed, 361 insertions(+), 12 deletions(-)
 create mode 100644 hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/common/fs/NonSerializableFileSystem.java
 create mode 100644 hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/common/fs/TestHoodieSerializableFileStatus.java
 create mode 100644 hudi-common/src/main/java/org/apache/hudi/common/fs/HoodieSerializableFileStatus.java
Re: [PR] [MINOR] Throw exceptions when cleaner/compactor fail [hudi]
hudi-bot commented on PR #10108: URL: https://github.com/apache/hudi/pull/10108#issuecomment-1813529460 ## CI report: * 0165912015447a8ce331afa757ff764809113b9e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20928)
[jira] [Updated] (HUDI-7103) Enable Time travel queries for COW
[ https://issues.apache.org/jira/browse/HUDI-7103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7103:
---------------------------------
    Labels: pull-request-available  (was: )

> Enable Time travel queries for COW
> ----------------------------------
>
>                 Key: HUDI-7103
>                 URL: https://issues.apache.org/jira/browse/HUDI-7103
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Lin Liu
>            Assignee: Lin Liu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
> The goal of this task is to enable time travel queries for COW tables based on HadoopFsRelation.
[PR] [HUDI-7103] Support time travel queies for COW tables [hudi]
linliu-code opened a new pull request, #10109: URL: https://github.com/apache/hudi/pull/10109 ### Change Logs This is based on HadoopFsRelation and the new file format and file group reader. ### Impact Time travel queries should be more stable. ### Risk level (write none, low medium or high below) LOW since this is for 1.0.0. ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Created] (HUDI-7103) Enable Time travel queries for COW
Lin Liu created HUDI-7103: - Summary: Enable Time travel queries for COW Key: HUDI-7103 URL: https://issues.apache.org/jira/browse/HUDI-7103 Project: Apache Hudi Issue Type: Task Reporter: Lin Liu Assignee: Lin Liu Fix For: 1.0.0 The goal of this task is to enable time travel queries for COW tables based on HadoopFsRelation.
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
yihua commented on code in PR #9717: URL: https://github.com/apache/hudi/pull/9717#discussion_r1394831451 ## .github/workflows/bot.yml: ## @@ -284,29 +294,33 @@ jobs: matrix: include: - flinkProfile: 'flink1.17' -sparkProfile: 'spark3.4' -sparkRuntime: 'spark3.4.0' - - flinkProfile: 'flink1.17' -sparkProfile: 'spark3.3' -sparkRuntime: 'spark3.3.2' - - flinkProfile: 'flink1.16' -sparkProfile: 'spark3.3' -sparkRuntime: 'spark3.3.2' - - flinkProfile: 'flink1.15' -sparkProfile: 'spark3.3' -sparkRuntime: 'spark3.3.1' - - flinkProfile: 'flink1.14' -sparkProfile: 'spark3.2' -sparkRuntime: 'spark3.2.3' - - flinkProfile: 'flink1.13' -sparkProfile: 'spark3.1' -sparkRuntime: 'spark3.1.3' - - flinkProfile: 'flink1.14' -sparkProfile: 'spark3.0' -sparkRuntime: 'spark3.0.2' - - flinkProfile: 'flink1.13' -sparkProfile: 'spark2.4' -sparkRuntime: 'spark2.4.8' +sparkProfile: 'spark3.5' +sparkRuntime: 'spark3.5.0' +# - flinkProfile: 'flink1.17' +#sparkProfile: 'spark3.4' +#sparkRuntime: 'spark3.4.0' +# - flinkProfile: 'flink1.17' +#sparkProfile: 'spark3.3' +#sparkRuntime: 'spark3.3.2' +# - flinkProfile: 'flink1.16' +#sparkProfile: 'spark3.3' +#sparkRuntime: 'spark3.3.2' +# - flinkProfile: 'flink1.15' +#sparkProfile: 'spark3.3' +#sparkRuntime: 'spark3.3.1' +# - flinkProfile: 'flink1.14' +#sparkProfile: 'spark3.2' +#sparkRuntime: 'spark3.2.3' +# - flinkProfile: 'flink1.13' +#sparkProfile: 'spark3.1' +#sparkRuntime: 'spark3.1.3' +# - flinkProfile: 'flink1.14' +#sparkProfile: 'spark3.0' +#sparkRuntime: 'spark3.0.2' +# - flinkProfile: 'flink1.13' +#sparkProfile: 'spark2.4' +#sparkRuntime: 'spark2.4.8' + Review Comment: I've built and uploaded the bundle validation image `apachehudi/hudi-ci-bundle-validation-base:flink1180hive313spark350`. It's ready for use now.
[jira] [Updated] (HUDI-7102) A bug for the time travel queries for MOR tables
[ https://issues.apache.org/jira/browse/HUDI-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu updated HUDI-7102: -- Description: Issue: # Based on the provided TIMESTAMP_AS_OF, a list of file slices are returned. However, these file slices that are returned are based on their base file timestamp. That means, these slices may contain log files whose timestamps are higher than the provided timestamp. # Such that, when we try to merge the logs in the reverse order, we may see these unqualified log files first, which triggers the "break" operation, and no merging will be done. Solution: # The first solution is to filter the log files as well as the base files for the file slices. # The second solution is to skip these unqualified log files, and keep merging. Risk: * 1. Not sure if new bugs would be introduced by changing the current behavior. was: The issue is: # Based on the provided TIMESTAMP_AS_OF, a list of file slices are returned. However, these file slices that are returned are based on their base file timestamp. That means, these slices may contain log files whose timestamps are higher than the provided timestamp. # Such that, when we try to merge the logs in the reverse order, we may see these unqualified log files first, which triggers the "break" operation, and no merging will be done. Solution: # The first solution is to filter the log files as well as the base files for the file slices. But not sure if any other logic will be affected. # The second solution is to skip these unqualified log files, and keep merging. Not sure if any existing processing logic are based on this "break" logic. > A bug for the time travel queries for MOR tables > > > Key: HUDI-7102 > URL: https://issues.apache.org/jira/browse/HUDI-7102 > Project: Apache Hudi > Issue Type: Task >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > Fix For: 1.0.0 > > > Issue: > # Based on the provided TIMESTAMP_AS_OF, a list of file slices are returned. 
> However, these file slices that are returned are based on their base file > timestamp. That means, these slices may contain log files whose timestamps > are higher than the provided timestamp. > # Such that, when we try to merge the logs in the reverse order, we may see > these unqualified log files first, which triggers the "break" operation, and > no merging will be done. > > Solution: > # The first solution is to filter the log files as well as the base files > for the file slices. > # The second solution is to skip these unqualified log files, and keep > merging. > > Risk: > * 1. Not sure if new bugs would be introduced by changing the current > behavior.
[jira] [Updated] (HUDI-7102) A bug for the time travel queries for MOR tables
[ https://issues.apache.org/jira/browse/HUDI-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu updated HUDI-7102: -- Description: Issue: # Based on the provided TIMESTAMP_AS_OF, a list of file slices are returned. However, these file slices that are returned are based on their base file timestamp. That means, these slices may contain log files whose timestamps are higher than the provided timestamp. # Such that, when we try to merge the logs in the reverse order, we may see these unqualified log files first, which triggers the "break" operation, and no merging will be done. Solution: # The first solution is to filter the log files as well as the base files for the file slices. # The second solution is to skip these unqualified log files, and keep merging. Risk: * Not sure if new bugs would be introduced by changing the current behavior. was: Issue: # Based on the provided TIMESTAMP_AS_OF, a list of file slices are returned. However, these file slices that are returned are based on their base file timestamp. That means, these slices may contain log files whose timestamps are higher than the provided timestamp. # Such that, when we try to merge the logs in the reverse order, we may see these unqualified log files first, which triggers the "break" operation, and no merging will be done. Solution: # The first solution is to filter the log files as well as the base files for the file slices. # The second solution is to skip these unqualified log files, and keep merging. Risk: * 1. Not sure if new bugs would be introduced by changing the current behavior. > A bug for the time travel queries for MOR tables > > > Key: HUDI-7102 > URL: https://issues.apache.org/jira/browse/HUDI-7102 > Project: Apache Hudi > Issue Type: Task >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > Fix For: 1.0.0 > > > Issue: > # Based on the provided TIMESTAMP_AS_OF, a list of file slices are returned. 
> However, these file slices that are returned are based on their base file > timestamp. That means, these slices may contain log files whose timestamps > are higher than the provided timestamp. > # Such that, when we try to merge the logs in the reverse order, we may see > these unqualified log files first, which triggers the "break" operation, and > no merging will be done. > > Solution: > # The first solution is to filter the log files as well as the base files > for the file slices. > # The second solution is to skip these unqualified log files, and keep > merging. > > Risk: > * Not sure if new bugs would be introduced by changing the current behavior.
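The behavior described in HUDI-7102 can be illustrated with a small sketch. This is not Hudi code: `FileSlice`, `merge_with_break`, `merge_with_filter`, and `merge_with_skip` are hypothetical names, and instants are assumed to be fixed-width strings so that plain string comparison matches time order. It only models which log files would take part in a merge for a given TIMESTAMP_AS_OF value.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileSlice:
    """Hypothetical stand-in for a file slice: one base file plus its log files."""
    base_instant: str
    log_instants: List[str] = field(default_factory=list)


def merge_with_break(file_slice: FileSlice, as_of: str) -> List[str]:
    """Behavior described in the issue: scan logs newest-first and stop
    at the first log file that is newer than the requested timestamp."""
    merged = []
    for instant in sorted(file_slice.log_instants, reverse=True):
        if instant > as_of:
            break  # an unqualified log file ends the scan; older logs are never merged
        merged.append(instant)
    return merged


def merge_with_filter(file_slice: FileSlice, as_of: str) -> List[str]:
    """First proposed solution: filter out unqualified log files up front."""
    qualified = [i for i in file_slice.log_instants if i <= as_of]
    return sorted(qualified, reverse=True)


def merge_with_skip(file_slice: FileSlice, as_of: str) -> List[str]:
    """Second proposed solution: skip unqualified log files and keep merging."""
    merged = []
    for instant in sorted(file_slice.log_instants, reverse=True):
        if instant > as_of:
            continue  # ignore the newer log file instead of stopping
        merged.append(instant)
    return merged


# The slice qualifies because its base file (001) is not newer than as_of (004),
# but it also carries a log file (005) written after the requested timestamp.
example = FileSlice(base_instant="001", log_instants=["002", "003", "005"])
print(merge_with_break(example, "004"))
print(merge_with_filter(example, "004"))
print(merge_with_skip(example, "004"))
```

In this toy model both proposed solutions select the same log files; the difference in a real reader would be where the filtering happens (when file slices are built versus inside the log scan), which is the risk the issue raises.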
Re: [PR] [HUDI-6702] Fix a bug for time travel queries on MOR tables [hudi]
linliu-code commented on PR #10102: URL: https://github.com/apache/hudi/pull/10102#issuecomment-1813516391 > for bug fixes, we should have the Jira filed and call out the scenarios where bugs could happen. Can you please file one and add details on what exact issue we are running into. @linliu-code Also, is it possible to add tests. This change fixed existing broken tests.
[jira] [Updated] (HUDI-7102) A bug for the time travel queries for MOR tables
[ https://issues.apache.org/jira/browse/HUDI-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu updated HUDI-7102: -- Summary: A bug for the time travel queries for MOR tables (was: Fixed a bug for the time travel queries for MOR tables) > A bug for the time travel queries for MOR tables > > > Key: HUDI-7102 > URL: https://issues.apache.org/jira/browse/HUDI-7102 > Project: Apache Hudi > Issue Type: Task >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > Fix For: 1.0.0 > > > The issue is: > # Based on the provided TIMESTAMP_AS_OF, a list of file slices are returned. > However, these file slices that are returned are based on their base file > timestamp. That means, these slices may contain log files whose timestamps > are higher than the provided timestamp. > # Such that, when we try to merge the logs in the reverse order, we may see > these unqualified log files first, which triggers the "break" operation, and > no merging will be done. > > Solution: > # The first solution is to filter the log files as well as the base files > for the file slices. But not sure if any other logic will be affected. > # The second solution is to skip these unqualified log files, and keep > merging. Not sure if any existing processing logic are based on this "break" > logic.
[jira] [Assigned] (HUDI-7102) Fixed a bug for the time travel queries for MOR tables
[ https://issues.apache.org/jira/browse/HUDI-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu reassigned HUDI-7102: - Assignee: Lin Liu > Fixed a bug for the time travel queries for MOR tables > -- > > Key: HUDI-7102 > URL: https://issues.apache.org/jira/browse/HUDI-7102 > Project: Apache Hudi > Issue Type: Task >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > Fix For: 1.0.0 > > > The issue is: > # Based on the provided TIMESTAMP_AS_OF, a list of file slices are returned. > However, these file slices that are returned are based on their base file > timestamp. That means, these slices may contain log files whose timestamps > are higher than the provided timestamp. > # Such that, when we try to merge the logs in the reverse order, we may see > these unqualified log files first, which triggers the "break" operation, and > no merging will be done. > > Solution: > # The first solution is to filter the log files as well as the base files > for the file slices. But not sure if any other logic will be affected. > # The second solution is to skip these unqualified log files, and keep > merging. Not sure if any existing processing logic are based on this "break" > logic.