Re: [PR] [HUDI-7819] Fix OptionsResolver#allowCommitOnEmptyBatch default value bug [hudi]
hudi-bot commented on PR #11370: URL: https://github.com/apache/hudi/pull/11370#issuecomment-2141282627 ## CI report: * dcf9a4a7947b75943814493f528b90b68ee2b9aa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24160) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7819] Fix OptionsResolver#allowCommitOnEmptyBatch default value bug [hudi]
hudi-bot commented on PR #11370: URL: https://github.com/apache/hudi/pull/11370#issuecomment-2141237293 ## CI report: * dcf9a4a7947b75943814493f528b90b68ee2b9aa Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24160)
Re: [PR] [HUDI-7819] Fix OptionsResolver#allowCommitOnEmptyBatch default value bug [hudi]
danny0405 commented on code in PR #11370: URL: https://github.com/apache/hudi/pull/11370#discussion_r1621717283 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java: ## @@ -388,7 +388,7 @@ public static ConflictResolutionStrategy getConflictResolutionStrategy(Configura * Returns whether to commit even when current batch has no data, for flink defaults false */ public static boolean allowCommitOnEmptyBatch(Configuration conf) { -return conf.getBoolean(HoodieWriteConfig.ALLOW_EMPTY_COMMIT.key(), false); +return conf.getBoolean(HoodieWriteConfig.ALLOW_EMPTY_COMMIT.key(), HoodieWriteConfig.ALLOW_EMPTY_COMMIT.defaultValue()); Review Comment: you can run ITTestHoodieDataSource in a local env and make it pass first.
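The one-line fix under review follows a general pattern: a resolver helper should fall back to the option's declared `defaultValue()` rather than a hardcoded literal, so the two cannot silently drift apart (here, `false` was hardcoded while `hoodie.allow.empty.commit` declares `true`). A minimal, self-contained sketch of the pattern — the `ConfigProperty` class below is a simplified stand-in for Hudi's, not the real API:

```java
import java.util.Map;

public class ConfigDefaultSketch {
  // Simplified stand-in for Hudi's ConfigProperty<T> (the real class lives in hudi-common).
  static final class ConfigProperty<T> {
    final String key;
    final T defaultValue;
    ConfigProperty(String key, T defaultValue) { this.key = key; this.defaultValue = defaultValue; }
  }

  // hoodie.allow.empty.commit declares a default of true (per the PR description).
  static final ConfigProperty<Boolean> ALLOW_EMPTY_COMMIT =
      new ConfigProperty<>("hoodie.allow.empty.commit", true);

  // Buggy variant: hardcodes "false", diverging from the declared default.
  static boolean allowCommitOnEmptyBatchBuggy(Map<String, String> conf) {
    return Boolean.parseBoolean(conf.getOrDefault(ALLOW_EMPTY_COMMIT.key, "false"));
  }

  // Fixed variant: falls back to the option's own defaultValue, as in the PR.
  static boolean allowCommitOnEmptyBatchFixed(Map<String, String> conf) {
    String v = conf.get(ALLOW_EMPTY_COMMIT.key);
    return v != null ? Boolean.parseBoolean(v) : ALLOW_EMPTY_COMMIT.defaultValue;
  }

  public static void main(String[] args) {
    Map<String, String> conf = Map.of(); // the user has not set the option
    System.out.println(allowCommitOnEmptyBatchBuggy(conf)); // false (wrong)
    System.out.println(allowCommitOnEmptyBatchFixed(conf)); // true (matches the declared default)
  }
}
```

An explicit override still wins in both variants; the difference only shows when the option is unset, which is exactly the case the bug report describes.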
Re: [PR] [HUDI-7819] Fix OptionsResolver#allowCommitOnEmptyBatch default value bug [hudi]
hudi-bot commented on PR #11370: URL: https://github.com/apache/hudi/pull/11370#issuecomment-2141231803 ## CI report: * dcf9a4a7947b75943814493f528b90b68ee2b9aa UNKNOWN
Re: [PR] [HUDI-7817] Use Jackson Core instead of org.codehaus.jackson for JSON encoding [hudi]
yihua merged PR #11369: URL: https://github.com/apache/hudi/pull/11369
(hudi) branch master updated: [HUDI-7817] Use Jackson Core instead of org.codehaus.jackson for JSON encoding (#11369)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 0e55f0900d8 [HUDI-7817] Use Jackson Core instead of org.codehaus.jackson for JSON encoding (#11369)

commit 0e55f0900d80b64398d9e8d50b32e8e1680df9f0
Author: Y Ethan Guo
AuthorDate: Thu May 30 21:39:23 2024 -0700

    [HUDI-7817] Use Jackson Core instead of org.codehaus.jackson for JSON encoding (#11369)
---
 hudi-common/src/main/java/org/apache/hudi/avro/JsonEncoder.java | 8 ++++----
 style/checkstyle.xml                                            | 3 ++-
 style/scalastyle.xml                                            | 8 +-------
 3 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/avro/JsonEncoder.java b/hudi-common/src/main/java/org/apache/hudi/avro/JsonEncoder.java
index 86d6a6ad9e2..01b44ead24f 100644
--- a/hudi-common/src/main/java/org/apache/hudi/avro/JsonEncoder.java
+++ b/hudi-common/src/main/java/org/apache/hudi/avro/JsonEncoder.java
@@ -19,6 +19,10 @@
 package org.apache.hudi.avro;

+import com.fasterxml.jackson.core.JsonEncoding;
+import com.fasterxml.jackson.core.JsonFactory;
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.core.util.MinimalPrettyPrinter;
 import org.apache.avro.AvroTypeException;
 import org.apache.avro.Schema;
 import org.apache.avro.io.Encoder;
@@ -27,10 +31,6 @@
 import org.apache.avro.io.parsing.JsonGrammarGenerator;
 import org.apache.avro.io.parsing.Parser;
 import org.apache.avro.io.parsing.Symbol;
 import org.apache.avro.util.Utf8;
-import org.codehaus.jackson.JsonEncoding;
-import org.codehaus.jackson.JsonFactory;
-import org.codehaus.jackson.JsonGenerator;
-import org.codehaus.jackson.util.MinimalPrettyPrinter;

 import java.io.IOException;
 import java.io.OutputStream;

[The diffs to style/checkstyle.xml and style/scalastyle.xml were XML rule changes whose markup did not survive extraction in this archive.]
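The commit above only swaps import packages: the class names (`JsonFactory`, `JsonGenerator`, `JsonEncoding`, `MinimalPrettyPrinter`) and call shapes are the same in Jackson Core 2.x as in the old `org.codehaus.jackson` packages. A minimal, self-contained sketch of the migrated API, assuming `jackson-core` 2.x is on the classpath (the sample fields are illustrative, not taken from `JsonEncoder`):

```java
import com.fasterxml.jackson.core.JsonEncoding;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.core.util.MinimalPrettyPrinter;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class JacksonCoreDemo {
  // Creates a JsonGenerator the same way the migrated code can: only the
  // import package changed (org.codehaus.jackson -> com.fasterxml.jackson.core).
  static String writeSample() throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    JsonGenerator gen = new JsonFactory().createGenerator(out, JsonEncoding.UTF8);
    // MinimalPrettyPrinter adds no indentation; it mirrors the import used by JsonEncoder.
    gen.setPrettyPrinter(new MinimalPrettyPrinter());
    gen.writeStartObject();
    gen.writeStringField("engine", "hudi");   // illustrative field
    gen.writeNumberField("prNumber", 11369);  // illustrative field
    gen.writeEndObject();
    gen.flush();
    gen.close();
    return out.toString("UTF-8");
  }

  public static void main(String[] args) throws IOException {
    System.out.println(writeSample());
  }
}
```

Because the two APIs are call-compatible, a migration like this is mostly mechanical, which is why the commit touches only imports plus style rules that forbid the old package.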
Re: [PR] [HUDI-7810] Fix OptionsResolver#allowCommitOnEmptyBatch default value… [hudi]
usberkeley commented on PR #11359: URL: https://github.com/apache/hudi/pull/11359#issuecomment-2141196482 There are many conflicts between my local branch and the remote; that was my mistake. To keep the PR history clean, I opened a new PR: https://github.com/apache/hudi/pull/11370
Re: [PR] [HUDI-7810] Fix OptionsResolver#allowCommitOnEmptyBatch default value… [hudi]
usberkeley closed pull request #11359: [HUDI-7810] Fix OptionsResolver#allowCommitOnEmptyBatch default value… URL: https://github.com/apache/hudi/pull/11359
[jira] [Updated] (HUDI-7819) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
[ https://issues.apache.org/jira/browse/HUDI-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7819: - Labels: pull-request-available (was: ) > Fix OptionsResolver#allowCommitOnEmptyBatch default value bug > - > > Key: HUDI-7819 > URL: https://issues.apache.org/jira/browse/HUDI-7819 > Project: Apache Hudi > Issue Type: Bug >Reporter: bradley >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7819] Fix OptionsResolver#allowCommitOnEmptyBatch default value bug [hudi]
usberkeley opened a new pull request, #11370: URL: https://github.com/apache/hudi/pull/11370 ### Change Logs OptionsResolver#allowCommitOnEmptyBatch has a hardcoded default value of "false", while ALLOW_EMPTY_COMMIT (hoodie.allow.empty.commit) defaults to "true", so this function returns the wrong default value. In addition, TestHoodieFlinkQuickstart was modified to avoid being affected by empty commits (hoodie.allow.empty.commit=true). ### Impact none ### Risk level (write none, low medium or high below) none ### Documentation Update none ### Contributor's checklist - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [x] Change Logs and Impact were stated clearly - [x] Adequate tests were added if applicable - [x] CI passed
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
KnightChess commented on code in PR #11043: URL: https://github.com/apache/hudi/pull/11043#discussion_r1621639791 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestBloomFiltersIndexSupport.scala: ## @@ -0,0 +1,261 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.functional + +import org.apache.hudi.DataSourceWriteOptions._ +import org.apache.hudi.common.config.{HoodieMetadataConfig, TypedProperties} +import org.apache.hudi.common.model.{FileSlice, HoodieTableType} +import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient} +import org.apache.hudi.common.testutils.RawTripTestPayload.recordsToStrings +import org.apache.hudi.config.HoodieWriteConfig +import org.apache.hudi.metadata.HoodieMetadataFileSystemView +import org.apache.hudi.testutils.HoodieSparkClientTestBase +import org.apache.hudi.util.{JFunction, JavaConversions} +import org.apache.hudi.{DataSourceReadOptions, DataSourceWriteOptions, HoodieFileIndex} +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, Expression, Literal} +import org.apache.spark.sql.functions.{col, not} +import org.apache.spark.sql.types.StringType +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession} +import org.junit.jupiter.api.Assertions.{assertEquals, assertTrue} +import org.junit.jupiter.api.{AfterEach, BeforeEach, Test} +import org.junit.jupiter.params.ParameterizedTest +import org.junit.jupiter.params.provider.EnumSource + +import java.util.concurrent.atomic.AtomicInteger +import java.util.stream.Collectors +import scala.collection.JavaConverters._ +import scala.collection.{JavaConverters, mutable} + +class TestBloomFiltersIndexSupport extends HoodieSparkClientTestBase { + + val sqlTempTable = "hudi_tbl_bloom" + var spark: SparkSession = _ + var instantTime: AtomicInteger = _ + val metadataOpts: Map[String, String] = Map( +HoodieMetadataConfig.ENABLE.key -> "true", +HoodieMetadataConfig.ENABLE_METADATA_INDEX_BLOOM_FILTER.key -> "true", +HoodieMetadataConfig.BLOOM_FILTER_INDEX_FOR_COLUMNS.key -> "_row_key" + ) + val commonOpts: Map[String, String] = Map( +"hoodie.insert.shuffle.parallelism" -> "4", +"hoodie.upsert.shuffle.parallelism" -> "4", +HoodieWriteConfig.TBL_NAME.key -> "hoodie_test", 
+RECORDKEY_FIELD.key -> "_row_key", +PARTITIONPATH_FIELD.key -> "partition", +PRECOMBINE_FIELD.key -> "timestamp", +HoodieTableConfig.POPULATE_META_FIELDS.key -> "true" + ) ++ metadataOpts + var mergedDfList: List[DataFrame] = List.empty + + @BeforeEach + override def setUp(): Unit = { +initPath() +initSparkContexts() +initHoodieStorage() +initTestDataGenerator() + +setTableName("hoodie_test") +initMetaClient() + +instantTime = new AtomicInteger(1) + +spark = sqlContext.sparkSession + } + + @AfterEach + override def tearDown(): Unit = { +cleanupFileSystem() +cleanupSparkContexts() + } + + @ParameterizedTest + @EnumSource(classOf[HoodieTableType]) + def testIndexInitialization(tableType: HoodieTableType): Unit = { +val hudiOpts = commonOpts + (DataSourceWriteOptions.TABLE_TYPE.key -> tableType.name()) +doWriteAndValidateBloomFilters( + hudiOpts, + operation = DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, + saveMode = SaveMode.Overwrite) + } + + /** + * Test case to do a write with updates and then validate file pruning using bloom filters. + */ + @Test + def testBloomFiltersIndexFilePruning(): Unit = { +var hudiOpts = commonOpts +hudiOpts = hudiOpts + ( + DataSourceReadOptions.ENABLE_DATA_SKIPPING.key -> "true") + +doWriteAndValidateBloomFilters( + hudiOpts, + operation = DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, + saveMode = SaveMode.Overwrite, + shouldValidate = false) +doWriteAndValidateBloomFilters( + hudiOpts, + operation = DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL, + saveMode = SaveMode.Append) + +createTempTable(hudiOpts) +verifyQueryPredicate(hudiOpts) + } + + private def createTempTable(hudiOpts: Map[String, String]): Unit = { +val readDf =
Re: [PR] [HUDI-7810] Fix OptionsResolver#allowCommitOnEmptyBatch default value… [hudi]
hudi-bot commented on PR #11359: URL: https://github.com/apache/hudi/pull/11359#issuecomment-2141151906 ## CI report: * 4b149d9085498be66c6426b0c3fde90ddf382cec UNKNOWN * c8b14bd35eb233306750d8b31780d3da8ba2547d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24157)
Re: [PR] [HUDI-7810] Fix OptionsResolver#allowCommitOnEmptyBatch default value… [hudi]
hudi-bot commented on PR #11359: URL: https://github.com/apache/hudi/pull/11359#issuecomment-2141146409 ## CI report: * 4b149d9085498be66c6426b0c3fde90ddf382cec UNKNOWN * 9ce101ca9d0c194af5b31b533c83fb21549ca8d3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24146) * 0bc90bdc0865275eb0e3650a5bc82c3b3d65d11f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24156) * c8b14bd35eb233306750d8b31780d3da8ba2547d UNKNOWN
[jira] [Closed] (HUDI-7810) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
[ https://issues.apache.org/jira/browse/HUDI-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bradley closed HUDI-7810. - Resolution: Later > Fix OptionsResolver#allowCommitOnEmptyBatch default value bug > - > > Key: HUDI-7810 > URL: https://issues.apache.org/jira/browse/HUDI-7810 > Project: Apache Hudi > Issue Type: Bug >Reporter: bradley >Priority: Major > Labels: pull-request-available > > Fixed in PR: [https://github.com/apache/hudi/pull/11359]
[jira] [Created] (HUDI-7819) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
bradley created HUDI-7819: - Summary: Fix OptionsResolver#allowCommitOnEmptyBatch default value bug Key: HUDI-7819 URL: https://issues.apache.org/jira/browse/HUDI-7819 Project: Apache Hudi Issue Type: Bug Reporter: bradley
Re: [PR] [HUDI-7810] Fix OptionsResolver#allowCommitOnEmptyBatch default value… [hudi]
hudi-bot commented on PR #11359: URL: https://github.com/apache/hudi/pull/11359#issuecomment-2141115466 ## CI report: * 4b149d9085498be66c6426b0c3fde90ddf382cec UNKNOWN * 9ce101ca9d0c194af5b31b533c83fb21549ca8d3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24146) * 0bc90bdc0865275eb0e3650a5bc82c3b3d65d11f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24156)
Re: [PR] [HUDI-7810] Fix OptionsResolver#allowCommitOnEmptyBatch default value… [hudi]
hudi-bot commented on PR #11359: URL: https://github.com/apache/hudi/pull/11359#issuecomment-2141109747 ## CI report: * 4b149d9085498be66c6426b0c3fde90ddf382cec UNKNOWN * 9ce101ca9d0c194af5b31b533c83fb21549ca8d3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24146) * 0bc90bdc0865275eb0e3650a5bc82c3b3d65d11f UNKNOWN
Re: [PR] [MINOR][TESTING][DNM] Validating 0.15.0 RC2 bundles [hudi]
yihua closed pull request #11340: [MINOR][TESTING][DNM] Validating 0.15.0 RC2 bundles URL: https://github.com/apache/hudi/pull/11340
Re: [PR] [MINOR][Test][DNM] Test Azure CI on branch-0.x [hudi]
yihua closed pull request #10766: [MINOR][Test][DNM] Test Azure CI on branch-0.x URL: https://github.com/apache/hudi/pull/10766
[jira] [Updated] (HUDI-7818) Flink Table planner not loading problem
[ https://issues.apache.org/jira/browse/HUDI-7818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7818: - Sprint: Sprint 2023-04-26 > Flink Table planner not loading problem > --- > > Key: HUDI-7818 > URL: https://issues.apache.org/jira/browse/HUDI-7818 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Fix For: 1.0.0 > >
[jira] [Created] (HUDI-7818) Flink Table planner not loading problem
Danny Chen created HUDI-7818: Summary: Flink Table planner not loading problem Key: HUDI-7818 URL: https://issues.apache.org/jira/browse/HUDI-7818 Project: Apache Hudi Issue Type: Improvement Components: writer-core Reporter: Danny Chen Assignee: Danny Chen Fix For: 1.0.0
Re: [PR] [HUDI-7817] Use Jackson Core instead of org.codehaus.jackson for JSON encoding [hudi]
hudi-bot commented on PR #11369: URL: https://github.com/apache/hudi/pull/11369#issuecomment-2140985222 ## CI report: * 1718840e241dd32dc4c11885ba2bf1311bf822ec Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24155)
Re: [I] [SUPPORT]How to improve the speed of Flink writing to hudi ? [hudi]
HuangZhenQiu commented on issue #8071: URL: https://github.com/apache/hudi/issues/8071#issuecomment-2140979488 @danny0405 Do we have any best practices (COW and MOR) for Flink ingestion into Hudi?
Re: [PR] [HUDI-7817] Use Jackson Core instead of org.codehaus.jackson for JSON encoding [hudi]
hudi-bot commented on PR #11369: URL: https://github.com/apache/hudi/pull/11369#issuecomment-2140935788 ## CI report: * 1718840e241dd32dc4c11885ba2bf1311bf822ec Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24155)
Re: [PR] [HUDI-7816]: Provide SourceProfileSupplier option into the SnapshotLoadQuerySplitter [hudi]
hudi-bot commented on PR #11368: URL: https://github.com/apache/hudi/pull/11368#issuecomment-2140935768 ## CI report: * 1dde761d4147e9c1a94914759ca0bfd0f7d23ec7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24154)
Re: [PR] [HUDI-7817] Use Jackson Core instead of org.codehaus.jackson for JSON encoding [hudi]
hudi-bot commented on PR #11369: URL: https://github.com/apache/hudi/pull/11369#issuecomment-2140928147 ## CI report: * 1718840e241dd32dc4c11885ba2bf1311bf822ec UNKNOWN
Re: [PR] [HUDI-7814] Exclude unused transitive dependencies that introduce vulnerabilities [hudi]
hudi-bot commented on PR #11364: URL: https://github.com/apache/hudi/pull/11364#issuecomment-2140920017 ## CI report: * ff1e3d8a934fe1a2c92e341be610516476bf5d7a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24153)
[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding
[ https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7817: - Labels: pull-request-available (was: ) > Use Jackson Core instead of org.codehaus.jackson for JSON encoding > -- > > Key: HUDI-7817 > URL: https://issues.apache.org/jira/browse/HUDI-7817 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > org.codehaus.jackson is a older version of Jackson Core > (com.fasterxml.jackson.core:jackson-core). > org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which > should be avoided.
[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding
[ https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7817: Description: org.codehaus.jackson is a older version of Jackson Core (com.fasterxml.jackson.core:jackson-core). org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which should be avoided. (was: org.codehaus.jackson is a older version of Jackson Core (com.fasterxml.jackson.core:jackson-core). org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which should be avoid.) > Use Jackson Core instead of org.codehaus.jackson for JSON encoding > -- > > Key: HUDI-7817 > URL: https://issues.apache.org/jira/browse/HUDI-7817 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > > org.codehaus.jackson is a older version of Jackson Core > (com.fasterxml.jackson.core:jackson-core). > org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which > should be avoided.
[PR] [HUDI-7817] Use Jackson Core instead of org.codehaus.jackson for JSON encoding [hudi]
yihua opened a new pull request, #11369: URL: https://github.com/apache/hudi/pull/11369 ### Change Logs `org.codehaus.jackson` is an older version of Jackson Core (`com.fasterxml.jackson.core:jackson-core`). `org.codehaus.jackson:jackson-mapper-asl` has critical vulnerabilities which should be avoided. This PR changes `JsonEncoder` to use Jackson Core and adds rules to check illegal imports of `org.codehaus.jackson`. ### Impact Unifies usage of JSON encoding. ### Risk level none ### Documentation Update none ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
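The "rules to check illegal imports" mentioned in the PR body can be expressed with Checkstyle's `IllegalImport` module. The exact rule values committed to `style/checkstyle.xml` are not preserved in this archive, so the fragment below is an illustrative sketch, not the committed change:

```xml
<!-- Hypothetical sketch: fails the build on any import of the legacy Jackson package. -->
<module name="IllegalImport">
  <property name="illegalPkgs" value="org.codehaus.jackson"/>
</module>
```

A rule like this keeps the migration from regressing: any future `import org.codehaus.jackson.*` breaks the style check instead of silently reintroducing the vulnerable dependency.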
[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding
[ https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7817: Description: org.codehaus.jackson is a older version of Jackson Core (com.fasterxml.jackson.core:jackson-core). org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which should be avoid. (was: org.codehaus.jackson is a older version of ) > Use Jackson Core instead of org.codehaus.jackson for JSON encoding > -- > > Key: HUDI-7817 > URL: https://issues.apache.org/jira/browse/HUDI-7817 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > > org.codehaus.jackson is a older version of Jackson Core > (com.fasterxml.jackson.core:jackson-core). > org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which > should be avoid.
[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding
[ https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7817: Description: org.codehaus.jackson is a older version of > Use Jackson Core instead of org.codehaus.jackson for JSON encoding > -- > > Key: HUDI-7817 > URL: https://issues.apache.org/jira/browse/HUDI-7817 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > > org.codehaus.jackson is a older version of
[jira] [Assigned] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding
[ https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-7817: --- Assignee: Ethan Guo > Use Jackson Core instead of org.codehaus.jackson for JSON encoding > -- > > Key: HUDI-7817 > URL: https://issues.apache.org/jira/browse/HUDI-7817 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > >
[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding
[ https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7817: Fix Version/s: 1.0.0 > Use Jackson Core instead of org.codehaus.jackson for JSON encoding > -- > > Key: HUDI-7817 > URL: https://issues.apache.org/jira/browse/HUDI-7817 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Priority: Major > Fix For: 1.0.0 > >
[jira] [Created] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding
Ethan Guo created HUDI-7817: --- Summary: Use Jackson Core instead of org.codehaus.jackson for JSON encoding Key: HUDI-7817 URL: https://issues.apache.org/jira/browse/HUDI-7817 Project: Apache Hudi Issue Type: Improvement Reporter: Ethan Guo
Re: [PR] [HUDI-7816]: Provide SourceProfileSupplier option into the SnapshotLoadQuerySplitter [hudi]
hudi-bot commented on PR #11368: URL: https://github.com/apache/hudi/pull/11368#issuecomment-2140868406 ## CI report: * 1dde761d4147e9c1a94914759ca0bfd0f7d23ec7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24154) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7814] Exclude unused transitive dependencies that introduce vulnerabilities [hudi]
hudi-bot commented on PR #11364: URL: https://github.com/apache/hudi/pull/11364#issuecomment-2140858167 ## CI report: * 3337f90b44d58d07c8a4055c9544f0e957d93226 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24138) * ff1e3d8a934fe1a2c92e341be610516476bf5d7a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24153)
Re: [PR] [HUDI-7816]: Add SourceProfileSupplier option to SnapshotLoadQuerySplitter [hudi]
hudi-bot commented on PR #11368: URL: https://github.com/apache/hudi/pull/11368#issuecomment-2140858256 ## CI report: * 1dde761d4147e9c1a94914759ca0bfd0f7d23ec7 UNKNOWN
Re: [PR] [HUDI-7814] Exclude unused transitive dependencies that introduce vulnerabilities [hudi]
hudi-bot commented on PR #11364: URL: https://github.com/apache/hudi/pull/11364#issuecomment-2140848183 ## CI report: * 3337f90b44d58d07c8a4055c9544f0e957d93226 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24138) * ff1e3d8a934fe1a2c92e341be610516476bf5d7a UNKNOWN
[jira] [Updated] (HUDI-7816) Pass the source profile to the snapshot query splitter
[ https://issues.apache.org/jira/browse/HUDI-7816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7816: - Labels: pull-request-available (was: ) > Pass the source profile to the snapshot query splitter > -- > > Key: HUDI-7816 > URL: https://issues.apache.org/jira/browse/HUDI-7816 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Rajesh Mahindra >Priority: Minor > Labels: pull-request-available
[PR] [HUDI-7816]: Add SourceProfileSupplier option to SnapshotLoadQuerySplitter [hudi]
mattwong949 opened a new pull request, #11368: URL: https://github.com/apache/hudi/pull/11368 ### Change Logs Expand the SnapshotLoadQuerySplitter interface to accept a SourceProfileSupplier option. ### Impact Some SnapshotLoadQuerySplitter implementations may want to consider a SourceProfileSupplier in their logic, allowing source estimations to be used when splitting queries. ### Risk level (write none, low medium or high below) Low; a small change to the API with no logic change within Hudi itself. ### Documentation Update Updated javadocs for the modified interface ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Created] (HUDI-7816) Pass the source profile to the snapshot query splitter
Rajesh Mahindra created HUDI-7816: - Summary: Pass the source profile to the snapshot query splitter Key: HUDI-7816 URL: https://issues.apache.org/jira/browse/HUDI-7816 Project: Apache Hudi Issue Type: Improvement Reporter: Rajesh Mahindra
(hudi) branch master updated (c758508b62f -> db7480820e3)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

    from c758508b62f [HUDI-7769] Fix Hudi CDC read on Spark 3.3.4 and 3.4.3 (#11242)
     add db7480820e3 [MINOR] Fix GitHub CI concurrency (#11361)

No new revisions were added by this update.

Summary of changes:
 .github/workflows/bot.yml                          | 1 +
 .github/workflows/release_candidate_validation.yml | 4 ----
 2 files changed, 1 insertion(+), 4 deletions(-)
Re: [PR] [MINOR] Fix GitHub CI concurrency [hudi]
yihua merged PR #11361: URL: https://github.com/apache/hudi/pull/11361
Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]
hudi-bot commented on PR #11162: URL: https://github.com/apache/hudi/pull/11162#issuecomment-2140765294 ## CI report: * 3c52961bdbcb210e4c7140f5939143cfda7adb50 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24151)
Re: [PR] [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server [hudi]
Gatsby-Lee commented on PR #8079: URL: https://github.com/apache/hudi/pull/8079#issuecomment-2140708258
(hudi) branch branch-0.x updated: [MINOR] Fix GitHub CI concurrency (#11362)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/branch-0.x by this push:
     new 70094deb391 [MINOR] Fix GitHub CI concurrency (#11362)
70094deb391 is described below

commit 70094deb391f612c13babd3cdf49dd88ebb0eec0
Author: Y Ethan Guo
AuthorDate: Thu May 30 11:26:39 2024 -0700

    [MINOR] Fix GitHub CI concurrency (#11362)
---
 .github/workflows/bot.yml                          | 1 +
 .github/workflows/release_candidate_validation.yml | 4 ----
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/.github/workflows/bot.yml b/.github/workflows/bot.yml
index 951eecdcc57..72200c4822d 100644
--- a/.github/workflows/bot.yml
+++ b/.github/workflows/bot.yml
@@ -25,6 +25,7 @@ on:
 
 concurrency:
   group: ${{ github.ref }}
+  cancel-in-progress: ${{ !contains(github.ref, 'master') && !contains(github.ref, 'branch-0.x') && !contains(github.ref, 'release-') }}
 
 env:
   MVN_ARGS: -e -ntp -B -V -Dgpg.skip -Djacoco.skip -Pwarn-log -Dorg.slf4j.simpleLogger.log.org.apache.maven.plugins.shade=warn -Dorg.slf4j.simpleLogger.log.org.apache.maven.plugins.dependency=warn -Dmaven.wagon.httpconnectionManager.ttlSeconds=25 -Dmaven.wagon.http.retryHandler.count=5
diff --git a/.github/workflows/release_candidate_validation.yml b/.github/workflows/release_candidate_validation.yml
index a952ba782e5..b9b668cc80b 100644
--- a/.github/workflows/release_candidate_validation.yml
+++ b/.github/workflows/release_candidate_validation.yml
@@ -8,10 +8,6 @@ on:
     branches:
       - 'release-*'
 
-concurrency:
-  group: ${{ github.ref }}
-  cancel-in-progress: ${{ !contains(github.ref, 'master') }}
-
 env:
   MVN_ARGS: -e -ntp -B -V -Dgpg.skip -Djacoco.skip -Pwarn-log -Dorg.slf4j.simpleLogger.log.org.apache.maven.plugins.shade=warn -Dorg.slf4j.simpleLogger.log.org.apache.maven.plugins.dependency=warn -Dmaven.wagon.httpconnectionManager.ttlSeconds=25 -Dmaven.wagon.http.retryHandler.count=5
   SPARK_COMMON_MODULES: hudi-spark-datasource/hudi-spark,hudi-spark-datasource/hudi-spark-common
Re: [PR] [MINOR] Fix GitHub CI concurrency [hudi]
yihua merged PR #11362: URL: https://github.com/apache/hudi/pull/11362
[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits
[ https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7779: -- Description:

Archiving commits from the active timeline can lead to data consistency issues on rare occasions. We should put proper guards in place to ensure we do not make such unintended archivals. The major gap we want to guard against: if someone disables the cleaner, archival should account for data consistency issues and bail out.

We have a base guarding condition, where archival stops at the earliest commit to retain based on the latest clean commit metadata. But a few other scenarios also need to be accounted for.

a. Setting replace commits aside, let's dive into the specifics for regular commits and delta commits. Say the user configured cleaning to retain 4 commits and archival configs to 5 and 6. After t10, the cleaner is supposed to clean up all file versions created at or before t6. Say the cleaner did not run (for whatever reason) for the next 5 commits. Archival will certainly be guarded until the earliest commit to retain based on the latest clean commit. Corner case to consider: a savepoint was added at, say, t3 and later removed, and the cleaner was never re-enabled. Even though archival would have stopped at t3 while the savepoint was present, once the savepoint is removed, executing archival could archive commit t3, which means the file versions tracked at t3 have still not been cleaned by the cleaner. Reasoning: we are good here with respect to data consistency. Until the cleaner next runs, those older file versions might be exposed to the end user; but time travel queries are not intended for already-cleaned-up commits, so this is not an issue. None of the snapshot, time travel, or incremental queries will run into issues, as they are not supposed to poll for t3. At any later point, if the cleaner is re-enabled, it will take care of cleaning up the file versions tracked at commit t3. Just that, for the interim period, some older file versions might still be exposed to readers.

b. The trickier part is when replace commits are involved. Since the replace commit metadata in the active timeline is what ensures the replaced file groups are ignored for reads, the cleaner is expected to clean those file groups up fully before that metadata is archived. But could this go wrong? Corner case to consider: adding to the scenario above, t3 has a savepoint, and t4 is a replace commit that replaced file groups tracked in t3. The cleaner will skip cleaning up the files tracked by t3 (due to the presence of the savepoint) but will clean up t4, t5, and t6, so the earliest commit to retain will point to t6. Now say the savepoint for t3 is removed while the cleaner stays disabled. In this state of the timeline, if archival is executed (since t3's savepoint is removed), it might archive t3 and t4.rc. This could lead to data duplicates, as both the replaced file groups and the new file groups from t4.rc would be exposed as valid file groups.

In other words, to summarize the scenarios:
i. The replaced file group is never cleaned up: ECTR (earliest commit to retain) is less than this.rc, and we are good.
ii. The replaced file group is cleaned up: ECTR is greater than this.rc, and it is good to archive.
iii. Tricky: ECTR moved ahead of this.rc, but due to a savepoint, full cleanup did not happen. After the savepoint is removed, when archival is executed, we should avoid archiving the rc of interest. This is the gap we do not account for as of now.

We have 3 options to solve this.

Option A: Let the savepoint deletion flow take care of cleaning up the files it is tracking. Cons: a savepoint's responsibility does not include removing data files, so from a single-responsibility standpoint this may not be right. Also, this cleanup might need to do what a clean planner would do: build the file system view, determine whether a file is already supposed to be cleaned up, and only then clean up those files. For example, a file group with only one file slice should not be cleaned up, and there are more scenarios like this.

Option B: Since archival is what might cause the data consistency issue, have archival do the cleanup. We would need to account for concurrent cleans, failure and retry scenarios, etc. We might also need to build the file system view and decide whether something needs to be cleaned up before archiving it. Cons: again, the single-responsibility rule might be broken. It would be neater if the cleaner took care of deleting data files and archival only took care of deleting/archiving timeline files.

Option C: Similar to how the cleaner maintains EarliestCommitToRetain, let the cleaner track another piece of metadata named "EarliestCommitToArchive". Strictly speaking, earliest commit to
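The base guard described in this issue (archive nothing at or after the earliest commit to retain, and nothing savepointed) can be sketched as follows. This is an illustrative simplification, not Hudi's actual TimelineArchiver API: `canArchive` and its parameter names are hypothetical, and instant times are assumed to be lexicographically ordered strings. It models only the base condition, so it deliberately exhibits the scenario-iii gap: a replaced-but-uncleaned commit becomes archivable the moment its savepoint is removed.

```java
import java.util.Set;

public class ArchivalGuardSketch {
    // Base guard: an instant may be archived only if it is strictly earlier
    // than the earliest commit to retain (from the latest clean metadata)
    // and is not pinned by a savepoint.
    static boolean canArchive(String instant, String earliestCommitToRetain, Set<String> savepoints) {
        return instant.compareTo(earliestCommitToRetain) < 0 && !savepoints.contains(instant);
    }

    public static void main(String[] args) {
        // t3 is savepointed: archival must skip it even though t3 < t6.
        System.out.println(canArchive("t3", "t6", Set.of("t3"))); // prints false
        // Once the savepoint is removed, the base guard alone lets t3 be
        // archived -- exactly the gap scenario iii describes.
        System.out.println(canArchive("t3", "t6", Set.of())); // prints true
        // Instants at or after the earliest commit to retain never archive.
        System.out.println(canArchive("t7", "t6", Set.of())); // prints false
    }
}
```

The sketch makes it easy to see why Option C adds a second watermark: a correct guard needs information ("has the cleaner fully processed this replace commit?") that the savepoint set and earliest-commit-to-retain alone do not carry.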
Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]
hudi-bot commented on PR #11162: URL: https://github.com/apache/hudi/pull/11162#issuecomment-2140514821 ## CI report: * a602c9c4234062e66877fc4bf2c50f94f43767bc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24142) * 3c52961bdbcb210e4c7140f5939143cfda7adb50 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24151)
Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]
hudi-bot commented on PR #11162: URL: https://github.com/apache/hudi/pull/11162#issuecomment-2140488986 ## CI report: * a602c9c4234062e66877fc4bf2c50f94f43767bc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24142) * 3c52961bdbcb210e4c7140f5939143cfda7adb50 UNKNOWN
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2140461564 ## CI report: * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN * 475a1bc220eaee04fa78ba46a922b434b8306047 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24150)
Re: [I] [SUPPORT] Spark-Hudi: Unable to perform Hard delete using Pyspark on HUDI table from AWS Glue [hudi]
soumilshah1995 commented on issue #11349: URL: https://github.com/apache/hudi/issues/11349#issuecomment-2140440102 Good to hear that your issue is resolved. Cheers!
Re: [I] [SUPPORT] Spark-Hudi: Unable to perform Hard delete using Pyspark on HUDI table from AWS Glue [hudi]
Ssv-21 commented on issue #11349: URL: https://github.com/apache/hudi/issues/11349#issuecomment-2140322503 Actually, I was using the native Glue-based Hudi. But after going through your Blogspot post, I tried the Hudi 0.14.0 / Spark 3.3 bundle jar, and it worked. I believe something is wrong with the Glue-native Hudi, and it is better to provide the jars than to use the native version. Thank you, Soumil, for your suggestions.
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2140301377 ## CI report: * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN * 63737caa30a0ba2ccc66b05bbeb3005d185eb4b7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24148) * 475a1bc220eaee04fa78ba46a922b434b8306047 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24150)
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2140271817 ## CI report: * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN * 540d122ed1f6c9ee56730ec85fde9f0355b5d67a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23935) * 63737caa30a0ba2ccc66b05bbeb3005d185eb4b7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24148) * 475a1bc220eaee04fa78ba46a922b434b8306047 UNKNOWN
(hudi) branch master updated: [HUDI-7769] Fix Hudi CDC read on Spark 3.3.4 and 3.4.3 (#11242)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new c758508b62f [HUDI-7769] Fix Hudi CDC read on Spark 3.3.4 and 3.4.3 (#11242)
c758508b62f is described below

commit c758508b62f0617ac95e33a490dde62cc897ab3a
Author: Y Ethan Guo
AuthorDate: Thu May 30 09:29:00 2024 -0700

    [HUDI-7769] Fix Hudi CDC read on Spark 3.3.4 and 3.4.3 (#11242)
---
 .../src/main/scala/org/apache/hudi/cdc/CDCRelation.scala | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/CDCRelation.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/CDCRelation.scala
index 311383a9c32..f298efc8ed4 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/CDCRelation.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/CDCRelation.scala
@@ -27,6 +27,7 @@ import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
 import org.apache.hudi.exception.HoodieException
 import org.apache.hudi.internal.schema.InternalSchema
 import org.apache.hudi.{AvroConversionUtils, DataSourceReadOptions, HoodieDataSourceHelper, HoodieTableSchema}
+
 import org.apache.spark.internal.Logging
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.catalyst.InternalRow
@@ -50,6 +51,8 @@ class CDCRelation(
     options: Map[String, String]
   ) extends BaseRelation with PrunedFilteredScan with Logging {
 
+  imbueConfigs(sqlContext)
+
   val spark: SparkSession = sqlContext.sparkSession
 
   val (tableAvroSchema, _) = {
@@ -118,6 +121,11 @@ class CDCRelation(
     )
     cdcRdd.asInstanceOf[RDD[InternalRow]]
   }
+
+  def imbueConfigs(sqlContext: SQLContext): Unit = {
+    // Disable vectorized reading for CDC relation
+    sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.enableVectorizedReader", "false")
+  }
 }
 
 object CDCRelation {
Re: [PR] [HUDI-7769] Fix Hudi CDC read with legacy parquet file format on Spark [hudi]
yihua merged PR #11242: URL: https://github.com/apache/hudi/pull/11242
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2140088833 ## CI report: * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN * 540d122ed1f6c9ee56730ec85fde9f0355b5d67a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23935) * 63737caa30a0ba2ccc66b05bbeb3005d185eb4b7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24148)
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2140060183 ## CI report: * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN * 540d122ed1f6c9ee56730ec85fde9f0355b5d67a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23935) * 63737caa30a0ba2ccc66b05bbeb3005d185eb4b7 UNKNOWN
Re: [PR] [HUDI-7810] Fix OptionsResolver#allowCommitOnEmptyBatch default value… [hudi]
hudi-bot commented on PR #11359: URL: https://github.com/apache/hudi/pull/11359#issuecomment-2140031009 ## CI report: * 4b149d9085498be66c6426b0c3fde90ddf382cec UNKNOWN * 9ce101ca9d0c194af5b31b533c83fb21549ca8d3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24146)
[jira] [Closed] (HUDI-7407) Add optional clean support to standalone compaction and clustering jobs
[ https://issues.apache.org/jira/browse/HUDI-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit closed HUDI-7407. - Resolution: Fixed > Add optional clean support to standalone compaction and clustering jobs > --- > > Key: HUDI-7407 > URL: https://issues.apache.org/jira/browse/HUDI-7407 > Project: Apache Hudi > Issue Type: Improvement > Components: table-service >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > > Lets add top level config to standalone compaction and clustering job to > optionally clean.
Re: [PR] [HUDI-7810] Fix OptionsResolver#allowCommitOnEmptyBatch default value… [hudi]
hudi-bot commented on PR #11359: URL: https://github.com/apache/hudi/pull/11359#issuecomment-2139791741 ## CI report: * c8bf966468abfcab8121f7ba7a63f8098bbf965a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24122) * 4b149d9085498be66c6426b0c3fde90ddf382cec UNKNOWN * 9ce101ca9d0c194af5b31b533c83fb21549ca8d3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24146)
Re: [PR] [HUDI-7407] Making clean optional in standalone compaction and clustering jobs [hudi]
codope merged PR #10668: URL: https://github.com/apache/hudi/pull/10668
(hudi) branch master updated: [HUDI-7407] Making clean optional in standalone compaction and clustering jobs (#10668)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new f0c1a88f8d0 [HUDI-7407] Making clean optional in standalone compaction and clustering jobs (#10668) f0c1a88f8d0 is described below commit f0c1a88f8d0de9f06d2838c32cdc276444f8afa3 Author: Sivabalan Narayanan AuthorDate: Thu May 30 07:50:08 2024 -0700 [HUDI-7407] Making clean optional in standalone compaction and clustering jobs (#10668) * Making clean optional in standalone compaction and clustering standlaone jobs --- .../apache/hudi/utilities/HoodieClusteringJob.java | 5 +++- .../org/apache/hudi/utilities/HoodieCompactor.java | 8 +++-- .../hudi/utilities/multitable/CleanTask.java | 1 + .../hudi/utilities/multitable/ClusteringTask.java | 1 + .../hudi/utilities/multitable/CompactionTask.java | 1 + .../offlinejob/TestHoodieClusteringJob.java| 34 +- .../offlinejob/TestHoodieCompactorJob.java | 28 ++ 7 files changed, 49 insertions(+), 29 deletions(-) diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java index 8e017152407..0a0b1f3b886 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java @@ -92,6 +92,8 @@ public class HoodieClusteringJob { public String sparkMemory = null; @Parameter(names = {"--retry", "-rt"}, description = "number of retries") public int retry = 0; +@Parameter(names = {"--skip-clean", "-sc"}, description = "do not trigger clean after compaction", required = false) +public Boolean skipClean = true; @Parameter(names = {"--schedule", "-sc"}, description = "Schedule clustering @desperate soon please use \"--mode schedule\" instead") public Boolean runSchedule = false; @@ -131,6 
+133,7 @@ public class HoodieClusteringJob { + " --spark-master " + sparkMaster + ", \n" + " --spark-memory " + sparkMemory + ", \n" + " --retry " + retry + ", \n" + + " --skipClean " + skipClean + ", \n" + " --schedule " + runSchedule + ", \n" + " --retry-last-failed-clustering-job " + retryLastFailedClusteringJob + ", \n" + " --mode " + runningMode + ", \n" @@ -297,7 +300,7 @@ public class HoodieClusteringJob { } private void clean(SparkRDDWriteClient client) { -if (client.getConfig().isAutoClean()) { +if (!cfg.skipClean && client.getConfig().isAutoClean()) { client.clean(); } } diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java index 42633ee5558..e8e94126118 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java @@ -94,6 +94,8 @@ public class HoodieCompactor { public String sparkMemory = null; @Parameter(names = {"--retry", "-rt"}, description = "number of retries", required = false) public int retry = 0; +@Parameter(names = {"--skip-clean", "-sc"}, description = "do not trigger clean after compaction", required = false) +public Boolean skipClean = true; @Parameter(names = {"--schedule", "-sc"}, description = "Schedule compaction", required = false) public Boolean runSchedule = false; @Parameter(names = {"--mode", "-m"}, description = "Set job mode: Set \"schedule\" means make a compact plan; " @@ -124,6 +126,7 @@ public class HoodieCompactor { + " --schema-file " + schemaFile + ", \n" + " --spark-master " + sparkMaster + ", \n" + " --spark-memory " + sparkMemory + ", \n" + + " --skipClean " + skipClean + ", \n" + " --retry " + retry + ", \n" + " --schedule " + runSchedule + ", \n" + " --mode " + runningMode + ", \n" @@ -150,6 +153,7 @@ public class HoodieCompactor { && Objects.equals(sparkMaster, config.sparkMaster) && 
Objects.equals(sparkMemory, config.sparkMemory) && Objects.equals(retry, config.retry) + && Objects.equals(skipClean, config.skipClean) && Objects.equals(runSchedule, config.runSchedule) && Objects.equals(runningMode, config.runningMode) && Objects.equals(strategyClassName, config.strategyClassName) @@ -160,7 +164,7 @@ public class HoodieCompactor { @Override public int hashCode() { return Objects.hash(basePath, tableName, compactionInstantTime, schemaFile, - sparkMaster, parallelism, sparkMemory, retry,
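The behavioral change in HoodieClusteringJob#clean above reduces to a small predicate: clean runs only when `--skip-clean` is false AND auto-clean is enabled. A minimal sketch of that guard (hypothetical helper, not Hudi code), shown with the new default `skipClean = true`:

```java
// Hedged sketch of the guard added in HoodieClusteringJob#clean.
// shouldClean mirrors: if (!cfg.skipClean && client.getConfig().isAutoClean()) { client.clean(); }
public class SkipCleanSketch {
    static boolean shouldClean(boolean skipClean, boolean autoCleanEnabled) {
        // Clean is triggered only when it is not explicitly skipped
        // and auto-clean is enabled in the write config.
        return !skipClean && autoCleanEnabled;
    }

    public static void main(String[] args) {
        // New default skipClean=true: clean is skipped even when auto-clean is on.
        System.out.println(shouldClean(true, true));   // false
        System.out.println(shouldClean(false, true));  // true
        System.out.println(shouldClean(false, false)); // false
    }
}
```

Note that with the default flipped to `skipClean = true`, standalone jobs no longer clean unless the operator opts in, which is the intent of making clean optional.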
Re: [PR] [HUDI-7810] Fix OptionsResolver#allowCommitOnEmptyBatch default value… [hudi]
hudi-bot commented on PR #11359: URL: https://github.com/apache/hudi/pull/11359#issuecomment-2139638319 ## CI report: * c8bf966468abfcab8121f7ba7a63f8098bbf965a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24122) * 4b149d9085498be66c6426b0c3fde90ddf382cec UNKNOWN * 9ce101ca9d0c194af5b31b533c83fb21549ca8d3 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7810] Fix OptionsResolver#allowCommitOnEmptyBatch default value… [hudi]
usberkeley commented on code in PR #11359: URL: https://github.com/apache/hudi/pull/11359#discussion_r1620794903 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java: ## @@ -370,7 +370,7 @@ public static ConflictResolutionStrategy getConflictResolutionStrategy(Configura * Returns whether to commit even when current batch has no data, for flink defaults false */ public static boolean allowCommitOnEmptyBatch(Configuration conf) { -return conf.getBoolean(HoodieWriteConfig.ALLOW_EMPTY_COMMIT.key(), false); +return conf.getBoolean(HoodieWriteConfig.ALLOW_EMPTY_COMMIT.key(), HoodieWriteConfig.ALLOW_EMPTY_COMMIT.defaultValue()); Review Comment: After the default return value of OptionsResolver#allowCommitOnEmptyBatch is corrected to "true", StreamWriteOperatorCoordinator submits an empty commit or delta commit when a checkpoint completes; when the test then queries the latest commit, that commit is empty, so the returned result is also empty and the unit test fails. Modification plan: when creating the Hudi table, set hoodie.allow.empty.commit = false. Alternative considered: change the default value of "hoodie.allow.empty.commit" itself, but I think that is not good enough. The reason: changing the default to "false" would contradict the documented and coded default of "true", and submitting an empty commit by default is important in Flink because it lets the entire life cycle be tracked. Therefore, that solution is not adopted.
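The bug under discussion is a hardcoded fallback (`false`) that silently diverged from the `ConfigProperty`'s declared default (`true`). A minimal sketch of that failure mode, using hypothetical stand-in classes (not actual Hudi code):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Hudi's ConfigProperty, just enough to show the bug.
class ConfigProperty<T> {
    private final String key;
    private final T defaultValue;
    ConfigProperty(String key, T defaultValue) { this.key = key; this.defaultValue = defaultValue; }
    String key() { return key; }
    T defaultValue() { return defaultValue; }
}

public class DefaultValueSketch {
    // Stand-in for HoodieWriteConfig.ALLOW_EMPTY_COMMIT, whose declared default is true.
    static final ConfigProperty<Boolean> ALLOW_EMPTY_COMMIT =
        new ConfigProperty<>("hoodie.allow.empty.commit", true);

    // Stand-in for Configuration#getBoolean(key, fallback).
    static boolean getBoolean(Map<String, String> conf, String key, boolean fallback) {
        String v = conf.get(key);
        return v == null ? fallback : Boolean.parseBoolean(v);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>(); // the user did not set the key

        // Before the fix: hardcoded fallback contradicts the declared default.
        boolean buggy = getBoolean(conf, ALLOW_EMPTY_COMMIT.key(), false);
        // After the fix: the ConfigProperty is the single source of truth.
        boolean fixed = getBoolean(conf, ALLOW_EMPTY_COMMIT.key(), ALLOW_EMPTY_COMMIT.defaultValue());

        System.out.println(buggy + " " + fixed); // false true
    }
}
```

The fix keeps the `ConfigProperty` as the single source of truth for the default, which is exactly what `HoodieWriteConfig.ALLOW_EMPTY_COMMIT.defaultValue()` does in the patch above.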
Re: [PR] [HUDI-7407] Making clean optional in standalone compaction and clustering jobs [hudi]
hudi-bot commented on PR #10668: URL: https://github.com/apache/hudi/pull/10668#issuecomment-2139636243 ## CI report: * 5a6c7723f716d5719a8011150f73077ab1ba3a1f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24145)
Re: [PR] [HUDI-7810] Fix OptionsResolver#allowCommitOnEmptyBatch default value… [hudi]
usberkeley commented on code in PR #11359: URL: https://github.com/apache/hudi/pull/11359#discussion_r1620734871 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java: ## @@ -370,7 +370,7 @@ public static ConflictResolutionStrategy getConflictResolutionStrategy(Configura * Returns whether to commit even when current batch has no data, for flink defaults false */ public static boolean allowCommitOnEmptyBatch(Configuration conf) { -return conf.getBoolean(HoodieWriteConfig.ALLOW_EMPTY_COMMIT.key(), false); +return conf.getBoolean(HoodieWriteConfig.ALLOW_EMPTY_COMMIT.key(), HoodieWriteConfig.ALLOW_EMPTY_COMMIT.defaultValue()); Review Comment: Why the original TestHoodieFlinkQuickstart no longer passes: after the default return value of OptionsResolver#allowCommitOnEmptyBatch is corrected to "true", StreamWriteOperatorCoordinator submits an empty commit or delta commit when a checkpoint completes; when the test then queries the latest commit, that commit is empty, the returned result is also empty, and the unit test fails. Modification plan: when creating the Hudi table, set hoodie.allow.empty.commit = false. Alternative considered: change the default value itself, but I think that is not good enough. The reason: changing hoodie.allow.empty.commit to false would contradict the default of "true" in the official documentation and code, and submitting an empty commit by default is important in Flink because it lets the entire life cycle be tracked. Therefore, that solution is not adopted.
Re: [PR] [HUDI-7810] Fix OptionsResolver#allowCommitOnEmptyBatch default value… [hudi]
hudi-bot commented on PR #11359: URL: https://github.com/apache/hudi/pull/11359#issuecomment-2139622234 ## CI report: * c8bf966468abfcab8121f7ba7a63f8098bbf965a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24122) * 4b149d9085498be66c6426b0c3fde90ddf382cec UNKNOWN
(hudi) annotated tag release-0.15.0-rc3 updated (d0df1d4a94d -> 987b4dd1741)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a change to annotated tag release-0.15.0-rc3 in repository https://gitbox.apache.org/repos/asf/hudi.git *** WARNING: tag release-0.15.0-rc3 was modified! *** from d0df1d4a94d (commit) to 987b4dd1741 (tag) tagging d0df1d4a94d13cfc061faaf1a9573c886811c104 (commit) replaces release-0.15.0-rc2 by Y Ethan Guo on Thu May 30 06:57:38 2024 -0700 - Log - release-0.15.0 -BEGIN PGP SIGNATURE- iQIzBAABCAAdFiEEDE0xZCfsqnGiCtlma+HUVMkPXqUFAmZYhdIACgkQa+HUVMkP XqWSBw/8CQjJg5FX+NzO9xqWwNXvki6vAVwF2IDHdcDRh3L10w3WGF8K4J/+aqGP 0UZgC9FlmX/pNQoAHJ5HmUl3lElDj1/K/1ek6wp7HGGuRVvPLOoCWmYnYXqH0apa QNcdTxUG0sgDb0NL7us4eVCVpwgW45+e4NmCJONEMCFtn2MmnWROG1Anj/AaF6WY WXAfh8N8zMnPTE2hopBRRLGCf9wrh8s8GqsL+qx6Jmp9rSYo+9xW/Xc0BflEF2Sl Regg9wWzSN1ukqFf8dI7PA1qnBGhITMCIfurrcKQG5O0jG04lawCPT/VnX06UjRJ 3Q9zX7WpkgrCGi8t2uIOoVOrRvJPospytSNTRpPGAPjqz04d2QYsuzyhwXjfhXqK 5XuKU3Ps8wqhvGsU6rYbZnI41MVOSrwJHjNq1kq+YR7jD/tZIzYr2luMViIkkm+X MQwvl8q/qLo3YxNGjUQrUZtgMlibWaDdsKCFBjDGPBPhS2gNMnnJ9a40U12pdVdr R9y6aWPCuxCImu7PhTwA8GJIPZe4oXQU++0Tdm3ucwvWDuD2wReYfRHs2GezfLHN keo220YqDjxWInJ6TLeHAM98ApBzgmB2lQNzTETR8zDnLCpOE90OfF0OpJQerU7Q R+OMsT/ncoSw3ZZfEq91qxIV4d4fRMjffcSyLKs/bff1ESNCPJc= =0Jvi -END PGP SIGNATURE- --- No new revisions were added by this update. Summary of changes:
svn commit: r69471 - in /dev/hudi/hudi-0.15.0-rc3: ./ hudi-0.15.0-rc3.src.tgz hudi-0.15.0-rc3.src.tgz.asc hudi-0.15.0-rc3.src.tgz.sha512
Author: yihua Date: Thu May 30 13:52:32 2024 New Revision: 69471 Log: Add Apache Hudi 0.15.0 RC3 source release Added: dev/hudi/hudi-0.15.0-rc3/ dev/hudi/hudi-0.15.0-rc3/hudi-0.15.0-rc3.src.tgz (with props) dev/hudi/hudi-0.15.0-rc3/hudi-0.15.0-rc3.src.tgz.asc dev/hudi/hudi-0.15.0-rc3/hudi-0.15.0-rc3.src.tgz.sha512 Added: dev/hudi/hudi-0.15.0-rc3/hudi-0.15.0-rc3.src.tgz == Binary file - no diff available. Propchange: dev/hudi/hudi-0.15.0-rc3/hudi-0.15.0-rc3.src.tgz -- svn:mime-type = application/octet-stream Added: dev/hudi/hudi-0.15.0-rc3/hudi-0.15.0-rc3.src.tgz.asc == --- dev/hudi/hudi-0.15.0-rc3/hudi-0.15.0-rc3.src.tgz.asc (added) +++ dev/hudi/hudi-0.15.0-rc3/hudi-0.15.0-rc3.src.tgz.asc Thu May 30 13:52:32 2024 @@ -0,0 +1,16 @@ +-BEGIN PGP SIGNATURE- + +iQIzBAABCAAdFiEEiIqTQeYA64VQqs1e+xt1BPf3cMkFAmZYJ2cACgkQ+xt1BPf3 +cMmPyBAAmdcXwkeX3aTkvu/GwrgX7KOcfSSRzoNgPnQaLqv7Y8H+8M97O2auIoCY +10ymaSaqEr9gVZ56fwx0q3YAQJKRdrc+jZ4m70OFVM602Gs0itDLT1SEn4c7LOK/ +YAsNFCBD+vfZRH3vUERDTMlmmHgOF48cnNw/SOdKTJOT/LYr+G9CrzWIQhTg44C2 +JElQNZj+3Sv1J0foVm2Fmsva7DB6JOYF3bpy0VZvqJRZgWBMc9Nuj3lcRlK0qGvl +OZ5sr2T6czMt4CELj6wtSMOEL1knlc+luJLbrwueO2srRu3Kl/fhU/KYPmiaWqA9 +e+SUIJ5lJMUU/Dn+rnV4m9SDIOGcHnf+rJFC0C+0ALfIo76GAqYucF6ALfsEFgMn +vAoOzJ0SZ2fIlqavG3U/0YaO92457Tqmsnr6ahsf9LoavUDLleRng0+OiKuzLa/H +Ick1qQVDLSZrf3gfqIWmVldaovWBOo1A1jaFCGjz41b5CUsPbT8VFCGpMyuI49Ns +LUnglCcXfLXzcuLxy3awhJp3YGYC8m3ombg/HtFGBIq/4XH8e8Q1FCdzgr2GSgdz +2F7JvK9ruxq3JTqacPIJKTW4TwuQpQtWvARgWDP5cHGmTGprzF5DG2s2FvHr+PrV ++Fp3G+6B/RG08TO8oj41OdLJ5D2ImXF5VHzOyok4Ijo8SEWb9qE= +=Nl7f +-END PGP SIGNATURE- Added: dev/hudi/hudi-0.15.0-rc3/hudi-0.15.0-rc3.src.tgz.sha512 == --- dev/hudi/hudi-0.15.0-rc3/hudi-0.15.0-rc3.src.tgz.sha512 (added) +++ dev/hudi/hudi-0.15.0-rc3/hudi-0.15.0-rc3.src.tgz.sha512 Thu May 30 13:52:32 2024 @@ -0,0 +1 @@ +3bf244b3a396e66849b9eb6e96e5e2debd5aba282a469249741cb827a0cceb7d92235f6ffe276e80baacec8fe797ec0180fdfd1fa784804bdb5920c9f0d7e892 hudi-0.15.0-rc3.src.tgz
Re: [PR] [HUDI-7407] Making clean optional in standalone compaction and clustering jobs [hudi]
hudi-bot commented on PR #10668: URL: https://github.com/apache/hudi/pull/10668#issuecomment-2139516375 ## CI report: * b24eafcc00d5cf4a27ae7f9d7e70b1bfc5a12b1a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24111) * 5a6c7723f716d5719a8011150f73077ab1ba3a1f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24145)
Re: [PR] [HUDI-7407] Making clean optional in standalone compaction and clustering jobs [hudi]
hudi-bot commented on PR #10668: URL: https://github.com/apache/hudi/pull/10668#issuecomment-2139501982 ## CI report: * b24eafcc00d5cf4a27ae7f9d7e70b1bfc5a12b1a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24111) * 5a6c7723f716d5719a8011150f73077ab1ba3a1f UNKNOWN
Re: [PR] [HUDI-7407] Making clean optional in standalone compaction and clustering jobs [hudi]
codope commented on code in PR #10668: URL: https://github.com/apache/hudi/pull/10668#discussion_r1620652100 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java: ## @@ -92,6 +92,8 @@ public static class Config implements Serializable { public String sparkMemory = null; @Parameter(names = {"--retry", "-rt"}, description = "number of retries") public int retry = 0; +@Parameter(names = {"--skip-clean", "-sc"}, description = "do not trigger clean after compaction", required = false) +public Boolean skipClean = true; Review Comment: not changing.. should be fine as it's an offline job.
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
KnightChess commented on code in PR #11043: URL: https://github.com/apache/hudi/pull/11043#discussion_r1620531072 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestBloomFiltersIndexSupport.scala: ## @@ -0,0 +1,261 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.functional + +import org.apache.hudi.DataSourceWriteOptions._ +import org.apache.hudi.common.config.{HoodieMetadataConfig, TypedProperties} +import org.apache.hudi.common.model.{FileSlice, HoodieTableType} +import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient} +import org.apache.hudi.common.testutils.RawTripTestPayload.recordsToStrings +import org.apache.hudi.config.HoodieWriteConfig +import org.apache.hudi.metadata.HoodieMetadataFileSystemView +import org.apache.hudi.testutils.HoodieSparkClientTestBase +import org.apache.hudi.util.{JFunction, JavaConversions} +import org.apache.hudi.{DataSourceReadOptions, DataSourceWriteOptions, HoodieFileIndex} +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, Expression, Literal} +import org.apache.spark.sql.functions.{col, not} +import org.apache.spark.sql.types.StringType +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession} +import org.junit.jupiter.api.Assertions.{assertEquals, assertTrue} +import org.junit.jupiter.api.{AfterEach, BeforeEach, Test} +import org.junit.jupiter.params.ParameterizedTest +import org.junit.jupiter.params.provider.EnumSource + +import java.util.concurrent.atomic.AtomicInteger +import java.util.stream.Collectors +import scala.collection.JavaConverters._ +import scala.collection.{JavaConverters, mutable} + +class TestBloomFiltersIndexSupport extends HoodieSparkClientTestBase { + + val sqlTempTable = "hudi_tbl_bloom" + var spark: SparkSession = _ + var instantTime: AtomicInteger = _ + val metadataOpts: Map[String, String] = Map( +HoodieMetadataConfig.ENABLE.key -> "true", +HoodieMetadataConfig.ENABLE_METADATA_INDEX_BLOOM_FILTER.key -> "true", +HoodieMetadataConfig.BLOOM_FILTER_INDEX_FOR_COLUMNS.key -> "_row_key" + ) + val commonOpts: Map[String, String] = Map( +"hoodie.insert.shuffle.parallelism" -> "4", +"hoodie.upsert.shuffle.parallelism" -> "4", +HoodieWriteConfig.TBL_NAME.key -> "hoodie_test", 
+RECORDKEY_FIELD.key -> "_row_key", +PARTITIONPATH_FIELD.key -> "partition", +PRECOMBINE_FIELD.key -> "timestamp", +HoodieTableConfig.POPULATE_META_FIELDS.key -> "true" + ) ++ metadataOpts + var mergedDfList: List[DataFrame] = List.empty + + @BeforeEach + override def setUp(): Unit = { +initPath() +initSparkContexts() +initHoodieStorage() +initTestDataGenerator() + +setTableName("hoodie_test") +initMetaClient() + +instantTime = new AtomicInteger(1) + +spark = sqlContext.sparkSession + } + + @AfterEach + override def tearDown(): Unit = { +cleanupFileSystem() +cleanupSparkContexts() + } + + @ParameterizedTest + @EnumSource(classOf[HoodieTableType]) + def testIndexInitialization(tableType: HoodieTableType): Unit = { +val hudiOpts = commonOpts + (DataSourceWriteOptions.TABLE_TYPE.key -> tableType.name()) +doWriteAndValidateBloomFilters( + hudiOpts, + operation = DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, + saveMode = SaveMode.Overwrite) + } + + /** + * Test case to do a write with updates and then validate file pruning using bloom filters. + */ + @Test + def testBloomFiltersIndexFilePruning(): Unit = { +var hudiOpts = commonOpts +hudiOpts = hudiOpts + ( + DataSourceReadOptions.ENABLE_DATA_SKIPPING.key -> "true") + +doWriteAndValidateBloomFilters( + hudiOpts, + operation = DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, + saveMode = SaveMode.Overwrite, + shouldValidate = false) +doWriteAndValidateBloomFilters( + hudiOpts, + operation = DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL, + saveMode = SaveMode.Append) + +createTempTable(hudiOpts) +verifyQueryPredicate(hudiOpts) + } + + private def createTempTable(hudiOpts: Map[String, String]): Unit = { +val readDf =
[I] [SUPPORT] using spark's observe feature on dataframes saved by hudi is stuck [hudi]
szingerpeter opened a new issue, #11367: URL: https://github.com/apache/hudi/issues/11367 **Describe the problem you faced** When trying to use the [observe](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.observe.html) function on dataframes saved by hudi, the application gets stuck after saving the data and trying to retrieve the statistics. **To Reproduce** Steps to reproduce the behavior:
```
from pyspark.sql import DataFrame, Observation
from pyspark.sql import functions as F

observation = Observation()
df = spark.createDataFrame([[1, 1], [2, 2], [3, 3], [4, 4]])
df = df.observe(observation, F.count(F.lit(1)).alias('row_count'))
df.write.format('csv').mode('overwrite').save('file:/opt/spark/work-dir/test_csv')
observation.get
# returns: {'row_count': 4}

observation2 = Observation()
df2 = spark.createDataFrame([[1, 1], [2, 2], [3, 3], [4, 4]])
hudi_options = {
    'hoodie.table.name': 'test',
    'hoodie.datasource.write.recordkey.field': '_1',
    'hoodie.datasource.write.partitionpath.field': '',
    'hoodie.datasource.write.table.name': 'test',
    'hoodie.datasource.write.operation': 'insert_overwrite',
    'hoodie.datasource.write.precombine.field': '_2',
}
df2 = df2.observe(observation2, F.count(F.lit(1)).alias('row_count'))
df.write.format("hudi").\
    options(**hudi_options).\
    mode("overwrite").\
    save('file:/opt/spark/work-dir/test')
observation2.get
# gets stuck
```
Disclaimer: I know there are hudi metrics and callbacks; however, I would like to add some more advanced quality checks to our applications. **Environment Description**
* Hudi version : 0.13.0-amzn-0
* Spark version : 3.3.2
* Hive version : 3.1.3
* Hadoop version : 3.3.3
* Storage (HDFS/S3/GCS..) : HDFS/S3 both
* Running on Docker? (yes/no) : both
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
codope merged PR #11146: URL: https://github.com/apache/hudi/pull/11146
(hudi) branch master updated: [HUDI-7146] Implement secondary index write path (#11146)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new cd62c31f368 [HUDI-7146] Implement secondary index write path (#11146) cd62c31f368 is described below commit cd62c31f368d6939c246bd58b77887104c4ca776 Author: Sagar Sumit AuthorDate: Thu May 30 15:56:51 2024 +0530 [HUDI-7146] Implement secondary index write path (#11146) Main changes in this PR are for secondary index write path: New index type added in MetadataPartitionType Initialization of the new index in HoodieBackedTableMetadataWriter Util methods to support index creation and update in HoodieTableMetadataUtil Changes to HoodieBackedTableMetadataWriter to handle update and deletes for secondary index. New APIs in HoodieTableMetadata and their implementation in BaseTableMetadata and HoodieBackedTableMetadata to load secondary index. Changes in HoodieMergedLogRecordScanner to merge secondary index payloads. 
--- .../org/apache/hudi/config/HoodieWriteConfig.java | 18 +- .../org/apache/hudi/index/HoodieIndexUtils.java| 4 +- .../metadata/HoodieBackedTableMetadataWriter.java | 215 - .../action/index/ScheduleIndexActionExecutor.java | 2 +- .../BaseHoodieFunctionalIndexClient.java | 2 +- .../apache/hudi/index/TestHoodieIndexUtils.java| 14 +- .../FlinkHoodieBackedTableMetadataWriter.java | 15 +- .../JavaHoodieBackedTableMetadataWriter.java | 15 +- .../SparkHoodieBackedTableMetadataWriter.java | 27 ++- hudi-common/src/main/avro/HoodieMetadata.avsc | 28 +++ ...lIndexConfig.java => HoodieIndexingConfig.java} | 29 +-- .../hudi/common/config/HoodieMetadataConfig.java | 32 +++ ...xDefinition.java => HoodieIndexDefinition.java} | 30 +-- ...IndexMetadata.java => HoodieIndexMetadata.java} | 31 +-- .../hudi/common/table/HoodieTableMetaClient.java | 54 +++--- .../common/table/log/HoodieFileSliceReader.java| 4 +- .../hudi/common/table/log/LogFileIterator.java | 0 .../hudi/keygen/constant/KeyGeneratorOptions.java | 7 + .../apache/hudi/metadata/BaseTableMetadata.java| 13 ++ .../hudi/metadata/HoodieBackedTableMetadata.java | 128 +--- .../hudi/metadata/HoodieMetadataPayload.java | 80 +++- .../hudi/metadata/HoodieTableMetadataUtil.java | 152 ++- .../hudi/metadata/MetadataPartitionType.java | 32 ++- .../hudi/metadata/TestMetadataPartitionType.java | 30 ++- .../hudi/HoodieSparkFunctionalIndexClient.java | 22 ++- .../scala/org/apache/hudi/DataSourceOptions.scala | 7 + .../org/apache/hudi/FunctionalIndexSupport.scala | 6 +- .../org/apache/hudi/HoodieSparkSqlWriter.scala | 1 + .../spark/sql/hudi/command/IndexCommands.scala | 2 +- .../hudi/functional/RecordLevelIndexTestBase.scala | 19 ++ .../hudi/functional/SecondaryIndexTestBase.scala | 65 +++ .../functional}/TestFunctionalIndex.scala | 25 +-- .../functional/TestSecondaryIndexWithSql.scala | 98 ++ 33 files changed, 995 insertions(+), 212 deletions(-) diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java index be32ad8ac34..86a412fac64 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java @@ -27,7 +27,7 @@ import org.apache.hudi.common.config.ConfigGroups; import org.apache.hudi.common.config.ConfigProperty; import org.apache.hudi.common.config.HoodieCommonConfig; import org.apache.hudi.common.config.HoodieConfig; -import org.apache.hudi.common.config.HoodieFunctionalIndexConfig; +import org.apache.hudi.common.config.HoodieIndexingConfig; import org.apache.hudi.common.config.HoodieMemoryConfig; import org.apache.hudi.common.config.HoodieMetadataConfig; import org.apache.hudi.common.config.HoodieMetaserverConfig; @@ -801,7 +801,7 @@ public class HoodieWriteConfig extends HoodieConfig { private HoodieCommonConfig commonConfig; private HoodieStorageConfig storageConfig; private HoodieTimeGeneratorConfig timeGeneratorConfig; - private HoodieFunctionalIndexConfig functionalIndexConfig; + private HoodieIndexingConfig indexingConfig; private EngineType engineType; /** @@ -1199,7 +1199,7 @@ public class HoodieWriteConfig extends HoodieConfig { this.storageConfig = HoodieStorageConfig.newBuilder().fromProperties(props).build(); this.timeGeneratorConfig = HoodieTimeGeneratorConfig.newBuilder().fromProperties(props) .withDefaultLockProvider(!isLockRequired()).build(); -this.functionalIndexConfig =
Re: [PR] [HUDI-7815] Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh timeline [hudi]
hudi-bot commented on PR #11365: URL: https://github.com/apache/hudi/pull/11365#issuecomment-2139068484 ## CI report: * 8147454d905761bd2256aac273ef69aa1e56fba8 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24143)
Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]
hudi-bot commented on PR #11162: URL: https://github.com/apache/hudi/pull/11162#issuecomment-2139067855 ## CI report: * a602c9c4234062e66877fc4bf2c50f94f43767bc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24142)
Re: [I] [SUPPORT] Hudi Sink Connector shows broker disconnected [hudi]
prabodh1194 commented on issue #9070: URL: https://github.com/apache/hudi/issues/9070#issuecomment-2139020981 but still facing a bunch of issues in the java classpath.
Re: [PR] [HUDI-7815] Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh timeline [hudi]
xuzifu666 commented on code in PR #11365: URL: https://github.com/apache/hudi/pull/11365#discussion_r1620230806 ## hudi-common/src/main/java/org/apache/hudi/common/util/ClusteringUtils.java: ## @@ -69,7 +69,7 @@ public class ClusteringUtils { public static Stream> getAllPendingClusteringPlans( HoodieTableMetaClient metaClient) { List pendingReplaceInstants = - metaClient.getActiveTimeline().filterPendingReplaceTimeline().getInstants(); + metaClient.reloadActiveTimeline().filterPendingReplaceTimeline().getInstants(); Review Comment: With multiple writers this refresh does not happen, and a heartbeat that is not set long enough can trigger the problem; all subsequent jobs would then fail.
Re: [PR] [HUDI-7815] Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh timeline [hudi]
danny0405 commented on code in PR #11365: URL: https://github.com/apache/hudi/pull/11365#discussion_r1620205175 ## hudi-common/src/main/java/org/apache/hudi/common/util/ClusteringUtils.java: ## @@ -69,7 +69,7 @@ public class ClusteringUtils { public static Stream> getAllPendingClusteringPlans( HoodieTableMetaClient metaClient) { List pendingReplaceInstants = - metaClient.getActiveTimeline().filterPendingReplaceTimeline().getInstants(); + metaClient.reloadActiveTimeline().filterPendingReplaceTimeline().getInstants(); Review Comment: It looks like all the invokers already have the refreshed timeline.
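The disagreement in this thread is about cached versus re-read timeline state: `getActiveTimeline()` returns a snapshot cached on the meta client, while `reloadActiveTimeline()` re-reads instants, so pending replacecommits written by a concurrent writer become visible. A hypothetical sketch of that distinction (stand-in classes, not Hudi code):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for HoodieTableMetaClient: a cached timeline view
// versus a reloaded one that re-reads what concurrent writers have committed.
class MetaClientSketch {
    private final List<String> storageInstants = new ArrayList<>(); // what is actually on storage
    private List<String> cached = new ArrayList<>();                // snapshot held by this client

    // Simulates a concurrent writer creating a pending replacecommit.
    void commitFromAnotherWriter(String instant) { storageInstants.add(instant); }

    List<String> getActiveTimeline() { return cached; } // possibly stale snapshot

    List<String> reloadActiveTimeline() {               // re-read from storage
        cached = new ArrayList<>(storageInstants);
        return cached;
    }

    public static void main(String[] args) {
        MetaClientSketch meta = new MetaClientSketch();
        meta.reloadActiveTimeline();
        meta.commitFromAnotherWriter("20240530.replacecommit.requested");
        System.out.println(meta.getActiveTimeline().size());    // 0: pending plan missed
        System.out.println(meta.reloadActiveTimeline().size()); // 1: pending plan seen
    }
}
```

This is why the patch matters only for multi-writer setups: a single writer's cached timeline already reflects its own instants, which is danny0405's point that the invokers may already hold a refreshed timeline.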
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
hudi-bot commented on PR #11146: URL: https://github.com/apache/hudi/pull/11146#issuecomment-2138926167 ## CI report: * 470bc5f44e7a6658a8717ef1b77e92afcdd90087 UNKNOWN * 43f73661f79eb87ac52d29fa153b996a15f29b99 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24141)
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
hudi-bot commented on PR #11043: URL: https://github.com/apache/hudi/pull/11043#issuecomment-2138904248 ## CI report: * 541b544049e68b3d22cdf0f5159fbd9b0005d345 UNKNOWN * 6ece7645a69b367901c71ab78dea15f39d69fca5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24140)
[I] [SUPPORT] CVE problems in latest 0.14.1 [hudi]
Smith-Cruise opened a new issue, #11366: URL: https://github.com/apache/hudi/issues/11366 CVE jars were introduced by `hudi-common` (via the `hbase-server` and `hbase-client` transitive dependencies). Could you let me know if the community plans to resolve these CVE dependencies?

```bash
lib/hbase-protocol-shaded-2.4.18.jar
Total: 49 (UNKNOWN: 0, LOW: 0, MEDIUM: 3, HIGH: 26, CRITICAL: 20)

LIBRARY                                     | VULNERABILITY ID | SEVERITY | INSTALLED VERSION | FIXED VERSION | TITLE
com.fasterxml.jackson.core:jackson-databind | CVE-2017-15095   | CRITICAL | 2.4.0             | 2.9.4, 2.8.11 | jackson-databind: Unsafe deserialization due to incomplete black list (incomplete fix for CVE-2017-7525) --> avd.aquasec.com/nvd/cve-2017-15095

lib/htrace-core4-4.2.0-incubating.jar
Total: 49 (UNKNOWN: 0, LOW: 0, MEDIUM: 3, HIGH: 26, CRITICAL: 20)

LIBRARY                                     | VULNERABILITY ID | SEVERITY | INSTALLED VERSION | FIXED VERSION | TITLE
com.fasterxml.jackson.core:jackson-databind | CVE-2017-15095   | CRITICAL | 2.4.0             | 2.9.4, 2.8.11 | jackson-databind: Unsafe deserialization due to incomplete black list (incomplete fix for CVE-2017-7525) --> avd.aquasec.com/nvd/cve-2017-15095
```
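Until the hbase dependency chain is upgraded upstream, a common interim mitigation is to exclude the flagged transitive artifact and pin a fixed version in your own build. The fragment below is an illustrative sketch, not an endorsed fix: the Hudi coordinates and the pinned version are assumptions you should validate against your own dependency tree and the current CVE fix list.

```xml
<!-- Sketch only: exclude the vulnerable transitive jackson-databind
     pulled in via hudi-common, then pin a patched version explicitly.
     Version numbers here are illustrative assumptions. -->
<dependency>
  <groupId>org.apache.hudi</groupId>
  <artifactId>hudi-common</artifactId>
  <version>0.14.1</version>
  <exclusions>
    <exclusion>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <!-- 2.9.4 / 2.8.11 are the fix versions the scanner reports for CVE-2017-15095;
       prefer the newest patched release your stack supports. -->
  <version>2.9.4</version>
</dependency>
```

Verify the exclusion took effect with `mvn dependency:tree` before relying on it, since other modules may still pull the old version in.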
Re: [PR] [HUDI-7815] Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh timeline [hudi]
hudi-bot commented on PR #11365: URL: https://github.com/apache/hudi/pull/11365#issuecomment-2138822739 ## CI report: * 8147454d905761bd2256aac273ef69aa1e56fba8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24143)
Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]
hudi-bot commented on PR #11162: URL: https://github.com/apache/hudi/pull/11162#issuecomment-2138822123 ## CI report: * 9d0e80222f6cc69b2dba6f4cdbfc642f31a95e52 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24124) * a602c9c4234062e66877fc4bf2c50f94f43767bc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24142)
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
hudi-bot commented on PR #11146: URL: https://github.com/apache/hudi/pull/11146#issuecomment-2138821957 ## CI report: * 470bc5f44e7a6658a8717ef1b77e92afcdd90087 UNKNOWN * e8a4507886bc97b1819ea39788f2abd7385b8cf2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24139) * 43f73661f79eb87ac52d29fa153b996a15f29b99 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24141)
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
hudi-bot commented on PR #11043: URL: https://github.com/apache/hudi/pull/11043#issuecomment-2138821662 ## CI report: * 541b544049e68b3d22cdf0f5159fbd9b0005d345 UNKNOWN * 87c15b2c23430d967749dede5e09d74a33dcce88 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24128) * 6ece7645a69b367901c71ab78dea15f39d69fca5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24140)
Re: [I] With autogenerated keys HoodieStreamer failing with error - ts(Part -ts) field not found in record [hudi]
Sarfaraz-214 commented on issue #10233: URL: https://github.com/apache/hudi/issues/10233#issuecomment-2138816691 Hi @nsivabalan I am already using INSERT mode. Shared all the configs above. `hoodie.spark.sql.insert.into.operation=insert`
Re: [PR] [HUDI-7815] Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh timeline [hudi]
hudi-bot commented on PR #11365: URL: https://github.com/apache/hudi/pull/11365#issuecomment-2138810841 ## CI report: * 8147454d905761bd2256aac273ef69aa1e56fba8 UNKNOWN
Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]
hudi-bot commented on PR #11162: URL: https://github.com/apache/hudi/pull/11162#issuecomment-2138810306 ## CI report: * 9d0e80222f6cc69b2dba6f4cdbfc642f31a95e52 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24124) * a602c9c4234062e66877fc4bf2c50f94f43767bc UNKNOWN
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
hudi-bot commented on PR #11146: URL: https://github.com/apache/hudi/pull/11146#issuecomment-2138810132 ## CI report: * 470bc5f44e7a6658a8717ef1b77e92afcdd90087 UNKNOWN * e8a4507886bc97b1819ea39788f2abd7385b8cf2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24139) * 43f73661f79eb87ac52d29fa153b996a15f29b99 UNKNOWN
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
hudi-bot commented on PR #11043: URL: https://github.com/apache/hudi/pull/11043#issuecomment-2138809912 ## CI report: * 541b544049e68b3d22cdf0f5159fbd9b0005d345 UNKNOWN * 87c15b2c23430d967749dede5e09d74a33dcce88 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24128) * 6ece7645a69b367901c71ab78dea15f39d69fca5 UNKNOWN
Re: [PR] [HUDI-7146] Implement secondary index write path [hudi]
codope commented on code in PR #11146: URL: https://github.com/apache/hudi/pull/11146#discussion_r1620048869 ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieIndexDefinition.java: ## @@ -45,14 +49,14 @@ public class HoodieFunctionalIndexDefinition implements Serializable { // Any other configuration or properties specific to the index private Map<String, String> indexOptions; - public HoodieFunctionalIndexDefinition() { + public HoodieIndexDefinition() { } - public HoodieFunctionalIndexDefinition(String indexName, String indexType, String indexFunction, List<String> sourceFields, - Map<String, String> indexOptions) { + public HoodieIndexDefinition(String indexName, String indexType, String indexFunction, List<String> sourceFields, + Map<String, String> indexOptions) { this.indexName = indexName; this.indexType = indexType; - this.indexFunction = indexFunction; + this.indexFunction = nonEmpty(indexFunction) ? indexFunction : SPARK_IDENTITY; Review Comment: On second thought, it should not bind to a Spark function. I will correct it.
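The defaulting pattern in the diff above — fall back to some identity function when no index function is supplied — can be shown without the Spark binding the reviewer wants to remove. This is a standalone sketch, not Hudi's API: `IndexDefinitionSketch`, `IDENTITY_FUNCTION`, and the inlined `nonEmpty` check are invented for illustration, and `"identity"` merely stands in for whichever engine-neutral default the PR settles on.

```java
// Sketch of defaulting an optional constructor argument with a
// nonEmpty-style guard; names are illustrative, not Hudi's.
public class IndexDefinitionSketch {
    // Engine-neutral placeholder default (assumption for this sketch).
    static final String IDENTITY_FUNCTION = "identity";

    final String indexName;
    final String indexFunction;

    IndexDefinitionSketch(String indexName, String indexFunction) {
        this.indexName = indexName;
        // null or empty falls back to the default, mirroring
        // nonEmpty(indexFunction) ? indexFunction : <default> in the diff.
        this.indexFunction = (indexFunction != null && !indexFunction.isEmpty())
            ? indexFunction
            : IDENTITY_FUNCTION;
    }

    public static void main(String[] args) {
        System.out.println(new IndexDefinitionSketch("idx1", null).indexFunction);    // identity
        System.out.println(new IndexDefinitionSketch("idx2", "lower").indexFunction); // lower
    }
}
```

Keeping the default out of any engine-specific constant is what makes the definition reusable across write paths, which is the point of the review comment.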
[PR] [HUDI-7815] Multiple writer with bulkinsert getAllPendingClusteringPl… [hudi]
xuzifu666 opened a new pull request, #11365: URL: https://github.com/apache/hudi/pull/11365

### Change Logs

Multiple writer with bulkinsert: getAllPendingClusteringPlans should refresh the timeline.

Caused by: org.apache.hudi.exception.HoodieException: Error getting all file groups in pending clustering
    at org.apache.hudi.common.util.ClusteringUtils.getAllFileGroupsInPendingClusteringPlans(ClusteringUtils.java:135)
    at org.apache.hudi.common.table.view.AbstractTableFileSystemView.init(AbstractTableFileSystemView.java:113)
    at org.apache.hudi.common.table.view.HoodieTableFileSystemView.init(HoodieTableFileSystemView.java:108)
    at org.apache.hudi.common.table.view.HoodieTableFileSystemView.<init>(HoodieTableFileSystemView.java:102)
    at org.apache.hudi.common.table.view.HoodieTableFileSystemView.<init>(HoodieTableFileSystemView.java:93)
    at org.apache.hudi.metadata.HoodieMetadataFileSystemView.<init>(HoodieMetadataFileSystemView.java:44)
    at org.apache.hudi.common.table.view.FileSystemViewManager.createInMemoryFileSystemView(FileSystemViewManager.java:166)
    at org.apache.hudi.common.table.view.FileSystemViewManager.lambda$createViewManager$5fcdabfe$1(FileSystemViewManager.java:259)

### Impact

low

### Risk level (write none, low medium or high below)

low

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none"._
- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[jira] [Updated] (HUDI-7815) Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh timeline
[ https://issues.apache.org/jira/browse/HUDI-7815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7815:
    Labels: pull-request-available  (was: )

> Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh timeline
>
>           Key: HUDI-7815
>           URL: https://issues.apache.org/jira/browse/HUDI-7815
>       Project: Apache Hudi
>    Issue Type: Improvement
>    Components: spark-sql
>      Reporter: xy
>      Assignee: xy
>      Priority: Major
>        Labels: pull-request-available

-- This message was sent by Atlassian Jira (v8.20.10#820010)