[GitHub] [hudi] wuwenchi commented on a diff in pull request #6539: [HUDI-4739] Wrong value returned when key's length equals 1
wuwenchi commented on code in PR #6539: URL: https://github.com/apache/hudi/pull/6539#discussion_r958074279

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenUtils.java:

@@ -73,21 +73,16 @@ public static String getPartitionPathFromGenericRecord(GenericRecord genericReco
  */
 public static String[] extractRecordKeys(String recordKey) {
   String[] fieldKV = recordKey.split(",");
-  if (fieldKV.length == 1) {
-    return fieldKV;
-  } else {
-    // a complex key
-    return Arrays.stream(fieldKV).map(kv -> {
-      final String[] kvArray = kv.split(":");
-      if (kvArray[1].equals(NULL_RECORDKEY_PLACEHOLDER)) {
-        return null;
-      } else if (kvArray[1].equals(EMPTY_RECORDKEY_PLACEHOLDER)) {
-        return "";
-      } else {
-        return kvArray[1];
-      }
-    }).toArray(String[]::new);
-  }
+  return Arrays.stream(fieldKV).map(kv -> {
+    final String[] kvArray = kv.split(":");
+    if (kvArray[1].equals(NULL_RECORDKEY_PLACEHOLDER)) {
+      return null;
+    } else if (kvArray[1].equals(EMPTY_RECORDKEY_PLACEHOLDER)) {
+      return "";
+    } else {
+      return kvArray[1];
+    }
+  }).toArray(String[]::new);

Review Comment: ok

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] Zhangshunyu closed issue #6528: [SUPPORT]How to clean the compacted .log and .hfiles in metadata?
Zhangshunyu closed issue #6528: [SUPPORT]How to clean the compacted .log and .hfiles in metadata? URL: https://github.com/apache/hudi/issues/6528
[GitHub] [hudi] yihua commented on pull request #6506: Allow hoodie read client to choose index
yihua commented on PR #6506: URL: https://github.com/apache/hudi/pull/6506#issuecomment-1231181637 @parisni Could you also add the Jira ticket number?
[GitHub] [hudi] yihua commented on a diff in pull request #6506: Allow hoodie read client to choose index
yihua commented on code in PR #6506: URL: https://github.com/apache/hudi/pull/6506#discussion_r958036128

## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/HoodieReadClient.java:

@@ -92,6 +92,18 @@ public HoodieReadClient(HoodieSparkEngineContext context, String basePath, SQLCo
     this.sqlContextOpt = Option.of(sqlContext);
   }

+  /**
+   * @param context

Review Comment: nit: add meaningful docs here?

## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/HoodieReadClient.java:

@@ -92,6 +92,18 @@ public HoodieReadClient(HoodieSparkEngineContext context, String basePath, SQLCo
     this.sqlContextOpt = Option.of(sqlContext);
   }

+  /**
+   * @param context
+   * @param basePath
+   * @param sqlContext
+   * @param indexType
+   */
+  public HoodieReadClient(HoodieSparkEngineContext context, String basePath, SQLContext sqlContext, HoodieIndex.IndexType indexType) {

Review Comment: Is this going to be used in any query engine?
[GitHub] [hudi] hudi-bot commented on pull request #6537: Avoid update metastore schema if only missing column in input
hudi-bot commented on PR #6537: URL: https://github.com/apache/hudi/pull/6537#issuecomment-1231174900

## CI report:

* 9e63b76454a06d57a141ad4b844752abb346d3fa UNKNOWN
* 00b9224ec8c49e83ca51d52351c782083a4fba84 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11030)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] yihua commented on pull request #6515: [HUDI-4730] FIX Batch job cannot clean old commits&data files in clea…
yihua commented on PR #6515: URL: https://github.com/apache/hudi/pull/6515#issuecomment-1231174530 @danny0405 @XuQianJin-Stars is this PR good for merging?
[jira] [Assigned] (HUDI-4718) Hudi cli does not support Kerberized Hadoop cluster
[ https://issues.apache.org/jira/browse/HUDI-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xi chaomin reassigned HUDI-4718:
--------------------------------

    Assignee: Yao Zhang

> Hudi cli does not support Kerberized Hadoop cluster
> ---------------------------------------------------
>
>                 Key: HUDI-4718
>                 URL: https://issues.apache.org/jira/browse/HUDI-4718
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: cli
>            Reporter: Yao Zhang
>            Assignee: Yao Zhang
>            Priority: Major
>             Fix For: 0.13.0
>
> Hudi cli connect command cannot read table from Kerberized Hadoop cluster and
> there is no way to perform Kerberos authentication.
> I plan to add this feature.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] yihua commented on pull request #6535: [HUDI-4193] change protoc version so it compiles on m1 mac
yihua commented on PR #6535: URL: https://github.com/apache/hudi/pull/6535#issuecomment-1231148093 @xushiyan @nsivabalan There are two other PRs fixing the build issue around protoc: #6455 #5757. Shall we decide the approach here and land only one of these?
[GitHub] [hudi] yihua commented on a diff in pull request #6539: [HUDI-4739] Wrong value returned when key's length equals 1
yihua commented on code in PR #6539: URL: https://github.com/apache/hudi/pull/6539#discussion_r958009785

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenUtils.java:

@@ -73,21 +73,16 @@ public static String getPartitionPathFromGenericRecord(GenericRecord genericReco
  */
 public static String[] extractRecordKeys(String recordKey) {
   String[] fieldKV = recordKey.split(",");
-  if (fieldKV.length == 1) {
-    return fieldKV;
-  } else {
-    // a complex key
-    return Arrays.stream(fieldKV).map(kv -> {
-      final String[] kvArray = kv.split(":");
-      if (kvArray[1].equals(NULL_RECORDKEY_PLACEHOLDER)) {
-        return null;
-      } else if (kvArray[1].equals(EMPTY_RECORDKEY_PLACEHOLDER)) {
-        return "";
-      } else {
-        return kvArray[1];
-      }
-    }).toArray(String[]::new);
-  }
+  return Arrays.stream(fieldKV).map(kv -> {
+    final String[] kvArray = kv.split(":");
+    if (kvArray[1].equals(NULL_RECORDKEY_PLACEHOLDER)) {
+      return null;
+    } else if (kvArray[1].equals(EMPTY_RECORDKEY_PLACEHOLDER)) {
+      return "";
+    } else {
+      return kvArray[1];
+    }
+  }).toArray(String[]::new);

Review Comment: @wuwenchi could you add a unit test for the util method considering the fixed case?
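The patched `extractRecordKeys` behavior can be exercised standalone. Below is a minimal, self-contained Java sketch of the logic in the diff above; the class name is hypothetical, and the placeholder constant values are assumed here to be `__null__` and `__empty__`, mirroring the `KeyGenUtils` constants:

```java
import java.util.Arrays;

public class ExtractRecordKeysSketch {
  // Assumed placeholder values, mirroring the KeyGenUtils constants.
  static final String NULL_RECORDKEY_PLACEHOLDER = "__null__";
  static final String EMPTY_RECORDKEY_PLACEHOLDER = "__empty__";

  // Patched logic: every key, single- or multi-field, is parsed as
  // "field:value" pairs, so only the value part is returned.
  static String[] extractRecordKeys(String recordKey) {
    return Arrays.stream(recordKey.split(","))
        .map(kv -> {
          final String[] kvArray = kv.split(":");
          if (kvArray[1].equals(NULL_RECORDKEY_PLACEHOLDER)) {
            return null;
          } else if (kvArray[1].equals(EMPTY_RECORDKEY_PLACEHOLDER)) {
            return "";
          } else {
            return kvArray[1];
          }
        })
        .toArray(String[]::new);
  }

  public static void main(String[] args) {
    // Single-field key: the pre-fix code returned "id:1" verbatim; the fix returns "1".
    System.out.println(Arrays.toString(extractRecordKeys("id:1")));  // [1]
    // Complex key: placeholders are mapped back to "" and null.
    System.out.println(Arrays.toString(extractRecordKeys("id:1,name:__empty__,ts:__null__")));  // [1, , null]
  }
}
```

Note that the sketch, like the patch, assumes each field contains a `:` separator; a bare key such as `"1"` would throw `ArrayIndexOutOfBoundsException` at `kvArray[1]`, which is exactly the kind of edge case the requested unit test could pin down.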
[hudi] branch asf-site updated: [DOCS] Update migration_guide.md (#6275)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new 4fc0d427a0  [DOCS] Update migration_guide.md (#6275)
4fc0d427a0 is described below

commit 4fc0d427a00cd650057c0458e3a596dfb1d58e9d
Author: Manu <36392121+x...@users.noreply.github.com>
AuthorDate: Tue Aug 30 13:00:31 2022 +0800

    [DOCS] Update migration_guide.md (#6275)

    Co-authored-by: Y Ethan Guo
---
 website/docs/migration_guide.md        | 42 +-
 .../version-0.11.1/migration_guide.md  | 42 +-
 .../version-0.12.0/migration_guide.md  | 42 +-
 3 files changed, 78 insertions(+), 48 deletions(-)

diff --git a/website/docs/migration_guide.md b/website/docs/migration_guide.md
index e7dd5c29d7..449d65c376 100644
--- a/website/docs/migration_guide.md
+++ b/website/docs/migration_guide.md
@@ -36,8 +36,29 @@ Import your existing table into a Hudi managed table. Since all the data is Hudi
 There are a few options when choosing this approach.

 **Option 1**
-Use the HDFSParquetImporter tool. As the name suggests, this only works if your existing table is in parquet file format.
-This tool essentially starts a Spark Job to read the existing parquet table and converts it into a HUDI managed table by re-writing all the data.
+Use the HoodieDeltaStreamer tool. HoodieDeltaStreamer supports bootstrap with --run-bootstrap command line option. There are two types of bootstrap,
+METADATA_ONLY and FULL_RECORD. METADATA_ONLY will generate just skeleton base files with keys/footers, avoiding full cost of rewriting the dataset.
+FULL_RECORD will perform a full copy/rewrite of the data as a Hudi table.
+
+Here is an example for running FULL_RECORD bootstrap and keeping hive style partition with HoodieDeltaStreamer.
+```
+spark-submit --master local \
+--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+--run-bootstrap \
+--target-base-path /tmp/hoodie/bootstrap_table \
+--target-table bootstrap_table \
+--table-type COPY_ON_WRITE \
+--hoodie-conf hoodie.bootstrap.base.path=/tmp/source_table \
+--hoodie-conf hoodie.datasource.write.recordkey.field=${KEY_FIELD} \
+--hoodie-conf hoodie.datasource.write.partitionpath.field=${PARTITION_FIELD} \
+--hoodie-conf hoodie.datasource.write.precombine.field=${PRECOMBINE_FILED} \
+--hoodie-conf hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.SimpleKeyGenerator \
+--hoodie-conf hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider \
+--hoodie-conf hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector \
+--hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=FULL_RECORD \
+--hoodie-conf hoodie.datasource.write.hive_style_partitioning=true
+```

 **Option 2**
 For huge tables, this could be as simple as :
@@ -50,21 +71,10 @@ for partition in [list of partitions in source table] {

 **Option 3**
 Write your own custom logic of how to load an existing table into a Hudi managed one. Please read about the RDD API
-[here](/docs/quick-start-guide). Using the HDFSParquetImporter Tool. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
+[here](/docs/quick-start-guide). Using the bootstrap run CLI. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
 fired by via `cd hudi-cli && ./hudi-cli.sh`.
 ```java
-hudi->hdfsparquetimport
---upsert false
---srcPath /user/parquet/table/basepath
---targetPath /user/hoodie/table/basepath
---tableName hoodie_table
---tableType COPY_ON_WRITE
---rowKeyField _row_key
---partitionPathField partitionStr
---parallelism 1500
---schemaFilePath /user/table/schema
---format parquet
---sparkMemory 6g
---retry 2
+hudi->bootstrap run --srcPath /tmp/source_table --targetPath /tmp/hoodie/bootstrap_table --tableName bootstrap_table --tableType COPY_ON_WRITE --rowKeyField ${KEY_FIELD} --partitionPathField ${PARTITION_FIELD} --sparkMaster local --hoodieConfigs hoodie.datasource.write.hive_style_partitioning=true --selectorClass org.apache.hudi.client.bootstrap.selector.FullRecordBootstrapModeSelector
 ```
+Unlike deltaStream, FULL_RECORD or METADATA_ONLY is set with --selectorClass, see details with help "bootstrap run".
diff --git a/website/versioned_docs/version-0.11.1/migration_guide.md b/website/versioned_docs/version-0.11.1/migration_guide.md
index e7dd5c29d7..7f5ccf2d9c 100644
--- a/website/versioned_docs/version-0.11.1/migration_guide.md
+++ b/website/versioned_docs/version-0.11.1/migration_guide
[GitHub] [hudi] yihua merged pull request #6275: [DOCS] Update migration_guide.md
yihua merged PR #6275: URL: https://github.com/apache/hudi/pull/6275
[hudi] branch master updated: [HUDI-4327] Fixing flaky deltastreamer test (testCleanerDeleteReplacedDataWithArchive) (#6533)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new a3481efdf0  [HUDI-4327] Fixing flaky deltastreamer test (testCleanerDeleteReplacedDataWithArchive) (#6533)
a3481efdf0 is described below

commit a3481efdf076036b613a4be5de0cf0f9dba3aa96
Author: Sivabalan Narayanan
AuthorDate: Mon Aug 29 21:59:15 2022 -0700

    [HUDI-4327] Fixing flaky deltastreamer test (testCleanerDeleteReplacedDataWithArchive) (#6533)
---
 .../org/apache/hudi/utilities/functional/TestHoodieDeltaStreamer.java | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/TestHoodieDeltaStreamer.java b/hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/TestHoodieDeltaStreamer.java
index 88948b0385..69d6dd7d3b 100644
--- a/hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/TestHoodieDeltaStreamer.java
+++ b/hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/TestHoodieDeltaStreamer.java
@@ -902,6 +902,7 @@ public class TestHoodieDeltaStreamer extends HoodieDeltaStreamerTestBase {
     cfg.configs.addAll(getAsyncServicesConfigs(totalRecords, "false", "true", "2", "", ""));
     cfg.configs.add(String.format("%s=%s", HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT.key(), "0"));
     cfg.configs.add(String.format("%s=%s", HoodieMetadataConfig.COMPACT_NUM_DELTA_COMMITS.key(), "1"));
+    cfg.configs.add(String.format("%s=%s", HoodieWriteConfig.MARKERS_TYPE.key(), "DIRECT"));
     HoodieDeltaStreamer ds = new HoodieDeltaStreamer(cfg, jsc);
     deltaStreamerTestRunner(ds, cfg, (r) -> {
       TestHelpers.assertAtLeastNReplaceCommits(2, tableBasePath, dfs);
@@ -947,13 +948,14 @@ public class TestHoodieDeltaStreamer extends HoodieDeltaStreamerTestBase {
     assertFalse(replacedFilePaths.isEmpty());

     // Step 4 : Insert 1 record and trigger sync/async cleaner and archive.
-    List configs = getAsyncServicesConfigs(1, "true", "true", "2", "", "");
+    List configs = getAsyncServicesConfigs(1, "true", "true", "6", "", "");
     configs.add(String.format("%s=%s", HoodieCleanConfig.CLEANER_POLICY.key(), "KEEP_LATEST_COMMITS"));
     configs.add(String.format("%s=%s", HoodieCleanConfig.CLEANER_COMMITS_RETAINED.key(), "1"));
     configs.add(String.format("%s=%s", HoodieArchivalConfig.MIN_COMMITS_TO_KEEP.key(), "2"));
     configs.add(String.format("%s=%s", HoodieArchivalConfig.MAX_COMMITS_TO_KEEP.key(), "3"));
     configs.add(String.format("%s=%s", HoodieCleanConfig.ASYNC_CLEAN.key(), asyncClean));
     configs.add(String.format("%s=%s", HoodieMetadataConfig.COMPACT_NUM_DELTA_COMMITS.key(), "1"));
+    cfg.configs.add(String.format("%s=%s", HoodieWriteConfig.MARKERS_TYPE.key(), "DIRECT"));
     if (asyncClean) {
       configs.add(String.format("%s=%s", HoodieWriteConfig.WRITE_CONCURRENCY_MODE.key(), WriteConcurrencyMode.OPTIMISTIC_CONCURRENCY_CONTROL.name()));
[GitHub] [hudi] yihua merged pull request #6533: [HUDI-4327] Fixing flaky deltastreamer test (testCleanerDeleteReplacedDataWithArchive)
yihua merged PR #6533: URL: https://github.com/apache/hudi/pull/6533
[GitHub] [hudi] yihua commented on a diff in pull request #6533: [HUDI-4327] Fixing flaky deltastreamer test (testCleanerDeleteReplacedDataWithArchive)
yihua commented on code in PR #6533: URL: https://github.com/apache/hudi/pull/6533#discussion_r958006290

## hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/TestHoodieDeltaStreamer.java:

@@ -902,6 +902,7 @@ public void testCleanerDeleteReplacedDataWithArchive(Boolean asyncClean) throws
     cfg.configs.addAll(getAsyncServicesConfigs(totalRecords, "false", "true", "2", "", ""));
     cfg.configs.add(String.format("%s=%s", HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT.key(), "0"));
     cfg.configs.add(String.format("%s=%s", HoodieMetadataConfig.COMPACT_NUM_DELTA_COMMITS.key(), "1"));
+    cfg.configs.add(String.format("%s=%s", HoodieWriteConfig.MARKERS_TYPE.key(), "DIRECT"));

Review Comment: Do we know why the timeline-server-based markers make the test flaky?
[hudi] branch master updated (71b8174058 -> 7c9ceb6370)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 71b8174058 [HUDI-4340] fix not parsable text DateTimeParseException by addng a method parseDateFromInstantTimeSafely for parsing timestamp when output metrics (#6000) add 7c9ceb6370 [DOCS] Add docs about javax.security.auth.login.LoginException when starting Hudi Sink Connector (#6255) No new revisions were added by this update. Summary of changes: hudi-kafka-connect/README.md | 25 + 1 file changed, 25 insertions(+)
[GitHub] [hudi] yihua merged pull request #6255: [DOCS] Add doc about javax.security.auth.login.LoginException in Hudi KC Sink
yihua merged PR #6255: URL: https://github.com/apache/hudi/pull/6255
[GitHub] [hudi] hudi-bot commented on pull request #6393: [HUDI-4619] Fix The retry mechanism of remotehoodietablefilesystemvie…
hudi-bot commented on PR #6393: URL: https://github.com/apache/hudi/pull/6393#issuecomment-1231138044

## CI report:

* 09f49abeeca229df307426ba79bd77ed0392b79f UNKNOWN
* 0dd2a468fb99ca57ccf6da47dd6baa79b20f7f9d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10978)
* 525791f3450141706470bb1ac39eb6b8716f3dfc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11036)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6393: [HUDI-4619] Fix The retry mechanism of remotehoodietablefilesystemvie…
hudi-bot commented on PR #6393: URL: https://github.com/apache/hudi/pull/6393#issuecomment-1231135357

## CI report:

* 09f49abeeca229df307426ba79bd77ed0392b79f UNKNOWN
* 0dd2a468fb99ca57ccf6da47dd6baa79b20f7f9d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10978)
* 525791f3450141706470bb1ac39eb6b8716f3dfc UNKNOWN

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[hudi] branch asf-site updated: [DOCS] Fix link rendering error in Docker Demo and some other typos (#6083)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new b42a13a277  [DOCS] Fix link rendering error in Docker Demo and some other typos (#6083)
b42a13a277 is described below

commit b42a13a2776991197f241f8792e3c1f74f05b64e
Author: totoro
AuthorDate: Tue Aug 30 12:45:35 2022 +0800

    [DOCS] Fix link rendering error in Docker Demo and some other typos (#6083)
---
 website/docs/docker_demo.md       | 2 +-
 website/docs/quick-start-guide.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/website/docs/docker_demo.md b/website/docs/docker_demo.md
index 48cee5d507..4a390506c3 100644
--- a/website/docs/docker_demo.md
+++ b/website/docs/docker_demo.md
@@ -16,7 +16,7 @@ The steps have been tested on a Mac laptop
 ### Prerequisites

   * Clone the [Hudi repository](https://github.com/apache/hudi) to your local machine.
-  * Docker Setup : For Mac, Please follow the steps as defined in [https://docs.docker.com/v17.12/docker-for-mac/install/]. For running Spark-SQL queries, please ensure atleast 6 GB and 4 CPUs are allocated to Docker (See Docker -> Preferences -> Advanced). Otherwise, spark-SQL queries could be killed because of memory issues.
+  * Docker Setup : For Mac, Please follow the steps as defined in [Install Docker Desktop on Mac](https://docs.docker.com/desktop/install/mac-install/). For running Spark-SQL queries, please ensure atleast 6 GB and 4 CPUs are allocated to Docker (See Docker -> Preferences -> Advanced). Otherwise, spark-SQL queries could be killed because of memory issues.
   * kcat : A command-line utility to publish/consume from kafka topics. Use `brew install kcat` to install kcat.
   * /etc/hosts : The demo references many services running in container by the hostname. Add the following settings to /etc/hosts
diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md
index 145c17f843..eb1f3596d2 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -287,7 +287,7 @@ create table hudi_cow_nonpcf_tbl (
 ) using hudi;

--- create a mor non-partitioned table without preCombineField provided
+-- create a mor non-partitioned table with preCombineField provided
 create table hudi_mor_tbl (
   id int,
   name string,
[GitHub] [hudi] yihua merged pull request #6083: [DOCS] Fix link rendering error in Docker Demo and some other typos
yihua merged PR #6083: URL: https://github.com/apache/hudi/pull/6083
[GitHub] [hudi] yihua commented on pull request #6083: [DOCS] Fix link rendering error in Docker Demo and some other typos
yihua commented on PR #6083: URL: https://github.com/apache/hudi/pull/6083#issuecomment-1231130336

> Hi @yihua, I add a new commit to solve the conflict (seems not work), and there are three commits for this PR,Do I need to squash these commits?

Don't worry about it. I'll do the squash and merge so you don't have to squash the commits.
[jira] [Updated] (HUDI-4483) Fix checkstyle on scala code and integ-test module
[ https://issues.apache.org/jira/browse/HUDI-4483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4483:
-----------------------------
    Fix Version/s: 0.12.1
                   (was: 0.13.0)

> Fix checkstyle on scala code and integ-test module
> --------------------------------------------------
>
>                 Key: HUDI-4483
>                 URL: https://issues.apache.org/jira/browse/HUDI-4483
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: code-quality
>            Reporter: Raymond Xu
>            Assignee: KnightChess
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.12.1
>
> checkstyle does not work on scala code
> see HUDI-4482
> and integration test module
> in GenericRecordFullPayloadGenerator.java
> import com.google.common.annotations.VisibleForTesting;
[jira] [Assigned] (HUDI-4483) Fix checkstyle on scala code and integ-test module
[ https://issues.apache.org/jira/browse/HUDI-4483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu reassigned HUDI-4483:
--------------------------------

    Assignee: KnightChess

> Fix checkstyle on scala code and integ-test module
> --------------------------------------------------
>
>                 Key: HUDI-4483
>                 URL: https://issues.apache.org/jira/browse/HUDI-4483
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: code-quality
>            Reporter: Raymond Xu
>            Assignee: KnightChess
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
> checkstyle does not work on scala code
> see HUDI-4482
> and integration test module
> in GenericRecordFullPayloadGenerator.java
> import com.google.common.annotations.VisibleForTesting;
[jira] [Closed] (HUDI-4483) Fix checkstyle on scala code and integ-test module
[ https://issues.apache.org/jira/browse/HUDI-4483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu closed HUDI-4483.
----------------------------
    Resolution: Fixed

> Fix checkstyle on scala code and integ-test module
> --------------------------------------------------
>
>                 Key: HUDI-4483
>                 URL: https://issues.apache.org/jira/browse/HUDI-4483
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: code-quality
>            Reporter: Raymond Xu
>            Assignee: KnightChess
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.12.1
>
> checkstyle does not work on scala code
> see HUDI-4482
> and integration test module
> in GenericRecordFullPayloadGenerator.java
> import com.google.common.annotations.VisibleForTesting;
[GitHub] [hudi] si1verwind17 closed issue #6526: [SUPPORT] Unable to sync Hudi with hive metastore
si1verwind17 closed issue #6526: [SUPPORT] Unable to sync Hudi with hive metastore URL: https://github.com/apache/hudi/issues/6526
[GitHub] [hudi] si1verwind17 commented on issue #6526: [SUPPORT] Unable to sync Hudi with hive metastore
si1verwind17 commented on issue #6526: URL: https://github.com/apache/hudi/issues/6526#issuecomment-1231125320

I have resolved the error. The problem wasn't on the Hudi side. The error below

`: org.apache.hudi.exception.HoodieException: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool`

was caused by another error from the Hive Metastore, which doesn't recognize the gs:// scheme. So I solved it by putting the shaded GCS connector jar into $HIVE_HOME/lib on the remote Hive Metastore.
[GitHub] [hudi] wuwenchi commented on pull request #6539: [HUDI-4739] Wrong value returned when key's length equals 1
wuwenchi commented on PR #6539: URL: https://github.com/apache/hudi/pull/6539#issuecomment-1231120563 @danny0405 Can you help review it? Thanks!
[GitHub] [hudi] szknb commented on issue #6530: [SUPPORT] org.apache.hudi.exception.HoodieException: Invalid partition name [2020/01/02, 2020/01/01, 2020/01/03]
szknb commented on issue #6530: URL: https://github.com/apache/hudi/issues/6530#issuecomment-1231110006 hudi version: 0.7.0-bd33
[GitHub] [hudi] zhoulii commented on pull request #6083: [DOCS] Fix link rendering error in Docker Demo and some other typos
zhoulii commented on PR #6083: URL: https://github.com/apache/hudi/pull/6083#issuecomment-1231106323

Hi @yihua, I added a new commit to solve the conflict (it seems not to work), and there are three commits for this PR. Do I need to squash these commits?
[GitHub] [hudi] liangyu-1 commented on issue #6529: [SUPPORT] jar conflicts about org.apache.hudi.execution.FlinkLazyInsertIterable.getTransformFunction
liangyu-1 commented on issue #6529: URL: https://github.com/apache/hudi/issues/6529#issuecomment-1231105760

I figured out that I imported both the hudi-flink-client jar and the hudi-flink-bundle jar in my project. hudi-flink-bundle has shaded org.apache.avro but hudi-flink-client didn't, thus there is a conflict.
[GitHub] [hudi] hudi-bot commented on pull request #6537: Avoid update metastore schema if only missing column in input
hudi-bot commented on PR #6537: URL: https://github.com/apache/hudi/pull/6537#issuecomment-1231104651 ## CI report: * 9e63b76454a06d57a141ad4b844752abb346d3fa UNKNOWN * a245595d0c988610d845f6918fe8c5ea76383e92 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11029) * 00b9224ec8c49e83ca51d52351c782083a4fba84 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11030) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6541: [HUDI-4740] Add metadata fields for hive catalog #createTable
hudi-bot commented on PR #6541: URL: https://github.com/apache/hudi/pull/6541#issuecomment-1231102308 ## CI report: * 1cbef90f645fdaaa68383ac186aedefbbf7da58b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11034)
[GitHub] [hudi] hudi-bot commented on pull request #6542: [MINOR] Fix typo in HoodieArchivalConfig
hudi-bot commented on PR #6542: URL: https://github.com/apache/hudi/pull/6542#issuecomment-1231102332 ## CI report: * 93f96405bd8cd6a5486eb0e08e7c08a77214d362 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11035)
[GitHub] [hudi] hudi-bot commented on pull request #6534: [HUDI-4695] Fix flaky TestInlineCompaction#testCompactionRetryOnFailureBasedOnTime
hudi-bot commented on PR #6534: URL: https://github.com/apache/hudi/pull/6534#issuecomment-1231102263 ## CI report: * 1a56cdc2bc53917efb33ff786ff14775dd2b526b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11026)
[GitHub] [hudi] yihua commented on pull request #960: [WIP] [HUDI-307] Adding type test for timestamp,date & decimal
yihua commented on PR #960: URL: https://github.com/apache/hudi/pull/960#issuecomment-1231101317 @arw357 @leesf @vinothchandar @bvaradar this PR has become old :) Do we still need it? @nsivabalan @xushiyan has the current set of tests already covered different types?
[GitHub] [hudi] hudi-bot commented on pull request #6541: [HUDI-4740] Add metadata fields for hive catalog #createTable
hudi-bot commented on PR #6541: URL: https://github.com/apache/hudi/pull/6541#issuecomment-1231099684 ## CI report: * 1cbef90f645fdaaa68383ac186aedefbbf7da58b UNKNOWN
[GitHub] [hudi] yihua commented on pull request #1650: [HUDI-541]: replaced dataFile/df with baseFile/bf throughout code base
yihua commented on PR #1650: URL: https://github.com/apache/hudi/pull/1650#issuecomment-1231100189 @pratyakshsharma do you still plan to land this PR given the code base has changed since April?
[GitHub] [hudi] zhoulii commented on a diff in pull request #6083: [DOCS] Fix link rendering error in Docker Demo and some other typos
zhoulii commented on code in PR #6083: URL: https://github.com/apache/hudi/pull/6083#discussion_r957972065 ## website/docs/configurations.md: ## @@ -3197,7 +3197,7 @@ Configurations that control compaction (merging of log files onto a new base fil --- > hoodie.keep.min.commits -> Similar to hoodie.keep.max.commits, but controls the minimum number ofinstants to retain in the active timeline. +> Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline. Review Comment: @yihua Thanks for reviewing.
[GitHub] [hudi] hudi-bot commented on pull request #6542: [MINOR] Fix typo in HoodieArchivalConfig
hudi-bot commented on PR #6542: URL: https://github.com/apache/hudi/pull/6542#issuecomment-1231099705 ## CI report: * 93f96405bd8cd6a5486eb0e08e7c08a77214d362 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6534: [HUDI-4695] Fix flaky TestInlineCompaction#testCompactionRetryOnFailureBasedOnTime
hudi-bot commented on PR #6534: URL: https://github.com/apache/hudi/pull/6534#issuecomment-1231099635 ## CI report: * 1a56cdc2bc53917efb33ff786ff14775dd2b526b UNKNOWN
[hudi] branch asf-site updated: [DOCS] Clarification to Docker quickstart demo (#6302)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new df4e119bdb [DOCS] Clarification to Docker quickstart demo (#6302) df4e119bdb is described below commit df4e119bdbe946d32b037492d8874452e29bf829 Author: Robin Moffatt AuthorDate: Tue Aug 30 04:30:43 2022 +0100 [DOCS] Clarification to Docker quickstart demo (#6302) Co-authored-by: Y Ethan Guo --- website/docs/docker_demo.md | 42 -- 1 file changed, 24 insertions(+), 18 deletions(-) diff --git a/website/docs/docker_demo.md b/website/docs/docker_demo.md index 7f56129a1c..48cee5d507 100644 --- a/website/docs/docker_demo.md +++ b/website/docs/docker_demo.md @@ -5,15 +5,17 @@ toc: true last_modified_at: 2019-12-30T15:59:57-04:00 --- -## A Demo using docker containers +## A Demo using Docker containers -Lets use a real world example to see how hudi works end to end. For this purpose, a self contained -data infrastructure is brought up in a local docker cluster within your computer. +Let's use a real world example to see how Hudi works end to end. For this purpose, a self contained +data infrastructure is brought up in a local Docker cluster within your computer. It requires the +Hudi repo to have been cloned locally. The steps have been tested on a Mac laptop ### Prerequisites + * Clone the [Hudi repository](https://github.com/apache/hudi) to your local machine. * Docker Setup : For Mac, Please follow the steps as defined in [https://docs.docker.com/v17.12/docker-for-mac/install/]. For running Spark-SQL queries, please ensure atleast 6 GB and 4 CPUs are allocated to Docker (See Docker -> Preferences -> Advanced). Otherwise, spark-SQL queries could be killed because of memory issues. * kcat : A command-line utility to publish/consume from kafka topics. Use `brew install kcat` to install kcat. 
* /etc/hosts : The demo references many services running in container by the hostname. Add the following settings to /etc/hosts @@ -41,7 +43,10 @@ Also, this has not been tested on some environments like Docker on Windows. ### Build Hudi -The first step is to build hudi. **Note** This step builds hudi on default supported scala version - 2.11. +The first step is to build Hudi. **Note** This step builds Hudi on default supported scala version - 2.11. + +NOTE: Make sure you've cloned the [Hudi repository](https://github.com/apache/hudi) first. + ```java cd mvn clean package -Pintegration-tests -DskipTests @@ -49,8 +54,9 @@ mvn clean package -Pintegration-tests -DskipTests ### Bringing up Demo Cluster -The next step is to run the docker compose script and setup configs for bringing up the cluster. -This should pull the docker images from docker hub and setup docker cluster. +The next step is to run the Docker compose script and setup configs for bringing up the cluster. These files are in the [Hudi repository](https://github.com/apache/hudi) which you should already have locally on your machine from the previous steps. + +This should pull the Docker images from Docker hub and setup the Docker cluster. ```java cd docker @@ -112,7 +118,7 @@ Copying spark default config and setting up configs $ docker ps ``` -At this point, the docker cluster will be up and running. The demo cluster brings up the following services +At this point, the Docker cluster will be up and running. The demo cluster brings up the following services * HDFS Services (NameNode, DataNode) * Spark Master and Worker @@ -1317,13 +1323,13 @@ This brings the demo to an end. ## Testing Hudi in Local Docker environment -You can bring up a hadoop docker environment containing Hadoop, Hive and Spark services with support for hudi. +You can bring up a Hadoop Docker environment containing Hadoop, Hive and Spark services with support for Hudi. 
```java $ mvn pre-integration-test -DskipTests ``` -The above command builds docker images for all the services with +The above command builds Docker images for all the services with current Hudi source installed at /var/hoodie/ws and also brings up the services using a compose file. We -currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.4.4) in docker images. +currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.4.4) in Docker images. To bring down the containers ```java @@ -1331,7 +1337,7 @@ $ cd hudi-integ-test $ mvn docker-compose:down ``` -If you want to bring up the docker containers, use +If you want to bring up the Docker containers, use ```java $ cd hudi-integ-test $ mvn docker-compose:up -DdetachedMode=true @@ -1345,21 +1351,21 @@ docker environment (See __hudi-integ-test/src/test/java/org/apache/hudi/integ/IT ### Building Local Docker Containers: -The docker images required for demo and running
[GitHub] [hudi] yihua merged pull request #6302: [DOCS] Clarification to Docker quickstart demo
yihua merged PR #6302: URL: https://github.com/apache/hudi/pull/6302
[hudi] branch asf-site updated: [HUDI-4339] Add example configuration for HoodieCleaner in docs (#6326)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 4c8f222070 [HUDI-4339] Add example configuration for HoodieCleaner in docs (#6326) 4c8f222070 is described below commit 4c8f2220700e738cbff44782c1476f98f55d8f3e Author: Manu <36392121+x...@users.noreply.github.com> AuthorDate: Tue Aug 30 11:30:15 2022 +0800 [HUDI-4339] Add example configuration for HoodieCleaner in docs (#6326) Co-authored-by: Y Ethan Guo --- website/docs/hoodie_cleaner.md | 80 ++ .../version-0.11.1/hoodie_cleaner.md | 63 +++-- .../version-0.12.0/hoodie_cleaner.md | 62 +++-- 3 files changed, 179 insertions(+), 26 deletions(-) diff --git a/website/docs/hoodie_cleaner.md b/website/docs/hoodie_cleaner.md index 10f1aa2450..1687a0e065 100644 --- a/website/docs/hoodie_cleaner.md +++ b/website/docs/hoodie_cleaner.md @@ -14,15 +14,22 @@ each commit, to delete older file slices. It's recommended to leave this enabled When cleaning old files, you should be careful not to remove files that are being actively used by long running queries. Hudi cleaner currently supports the below cleaning policies to keep a certain number of commits or file versions: -- **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of -having lookback into all the changes that happened in the last X commits. Suppose a writer is ingesting data -into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should -retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on -disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy. 
-- **KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N number of file versions irrespective of time. -This policy is useful when it is known how many MAX versions of the file does one want to keep at any given time. -To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations -based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file. +- **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of + having lookback into all the changes that happened in the last X commits. Suppose a writer is ingesting data + into a Hudi dataset every 30 minutes and the longest running query can take 5 hours to finish, then the user should + retain atleast the last 10 commits. With such a configuration, we ensure that the oldest version of a file is kept on + disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy. + Number of commits to retain can be configured by `hoodie.cleaner.commits.retained`. + +- **KEEP_LATEST_FILE_VERSIONS**: This policy has the effect of keeping N number of file versions irrespective of time. + This policy is useful when it is known how many MAX versions of the file does one want to keep at any given time. + To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations + based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain 1 latest version of the file. + Number of file versions to retain can be configured by `hoodie.cleaner.fileversions.retained`. + +- **KEEP_LATEST_BY_HOURS**: This policy clean up based on hours.It is simple and useful when knowing that you want to keep files at any given time. 
+ Corresponding to commits with commit times older than the configured number of hours to be retained are cleaned. + Currently you can configure by parameter `hoodie.cleaner.hours.retained`. ### Configurations For details about all possible configurations and their default values see the [configuration docs](https://hudi.apache.org/docs/configurations#Compaction-Configs). @@ -32,12 +39,52 @@ Hoodie Cleaner can be run as a separate process or along with your data ingestio ingesting data, configs are available which enable you to run it [synchronously or asynchronously](https://hudi.apache.org/docs/configurations#hoodiecleanasync). You can use this command for running the cleaner independently: -```java -[hoodie]$ spark-submit --class org.apache.hudi.utilities.HoodieCleaner \ - --props s3:///temp/hudi-ingestion-config/kafka-source.properties \ - --target-base-path s3:///temp/hudi \ - --spark-master yarn-clus
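The three cleaning policies added to the docs above are driven by plain key-value configs (`hoodie.cleaner.commits.retained`, `hoodie.cleaner.fileversions.retained`, `hoodie.cleaner.hours.retained` are quoted from the commit; the `hoodie.cleaner.policy` key used to select the policy is an assumption). A minimal sketch of how writer properties for the default policy could be assembled:

```java
import java.util.Properties;

public class CleanerConfigSketch {
    // Build a hedged example of cleaner settings for the KEEP_LATEST_COMMITS policy.
    // "hoodie.cleaner.policy" is an assumed key; the retention key is from the doc change.
    public static Properties keepLatestCommits() {
        Properties props = new Properties();
        props.setProperty("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS");
        // Retain the last 10 commits: with one commit every 30 minutes this keeps the
        // oldest file version on disk for ~5 hours, covering the longest running query.
        props.setProperty("hoodie.cleaner.commits.retained", "10");
        return props;
    }
}
```

The same shape applies to the other two policies, swapping in `hoodie.cleaner.fileversions.retained` or `hoodie.cleaner.hours.retained` as the retention knob.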
[GitHub] [hudi] yihua merged pull request #6326: [HUDI-4339] Add example configuration for HoodieCleaner in docs
yihua merged PR #6326: URL: https://github.com/apache/hudi/pull/6326
[GitHub] [hudi] xushiyan commented on a diff in pull request #6534: [HUDI-4695] Fix flaky TestInlineCompaction#testCompactionRetryOnFailureBasedOnTime
xushiyan commented on code in PR #6534: URL: https://github.com/apache/hudi/pull/6534#discussion_r957970551 ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/compact/TestInlineCompaction.java: ## @@ -294,8 +294,9 @@ public void testCompactionRetryOnFailureBasedOnTime() throws Exception { moveCompactionFromRequestedToInflight(instantTime, cfg); } -// When: commit happens after 10s -HoodieWriteConfig inlineCfg = getConfigForInlineCompaction(5, 10, CompactionTriggerStrategy.TIME_ELAPSED); +// When: commit happens after 1000s. assumption is that, there won't be any new compaction getting scheduled within 100s, but the previous failed one will be Review Comment: is this gonna add a lot to the running time?
[GitHub] [hudi] yihua commented on a diff in pull request #6083: [DOCS] Fix link rendering error in Docker Demo and some other typos
yihua commented on code in PR #6083: URL: https://github.com/apache/hudi/pull/6083#discussion_r957970464 ## website/docs/configurations.md: ## @@ -3197,7 +3197,7 @@ Configurations that control compaction (merging of log files onto a new base fil --- > hoodie.keep.min.commits -> Similar to hoodie.keep.max.commits, but controls the minimum number ofinstants to retain in the active timeline. +> Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline. Review Comment: I addressed it in #6542.
[GitHub] [hudi] yihua opened a new pull request, #6542: [MINOR] Fix typo in HoodieArchivalConfig
yihua opened a new pull request, #6542: URL: https://github.com/apache/hudi/pull/6542 ### Change Logs As above. ### Impact **Risk level: none** Only updates to config description. ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] hudi-bot commented on pull request #6536: [HUDI-4736] Fix inflight clean action preventing clean service to continue when multiple cleans are not allowed
hudi-bot commented on PR #6536: URL: https://github.com/apache/hudi/pull/6536#issuecomment-1231096991 ## CI report: * dc3daf9826dea5c5b2c09dec9e2b9b0f08048c16 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11028)
[GitHub] [hudi] hudi-bot commented on pull request #6535: [HUDI-4193] change protoc version so it compiles on m1 mac
hudi-bot commented on PR #6535: URL: https://github.com/apache/hudi/pull/6535#issuecomment-1231096982 ## CI report: * 4744b46d30c8b9cb57161f63996db85bd15b1dca Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11027)
[GitHub] [hudi] xushiyan commented on pull request #6347: [HUDI-4582] Support batch synchronization of partition to hive metastore to avoid timeout with --sync-mode="hms" and use-jdbc=false
xushiyan commented on PR #6347: URL: https://github.com/apache/hudi/pull/6347#issuecomment-1231096401 @honeyaya please also simplify the PR title and add details in the change logs section.
[GitHub] [hudi] xushiyan commented on a diff in pull request #6347: [HUDI-4582] Support batch synchronization of partition to hive metastore to avoid timeout with --sync-mode="hms" and use-jdbc=false
xushiyan commented on code in PR #6347: URL: https://github.com/apache/hudi/pull/6347#discussion_r957968992 ## hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java: ## @@ -69,6 +70,7 @@ public static String getBucketSpec(String bucketCols, int bucketNum) { public HiveSyncConfig(Properties props) { super(props); +validateParameters(); Review Comment: since validation is done in constructor. we don't need to check in JDBCExecutor either right? ## hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java: ## @@ -191,19 +195,27 @@ public void addPartitionsToTable(String tableName, List partitionsToAdd) } LOG.info("Adding partitions " + partitionsToAdd.size() + " to table " + tableName); try { + ValidationUtils.checkArgument(syncConfig.getIntOrDefault(HIVE_BATCH_SYNC_PARTITION_NUM) > 0, + "batch-sync-num for sync hive table must be greater than 0, pls check your parameter"); Review Comment: then this check can be removed? ## hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java: ## @@ -191,19 +195,27 @@ public void addPartitionsToTable(String tableName, List partitionsToAdd) } LOG.info("Adding partitions " + partitionsToAdd.size() + " to table " + tableName); try { + ValidationUtils.checkArgument(syncConfig.getIntOrDefault(HIVE_BATCH_SYNC_PARTITION_NUM) > 0, + "batch-sync-num for sync hive table must be greater than 0, pls check your parameter"); StorageDescriptor sd = client.getTable(databaseName, tableName).getSd(); - List partitionList = partitionsToAdd.stream().map(partition -> { -StorageDescriptor partitionSd = new StorageDescriptor(); -partitionSd.setCols(sd.getCols()); -partitionSd.setInputFormat(sd.getInputFormat()); -partitionSd.setOutputFormat(sd.getOutputFormat()); -partitionSd.setSerdeInfo(sd.getSerdeInfo()); -String fullPartitionPath = FSUtils.getPartitionPath(syncConfig.getString(META_SYNC_BASE_PATH), partition).toString(); -List partitionValues = 
partitionValueExtractor.extractPartitionValuesInPath(partition); -partitionSd.setLocation(fullPartitionPath); -return new Partition(partitionValues, databaseName, tableName, 0, 0, partitionSd, null); - }).collect(Collectors.toList()); - client.add_partitions(partitionList, true, false); + List partitionList = new ArrayList<>(); Review Comment: let's not re-use the same variable. create new var for each batch
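The batching the review discusses, splitting the partition list into chunks of at most `HIVE_BATCH_SYNC_PARTITION_NUM` so each metastore call stays small, and allocating a fresh list per batch as the reviewer asks, can be sketched generically. `toBatches` is an illustrative helper, not Hudi or Hive metastore code:

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionBatcher {
    // Split items into consecutive batches of at most batchSize elements,
    // creating a fresh list for each batch rather than reusing one variable.
    public static <T> List<List<T>> toBatches(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            // Mirrors the validation discussed in the review thread.
            throw new IllegalArgumentException("batch-sync-num must be greater than 0");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            batches.add(new ArrayList<>(items.subList(i, Math.min(i + batchSize, items.size()))));
        }
        return batches;
    }
}
```

Each resulting batch would then be handed to one metastore `add_partitions` call, so a sync of thousands of partitions becomes many small requests instead of one long-running one.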
[GitHub] [hudi] yihua commented on a diff in pull request #6083: [DOCS] Fix link rendering error in Docker Demo and some other typos
yihua commented on code in PR #6083: URL: https://github.com/apache/hudi/pull/6083#discussion_r957967450 ## website/docs/configurations.md: ## @@ -3197,7 +3197,7 @@ Configurations that control compaction (merging of log files onto a new base fil --- > hoodie.keep.min.commits -> Similar to hoodie.keep.max.commits, but controls the minimum number ofinstants to retain in the active timeline. +> Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline. Review Comment: Please refrain from changing the `configurations.md` directly. This is automatically generated and updated based on the Hudi config classes.
[GitHub] [hudi] LinMingQiang commented on a diff in pull request #6393: [HUDI-4619] Fix The retry mechanism of remotehoodietablefilesystemvie…
LinMingQiang commented on code in PR #6393: URL: https://github.com/apache/hudi/pull/6393#discussion_r957966434 ## hudi-timeline-service/src/test/java/org/apache/hudi/timeline/service/functional/TestRemoteHoodieTableFileSystemView.java: ## @@ -64,4 +66,31 @@ protected SyncableFileSystemView getFileSystemView(HoodieTimeline timeline) { view = new RemoteHoodieTableFileSystemView("localhost", server.getServerPort(), metaClient); return view; } + + @Test + public void testRemoteHoodieTableFileSystemViewWithRetry() { +// Service is available. +view.getLatestBaseFiles(); +// Shut down the service. +server.close(); +try { + // Immediately fails and throws a connection refused exception. + view.getLatestBaseFiles(); +} catch (HoodieRemoteException e) { + assert e.getMessage().contains("Connection refused (Connection refused)"); +} +// Enable API request retry for remote file system view. +view = new RemoteHoodieTableFileSystemView(metaClient, FileSystemViewStorageConfig +.newBuilder() +.withRemoteServerHost("localhost") +.withRemoteServerPort(server.getServerPort()) +.withRemoteTimelineClientRetry(true) +.withRemoteTimelineClientMaxRetryNumbers(4) +.build()); +try { + view.getLatestBaseFiles(); Review Comment: > is it no possible to test that retry succeed after 2 or 3 tries? I can create a Thread to restart the service. ## hudi-timeline-service/src/test/java/org/apache/hudi/timeline/service/functional/TestRemoteHoodieTableFileSystemView.java: ## @@ -64,4 +66,31 @@ protected SyncableFileSystemView getFileSystemView(HoodieTimeline timeline) { view = new RemoteHoodieTableFileSystemView("localhost", server.getServerPort(), metaClient); return view; } + + @Test + public void testRemoteHoodieTableFileSystemViewWithRetry() { +// Service is available. +view.getLatestBaseFiles(); +// Shut down the service. +server.close(); +try { + // Immediately fails and throws a connection refused exception. 
+ view.getLatestBaseFiles(); +} catch (HoodieRemoteException e) { + assert e.getMessage().contains("Connection refused (Connection refused)"); +} +// Enable API request retry for remote file system view. +view = new RemoteHoodieTableFileSystemView(metaClient, FileSystemViewStorageConfig +.newBuilder() +.withRemoteServerHost("localhost") +.withRemoteServerPort(server.getServerPort()) +.withRemoteTimelineClientRetry(true) +.withRemoteTimelineClientMaxRetryNumbers(4) +.build()); +try { + view.getLatestBaseFiles(); Review Comment: Done!
[GitHub] [hudi] LinMingQiang commented on a diff in pull request #6393: [HUDI-4619] Fix The retry mechanism of remotehoodietablefilesystemvie…
LinMingQiang commented on code in PR #6393: URL: https://github.com/apache/hudi/pull/6393#discussion_r957966434 ## hudi-timeline-service/src/test/java/org/apache/hudi/timeline/service/functional/TestRemoteHoodieTableFileSystemView.java: ## @@ -64,4 +66,31 @@ protected SyncableFileSystemView getFileSystemView(HoodieTimeline timeline) { view = new RemoteHoodieTableFileSystemView("localhost", server.getServerPort(), metaClient); return view; } + + @Test + public void testRemoteHoodieTableFileSystemViewWithRetry() { +// Service is available. +view.getLatestBaseFiles(); +// Shut down the service. +server.close(); +try { + // Immediately fails and throws a connection refused exception. + view.getLatestBaseFiles(); +} catch (HoodieRemoteException e) { + assert e.getMessage().contains("Connection refused (Connection refused)"); +} +// Enable API request retry for remote file system view. +view = new RemoteHoodieTableFileSystemView(metaClient, FileSystemViewStorageConfig +.newBuilder() +.withRemoteServerHost("localhost") +.withRemoteServerPort(server.getServerPort()) +.withRemoteTimelineClientRetry(true) +.withRemoteTimelineClientMaxRetryNumbers(4) +.build()); +try { + view.getLatestBaseFiles(); Review Comment: > is it no possible to test that retry succeed after 2 or 3 tries? I can start a thread to restart the service.
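The behaviour exercised by the test above, fail immediately when retries are disabled, otherwise re-attempt the request up to a configured maximum and only then surface the last failure, can be sketched as a generic retry loop. `withRetry` is an illustrative helper, not the RemoteHoodieTableFileSystemView or FileSystemViewStorageConfig API:

```java
import java.util.concurrent.Callable;

public class RetrySketch {
    // Invoke call, re-attempting up to maxRetries additional times on failure,
    // pausing pauseMs between attempts; rethrow the last failure if all attempts fail.
    public static <T> T withRetry(Callable<T> call, int maxRetries, long pauseMs) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxRetries) {
                    Thread.sleep(pauseMs);
                }
            }
        }
        throw last;
    }
}
```

This is also the shape the reviewer's suggestion relies on: restart the service from another thread mid-test, and a call that fails on the first one or two attempts succeeds on a later retry instead of exhausting the budget.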
[GitHub] [hudi] xiarixiaoyao commented on issue #6496: [SUPPORT] Hudi schema evolution, Null for oldest values
xiarixiaoyao commented on issue #6496: URL: https://github.com/apache/hudi/issues/6496#issuecomment-1231091867 @Armelabdelkbir if you have this requirement for Spark 3.1.x, please raise a PR and I will fix it as soon as possible
[GitHub] [hudi] yihua commented on a diff in pull request #6302: [DOCS] Clarification to Docker quickstart demo
yihua commented on code in PR #6302: URL: https://github.com/apache/hudi/pull/6302#discussion_r957964460 ## website/docs/docker_demo.md: ## @@ -112,7 +118,7 @@ Copying spark default config and setting up configs $ docker ps ``` -At this point, the docker cluster will be up and running. The demo cluster brings up the following services +At this point, the Dockercluster will be up and running. The demo cluster brings up the following services Review Comment: nit: `Dockercluster` -> `Docker cluster`
[GitHub] [hudi] yihua commented on pull request #6302: [DOCS] Clarification to Docker quickstart demo
yihua commented on PR #6302: URL: https://github.com/apache/hudi/pull/6302#issuecomment-1231090323 @rmoff Thanks for your first contribution!
[jira] [Updated] (HUDI-4740) Add metadata fields for hive catalog #createTable
[ https://issues.apache.org/jira/browse/HUDI-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4740: - Labels: pull-request-available (was: ) > Add metadata fields for hive catalog #createTable > - > > Key: HUDI-4740 > URL: https://issues.apache.org/jira/browse/HUDI-4740 > Project: Apache Hudi > Issue Type: Bug > Components: flink >Affects Versions: 0.12.0 >Reporter: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 0.12.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] danny0405 opened a new pull request, #6541: [HUDI-4740] Add metadata fields for hive catalog #createTable
danny0405 opened a new pull request, #6541: URL: https://github.com/apache/hudi/pull/6541 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ **Risk level: none | low | medium | high** _Choose one. If medium or high, explain what verification was done to mitigate the risks._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Created] (HUDI-4740) Add metadata fields for hive catalog #createTable
Danny Chen created HUDI-4740: Summary: Add metadata fields for hive catalog #createTable Key: HUDI-4740 URL: https://issues.apache.org/jira/browse/HUDI-4740 Project: Apache Hudi Issue Type: Bug Components: flink Affects Versions: 0.12.0 Reporter: Danny Chen Fix For: 0.12.1
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6361: [WIP][HUDI-4690][HUDI-4503] Cleaning up Hudi custom Spark `Rule`s
alexeykudinkin commented on code in PR #6361: URL: https://github.com/apache/hudi/pull/6361#discussion_r957962272 ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala: ## @@ -26,127 +26,163 @@ import org.apache.hudi.hive.HiveSyncConfigHolder import org.apache.hudi.sync.common.HoodieSyncConfig import org.apache.hudi.{AvroConversionUtils, DataSourceWriteOptions, HoodieSparkSqlWriter, SparkAdapterSupport} import org.apache.spark.sql._ -import org.apache.spark.sql.catalyst.TableIdentifier -import org.apache.spark.sql.catalyst.analysis.Resolver import org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable -import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeReference, BoundReference, Cast, EqualTo, Expression, Literal} +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeReference, BoundReference, EqualTo, Expression, Literal, NamedExpression, PredicateHelper} import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.hudi.HoodieSqlCommonUtils._ -import org.apache.spark.sql.hudi.HoodieSqlUtils.getMergeIntoTargetTableId +import org.apache.spark.sql.hudi.analysis.HoodieAnalysis.failAnalysis +import org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.sameNamedExpr import org.apache.spark.sql.hudi.command.payload.ExpressionPayload import org.apache.spark.sql.hudi.command.payload.ExpressionPayload._ import org.apache.spark.sql.hudi.{ProvidesHoodieConfig, SerDeUtils} import org.apache.spark.sql.types.{BooleanType, StructType} import java.util.Base64 - /** - * The Command for hoodie MergeIntoTable. - * The match on condition must contain the row key fields currently, so that we can use Hoodie - * Index to speed up the performance. + * Hudi's implementation of the {@code MERGE INTO} (MIT) Spark SQL statement. 
+ * + * NOTE: That this implementation is restricted in a some aspects to accommodate for Hudi's crucial + * constraint (of requiring every record to bear unique primary-key): merging condition ([[mergeCondition]]) + * is currently can only (and must) reference target table's primary-key columns (this is necessary to + * leverage Hudi's upserting capabilities including Indexes) + * + * Following algorithm is applied: * - * The main algorithm: + * + * Incoming batch ([[sourceTable]]) is reshaped such that it bears correspondingly: + * a) (required) "primary-key" column as well as b) (optional) "pre-combine" column; this is + * required since MIT statements does not restrict [[sourceTable]]s schema to be aligned w/ the + * [[targetTable]]s one, while Hudi's upserting flow expects such columns to be present * - * We pushed down all the matched and not matched (condition, assignment) expression pairs to the - * ExpressionPayload. And the matched (condition, assignment) expression pairs will execute in the - * ExpressionPayload#combineAndGetUpdateValue to compute the result record, while the not matched - * expression pairs will execute in the ExpressionPayload#getInsertValue. + * After reshaping we're writing [[sourceTable]] as a normal batch using Hudi's upserting + * sequence, where special [[ExpressionPayload]] implementation of the [[HoodieRecordPayload]] + * is used allowing us to execute updating, deleting and inserting clauses like following: * - * For Mor table, it is a litter complex than this. The matched record also goes through the getInsertValue - * and write append to the log. So the update actions & insert actions should process by the same - * way. We pushed all the update actions & insert actions together to the - * ExpressionPayload#getInsertValue. + * + * All the matched {@code WHEN MATCHED AND ... 
THEN (DELETE|UPDATE ...)} conditional clauses + * will produce [[(condition, expression)]] tuples that will be executed w/in the + * [[ExpressionPayload#combineAndGetUpdateValue]] against existing (from [[targetTable]]) and + * incoming (from [[sourceTable]]) records producing the updated one; * + * Not matched {@code WHEN NOT MATCHED AND ... THEN INSERT ...} conditional clauses + * will produce [[(condition, expression)]] tuples that will be executed w/in [[ExpressionPayload#getInsertValue]] + * against incoming records producing ones to be inserted into target table; + * + * + * + * TODO explain workflow for MOR tables */ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends HoodieLeafRunnableCommand Review Comment: Deleting custom Spark rules uncovered quite a few issues in this implementation, unfortunately had to essentially re-write it to address these -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.a
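The (condition, assignment) pairing described in the doc comment above can be sketched generically (a minimal illustration only, not Hudi's actual `ExpressionPayload`): each `WHEN MATCHED` clause contributes a predicate plus a transform, the first matching predicate produces the combined record, and a null transform result models `DELETE`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

public class ConditionalClauses {

    // The first clause whose condition matches produces the result record,
    // mirroring combineAndGetUpdateValue; a null transform result means DELETE.
    static Optional<Map<String, Object>> combine(
            List<Map.Entry<Predicate<Map<String, Object>>, UnaryOperator<Map<String, Object>>>> clauses,
            Map<String, Object> record) {
        for (Map.Entry<Predicate<Map<String, Object>>, UnaryOperator<Map<String, Object>>> clause : clauses) {
            if (clause.getKey().test(record)) {
                return Optional.ofNullable(clause.getValue().apply(record));
            }
        }
        return Optional.of(record); // no clause matched: record kept as-is
    }

    public static void main(String[] args) {
        List<Map.Entry<Predicate<Map<String, Object>>, UnaryOperator<Map<String, Object>>>> clauses = new ArrayList<>();
        // One WHEN MATCHED AND _delete THEN DELETE clause, expressed as a pair.
        Predicate<Map<String, Object>> isDelete = r -> Boolean.TRUE.equals(r.get("_delete"));
        UnaryOperator<Map<String, Object>> toDeleted = r -> null;
        clauses.add(Map.entry(isDelete, toDeleted));

        Map<String, Object> doomed = new HashMap<>();
        doomed.put("_delete", true);
        System.out.println(combine(clauses, doomed).isPresent()); // false: record deleted
        System.out.println(combine(clauses, new HashMap<>()).isPresent()); // true: kept
    }
}
```

The same shape serves the not-matched path: insert clauses are another predicate/transform list evaluated against incoming records only, as the doc comment describes for getInsertValue.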
[hudi] branch asf-site updated (d98c2e1949 -> fb9b036bc6)
This is an automated email from the ASF dual-hosted git repository. github-bot pushed a change to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git from d98c2e1949 [DOCS] Fix typo in compaction.md (#6492) add fb9b036bc6 GitHub Actions build asf-site No new revisions were added by this update. Summary of changes: content/404.html | 8 content/404/index.html | 8 content/assets/js/05957343.2f74fb4c.js | 1 - content/assets/js/05957343.bcd954e0.js | 1 + .../assets/js/{10b6d210.6f6cde11.js => 10b6d210.75d4a296.js} | 2 +- content/assets/js/3533dbd1.06c06f54.js | 1 + content/assets/js/3533dbd1.5d7383c4.js | 1 - content/assets/js/44e51e65.64758d42.js | 1 + content/assets/js/44e51e65.976444d3.js | 1 - content/assets/js/81d19844.3c1f9c47.js | 1 - content/assets/js/81d19844.766d0401.js | 1 + content/assets/js/85c8b6c7.40f22d8a.js | 1 + content/assets/js/85c8b6c7.fdfdc22c.js | 1 - content/assets/js/ad132b09.68c24193.js | 1 - content/assets/js/ad132b09.d3b8d8db.js | 1 + .../assets/js/{e2d9a3af.082ecc45.js => e2d9a3af.6b144892.js} | 2 +- content/assets/js/{main.7651a0bd.js => main.ffe146b0.js} | 4 ++-- 7651a0bd.js.LICENSE.txt => main.ffe146b0.js.LICENSE.txt} | 0 .../{runtime~main.a89c7360.js => runtime~main.51ac0c85.js} | 2 +- .../The-Case-for-incremental-processing-on-Hadoop/index.html | 8 content/blog/2016/12/30/strata-talk-2017/index.html | 8 .../index.html | 8 content/blog/2019/01/18/asf-incubation/index.html| 8 content/blog/2019/03/07/batch-vs-incremental/index.html | 8 .../blog/2019/05/14/registering-dataset-to-hive/index.html | 8 .../blog/2019/09/09/ingesting-database-changes/index.html| 8 content/blog/2019/10/22/Hudi-On-Hops/index.html | 8 .../index.html | 8 content/blog/2020/01/15/delete-support-in-hudi/index.html| 8 content/blog/2020/01/20/change-capture-using-aws/index.html | 8 content/blog/2020/03/22/exporting-hudi-datasets/index.html | 8 .../blog/2020/04/27/apache-hudi-apache-zepplin/index.html| 8 
.../05/28/monitoring-hudi-metrics-with-datadog/index.html| 8 .../index.html | 8 .../index.html | 8 .../16/Apache-Hudi-grows-cloud-data-lake-maturity/index.html | 8 content/blog/2020/08/04/PrestoDB-and-Apache-Hudi/index.html | 8 .../18/hudi-incremental-processing-on-data-lakes/index.html | 8 .../efficient-migration-of-large-parquet-tables/index.html | 8 .../2020/08/21/async-compaction-deployment-model/index.html | 8 .../2020/08/22/ingest-multiple-tables-using-hudi/index.html | 8 .../2020/10/06/cdc-solution-using-hudi-by-nclouds/index.html | 8 .../2020/10/15/apache-hudi-meets-apache-flink/index.html | 8 .../2020/10/19/Origins-of-Data-Lake-at-Grofers/index.html| 8 .../2020/10/19/hudi-meets-aws-emr-and-aws-dms/index.html | 8 .../index.html | 8 .../index.html | 8 content/blog/2020/11/11/hudi-indexing-mechanisms/index.html | 8 .../11/29/Can-Big-Data-Solutions-Be-Affordable/index.html| 8 .../index.html | 8 content/blog/2021/01/27/hudi-clustering-intro/index.html | 8 content/blog/2021/02/13/hudi-key-generators/index.html | 8 .../index.html | 8 .../index.html | 8 content/blog/2021/03/01/hudi-file-sizing/index.html | 8 .../index.html | 8 .../New-features-from-Apache-hudi-in-Amazon-EMR/index.html | 8 .../index.html | 8 .../blog/2021/05/12/Experts-primer-on-Apache-Hudi/index.html | 8 .../index.html | 8 .../index.html | 8 .../16/Amazon-Athena-expands-Apache-Hudi-support/index.html | 8 .../index.html
[GitHub] [hudi] yihua merged pull request #6492: [DOCS] Fix typo in compaction.md
yihua merged PR #6492: URL: https://github.com/apache/hudi/pull/6492
[hudi] branch asf-site updated: [DOCS] Fix typo in compaction.md (#6492)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new d98c2e1949 [DOCS] Fix typo in compaction.md (#6492) d98c2e1949 is described below commit d98c2e19493e8b26f082e791b8ca7b88ca38e397 Author: Terry Wang AuthorDate: Tue Aug 30 10:43:27 2022 +0800 [DOCS] Fix typo in compaction.md (#6492) --- website/docs/compaction.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/website/docs/compaction.md b/website/docs/compaction.md index 9d73e31bd5..a6249b7ae7 100644 --- a/website/docs/compaction.md +++ b/website/docs/compaction.md @@ -132,9 +132,9 @@ Offline compaction needs to submit the Flink task on the command line. The progr | Option Name | Required | Default | Remarks | | --- | --- | --- | --- | -| `--path` | `frue` | `--` | The path where the target table is stored on Hudi | +| `--path` | `true` | `--` | The path where the target table is stored on Hudi | | `--compaction-max-memory` | `false` | `100` | The index map size of log data during compaction, 100 MB by default. If you have enough memory, you can turn up this parameter | | `--schedule` | `false` | `false` | whether to execute the operation of scheduling compaction plan. When the write process is still writing, turning on this parameter have a risk of losing data. Therefore, it must be ensured that there are no write tasks currently writing data to this table when this parameter is turned on | | `--seq` | `false` | `LIFO` | The order in which compaction tasks are executed. Executing from the latest compaction plan by default. `LIFO`: executing from the latest plan. `FIFO`: executing from the oldest plan. | | `--service` | `false` | `false` | Whether to start a monitoring service that checks and schedules new compaction task in configured interval. 
| -| `--min-compaction-interval-seconds` | `false` | `600(s)` | The checking interval for service mode, by default 10 minutes. | \ No newline at end of file +| `--min-compaction-interval-seconds` | `false` | `600(s)` | The checking interval for service mode, by default 10 minutes. |
[GitHub] [hudi] yihua commented on pull request #6492: [DOCS] Fix typo in compaction.md
yihua commented on PR #6492: URL: https://github.com/apache/hudi/pull/6492#issuecomment-1231076201 @zjuwangg Thanks for your first contribution!
[hudi] branch asf-site updated: [DOCS][MINOR] Improve spark quick start doc (#6538)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new d9e0a47cb8 [DOCS][MINOR] Improve spark quick start doc (#6538) d9e0a47cb8 is described below commit d9e0a47cb88649dfdac2a13250737a590e50e5eb Author: KnightChess <981159...@qq.com> AuthorDate: Tue Aug 30 10:38:45 2022 +0800 [DOCS][MINOR] Improve spark quick start doc (#6538) --- website/docs/quick-start-guide.md| 12 .../versioned_docs/version-0.10.0/quick-start-guide.md | 12 +--- .../versioned_docs/version-0.10.1/quick-start-guide.md | 16 .../versioned_docs/version-0.11.0/quick-start-guide.md | 12 .../versioned_docs/version-0.11.1/quick-start-guide.md | 12 .../versioned_docs/version-0.12.0/quick-start-guide.md | 12 6 files changed, 53 insertions(+), 23 deletions(-) diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md index 02ebfc74e0..145c17f843 100644 --- a/website/docs/quick-start-guide.md +++ b/website/docs/quick-start-guide.md @@ -67,13 +67,15 @@ spark-shell \ # Spark 3.1 spark-shell \ --packages org.apache.hudi:hudi-spark3.1-bundle_2.12:0.12.0 \ - --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' + --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \ + --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' ``` ```shell # Spark 2.4 spark-shell \ --packages org.apache.hudi:hudi-spark2.4-bundle_2.11:0.12.0 \ - --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' + --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \ + --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' ``` @@ -104,14 +106,16 @@ pyspark \ export PYSPARK_PYTHON=$(which python3) pyspark \ --packages org.apache.hudi:hudi-spark3.1-bundle_2.12:0.12.0 \ ---conf 
'spark.serializer=org.apache.spark.serializer.KryoSerializer' +--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \ +--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' ``` ```shell # Spark 2.4 export PYSPARK_PYTHON=$(which python3) pyspark \ --packages org.apache.hudi:hudi-spark2.4-bundle_2.11:0.12.0 \ ---conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' +--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \ +--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' ``` diff --git a/website/versioned_docs/version-0.10.0/quick-start-guide.md b/website/versioned_docs/version-0.10.0/quick-start-guide.md index e3f38448e9..108b1071cd 100644 --- a/website/versioned_docs/version-0.10.0/quick-start-guide.md +++ b/website/versioned_docs/version-0.10.0/quick-start-guide.md @@ -41,17 +41,20 @@ From the extracted directory run spark-shell with Hudi as: // spark-shell for spark 3 spark-shell \ --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.10.0,org.apache.spark:spark-avro_2.12:3.1.2 \ - --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' + --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \ + --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' // spark-shell for spark 2 with scala 2.12 spark-shell \ --packages org.apache.hudi:hudi-spark-bundle_2.12:0.10.0,org.apache.spark:spark-avro_2.12:2.4.4 \ - --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' + --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \ + --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' // spark-shell for spark 2 with scala 2.11 spark-shell \ --packages org.apache.hudi:hudi-spark-bundle_2.11:0.10.0,org.apache.spark:spark-avro_2.11:2.4.4 \ - --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' + --conf 
'spark.serializer=org.apache.spark.serializer.KryoSerializer' \ + --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' ``` @@ -91,16 +94,19 @@ export PYSPARK_PYTHON=$(which python3) pyspark --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.10.0,org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' +--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' # for spark2 with scala 2.12 pyspark --packages org.apache.hudi:hudi-spark-bundle_2.12:0.10.0,org.apache.spark:spark-avro_2.12:2.4.4 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' +--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' # for spark2 with scala 2.11 pyspark --packages org.apache.hudi:hudi-spark-bundle_2.11:0.10
[GitHub] [hudi] yihua merged pull request #6538: [DOCS][MINOR] Improve spark quick start doc
yihua merged PR #6538: URL: https://github.com/apache/hudi/pull/6538
[hudi] branch asf-site updated: [DOCS] Update Hudi support versions for Redshift Spectrum in the current doc. (#6521)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new d9460c5621 [DOCS] Update Hudi support versions for Redshift Spectrum in the current doc. (#6521) d9460c5621 is described below commit d9460c5621d3dd603ed074f428e7231158a6fb6c Author: pomaster AuthorDate: Mon Aug 29 22:33:26 2022 -0400 [DOCS] Update Hudi support versions for Redshift Spectrum in the current doc. (#6521) Co-authored-by: “pomaster” <“phong”_...@yahoo.com”> --- website/docs/query_engine_setup.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/query_engine_setup.md b/website/docs/query_engine_setup.md index 619ec17ca1..63978797a6 100644 --- a/website/docs/query_engine_setup.md +++ b/website/docs/query_engine_setup.md @@ -92,7 +92,7 @@ to `org.apache.hadoop.hive.ql.io.HiveInputFormat`. Then proceed to query the tab ## Redshift Spectrum -Copy on Write Tables in Apache Hudi versions 0.5.2, 0.6.0, 0.7.0, 0.8.0, 0.9.0, and 0.10.0 can be queried via Amazon Redshift Spectrum external tables. +Copy on Write Tables in Apache Hudi versions 0.5.2, 0.6.0, 0.7.0, 0.8.0, 0.9.0, 0.10.x, 0.11.x and 0.12.0 can be queried via Amazon Redshift Spectrum external tables. :::note Hudi tables are supported only when AWS Glue Data Catalog is used. It's not supported when you use an Apache Hive metastore as the external catalog. :::
[GitHub] [hudi] yihua merged pull request #6521: [DOCS] Update Hudi support versions for Redshift Spectrum in the current doc.
yihua merged PR #6521: URL: https://github.com/apache/hudi/pull/6521
[GitHub] [hudi] yihua commented on pull request #6521: [DOCS] Update Hudi support versions for Redshift Spectrum in the current doc.
yihua commented on PR #6521: URL: https://github.com/apache/hudi/pull/6521#issuecomment-1231071099 @pomaster Thanks for your first contribution!
[GitHub] [hudi] yihua commented on issue #6514: [SUPPORT] Creating Hudi table with SparkSQL fails with FileNotFoundException
yihua commented on issue #6514: URL: https://github.com/apache/hudi/issues/6514#issuecomment-1231070318 @functicons does adding `spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension` work for you?
[jira] [Commented] (HUDI-4340) DeltaStreamer bootstrap failed when metrics on caused by DateTimeParseException: Text '00000000000001999' could not be parsed
[ https://issues.apache.org/jira/browse/HUDI-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597475#comment-17597475 ] Teng Huo commented on HUDI-4340: PR https://github.com/apache/hudi/pull/6000 merged > DeltaStreamer bootstrap failed when metrics on caused by > DateTimeParseException: Text '00000000000001999' could not be parsed > - > > Key: HUDI-4340 > URL: https://issues.apache.org/jira/browse/HUDI-4340 > Project: Apache Hudi > Issue Type: Bug > Components: deltastreamer, metrics >Reporter: Teng Huo >Assignee: Teng Huo >Priority: Blocker > Labels: pull-request-available > Fix For: 0.12.1 > > Attachments: error-deltastreamer.log > > > Found this bug in Hudi integration test ITTestHoodieDemo.java > HoodieTimeline.METADATA_BOOTSTRAP_INSTANT_TS is an invalid value, > "00000000000001", which cannot be parsed by DateTimeFormatter with format > SECS_INSTANT_TIMESTAMP_FORMAT = "yyyyMMddHHmmss" in method > HoodieInstantTimeGenerator.parseDateFromInstantTime > Error code at > org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator.parseDateFromInstantTime(HoodieInstantTimeGenerator.java:96) > https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java#L100
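The fix merged via PR #6000 adds a safe parsing path (the commit summary below mentions a new `parseDateFromInstantTimeSafely` method). Its core idea can be sketched as follows — a simplified stand-in under assumptions, not the exact Hudi implementation: reserved instants such as 00000000000001999 do not fit the `yyyyMMddHHmmss` instant format, so metrics code should parse defensively and skip rather than fail the whole job.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

public class SafeInstantParse {

    // Hudi instant times are yyyyMMddHHmmss, optionally followed by millis.
    private static final DateTimeFormatter SECS_FORMAT =
        DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Parse defensively: reserved bootstrap instants like 00000000000001999
    // are not real timestamps (year 0000, month 00), so return empty instead
    // of throwing, letting callers skip emitting a metric for them.
    static Optional<LocalDateTime> parseDateFromInstantTimeSafely(String instantTime) {
        try {
            return Optional.of(LocalDateTime.parse(instantTime.substring(0, 14), SECS_FORMAT));
        } catch (DateTimeParseException | StringIndexOutOfBoundsException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseDateFromInstantTimeSafely("20220830104327")); // Optional[2022-08-30T10:43:27]
        System.out.println(parseDateFromInstantTimeSafely("00000000000001999")); // Optional.empty
    }
}
```

The original `parseDateFromInstantTime` threw `DateTimeParseException` on such inputs, which is what aborted the DeltaStreamer bootstrap when metrics were enabled.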
[GitHub] [hudi] hudi-bot commented on pull request #6539: [HUDI-4739] Wrong value returned when key's length equals 1
hudi-bot commented on PR #6539: URL: https://github.com/apache/hudi/pull/6539#issuecomment-1231069104 ## CI report: * 822e071498bbe67aaaced421c27cdffb8e9e6584 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11032) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] nsivabalan merged pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime
nsivabalan merged PR #6000: URL: https://github.com/apache/hudi/pull/6000
[GitHub] [hudi] szknb commented on issue #6530: [SUPPORT] org.apache.hudi.exception.HoodieException: Invalid partition name [2020/01/02, 2020/01/01, 2020/01/03]
szknb commented on issue #6530: URL: https://github.com/apache/hudi/issues/6530#issuecomment-1231068629 @nsivabalan

public class HudiExample {
    private static final Logger LOG = LogManager.getLogger(HudiExample.class);
    private static String tableType = HoodieTableType.COPY_ON_WRITE.name();

    public static void main(String[] args) throws Exception {
        String tablePath = "hdfs://haruna/home/xxx/xxx/hudi";
        String tableName = "hudi-test";
        SparkConf sparkConf = HoodieExampleSparkUtils.defaultSparkConf("hoodie-client-example");
        try (JavaSparkContext jsc = new JavaSparkContext(sparkConf)) {
            // Generator of some records to be loaded in.
            HoodieExampleDataGenerator<HoodieAvroPayload> dataGen = new HoodieExampleDataGenerator<>();
            // initialize the table, if not done already
            Path path = new Path(tablePath);
            FileSystem fs = FSUtils.getFs(tablePath, jsc.hadoopConfiguration());
            if (!fs.exists(path)) {
                HoodieTableMetaClient.initTableType(jsc.hadoopConfiguration(), tablePath,
                    new HoodieTableConfig.Builder()
                        .withTableType(HoodieTableType.valueOf(tableType))
                        .withTableName(tableName)
                        .withPayloadClassName(HoodieTableType.valueOf(tableType), HoodieAvroPayload.class.getName())
                        .build());
            }
            // Create the write client to write some records in
            HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
                .withPath(tablePath)
                .withSchema(HoodieExampleDataGenerator.TRIP_EXAMPLE_SCHEMA)
                .withParallelism(2, 2)
                .withDeleteParallelism(2)
                .forTable(tableName)
                .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.BLOOM).build())
                .withCompactionConfig(HoodieCompactionConfig.newBuilder().archiveCommitsWith(20, 30).build())
                .build();
            SparkRDDWriteClient<HoodieAvroPayload> client = new SparkRDDWriteClient<>(new HoodieSparkEngineContext(jsc), cfg);

            // inserts
            String newCommitTime = client.startCommit();
            LOG.info("Starting commit " + newCommitTime);
            List<HoodieRecord<HoodieAvroPayload>> records = dataGen.generateInserts(newCommitTime, 10);
            List<HoodieRecord<HoodieAvroPayload>> recordsSoFar = new ArrayList<>(records);
            JavaRDD<HoodieRecord<HoodieAvroPayload>> writeRecords = jsc.parallelize(records, 1);
            client.upsert(writeRecords, newCommitTime);
            LOG.info("insert finished");
        }
    }
}

the HoodieExampleDataGenerator is: org.apache.hudi.examples.common.HoodieExampleDataGenerator;
[hudi] branch master updated (ac9ce85334 -> 71b8174058)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from ac9ce85334 [HUDI-4483] Fix checkstyle in integ-test module (#6523) add 71b8174058 [HUDI-4340] fix not parsable text DateTimeParseException by addng a method parseDateFromInstantTimeSafely for parsing timestamp when output metrics (#6000) No new revisions were added by this update. Summary of changes: .../apache/hudi/client/BaseHoodieWriteClient.java | 20 +-- .../apache/hudi/client/SparkRDDWriteClient.java| 22 .../table/timeline/HoodieActiveTimeline.java | 41 ++ .../table/timeline/HoodieInstantTimeGenerator.java | 7 +--- .../table/timeline/TestHoodieActiveTimeline.java | 23 ++-- 5 files changed, 76 insertions(+), 37 deletions(-)
[GitHub] [hudi] nsivabalan commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime
nsivabalan commented on PR #6000: URL: https://github.com/apache/hudi/pull/6000#issuecomment-1231068126 Latest CI run succeeded: https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=11009&view=results
[GitHub] [hudi] hudi-bot commented on pull request #6539: [HUDI-4739] Wrong value returned when key's length equals 1
hudi-bot commented on PR #6539: URL: https://github.com/apache/hudi/pull/6539#issuecomment-1231066421 ## CI report: * 822e071498bbe67aaaced421c27cdffb8e9e6584 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6537: Avoid update metastore schema if only missing column in input
hudi-bot commented on PR #6537: URL: https://github.com/apache/hudi/pull/6537#issuecomment-1231066402 ## CI report: * 9e63b76454a06d57a141ad4b844752abb346d3fa UNKNOWN * a245595d0c988610d845f6918fe8c5ea76383e92 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11029) * 00b9224ec8c49e83ca51d52351c782083a4fba84 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11030) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[hudi] branch master updated (c50b6346b5 -> ac9ce85334)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

 from c50b6346b5 [HUDI-4482] remove guava and use caffeine instead for cache (#6240)
  add ac9ce85334 [HUDI-4483] Fix checkstyle in integ-test module (#6523)

No new revisions were added by this update.

Summary of changes:
 hudi-integ-test/pom.xml                             |  1 -
 .../testsuite/HoodieContinousTestSuiteWriter.java   |  2 --
 .../testsuite/HoodieInlineTestSuiteWriter.java      |  8 ---
 .../testsuite/HoodieMultiWriterTestSuiteJob.java    |  3 +--
 .../integ/testsuite/HoodieTestSuiteWriter.java      |  4 ++--
 .../SparkDataSourceContinuousIngestTool.java        |  1 -
 .../testsuite/configuration/DFSDeltaConfig.java     |  2 +-
 .../apache/hudi/integ/testsuite/dag/DagUtils.java   | 28 --
 .../integ/testsuite/dag/nodes/BaseQueryNode.java    |  3 +--
 .../dag/nodes/BaseValidateDatasetNode.java          | 24 +--
 .../integ/testsuite/dag/nodes/HiveQueryNode.java    |  3 +--
 .../integ/testsuite/dag/nodes/HiveSyncNode.java     |  1 -
 .../integ/testsuite/dag/nodes/PrestoQueryNode.java  |  3 +--
 .../integ/testsuite/dag/nodes/TrinoQueryNode.java   |  5 ++--
 .../dag/nodes/ValidateAsyncOperations.java          | 11 ++---
 .../testsuite/dag/scheduler/DagScheduler.java       |  1 -
 .../GenericRecordFullPayloadGenerator.java          |  6 ++---
 .../testsuite/reader/DFSAvroDeltaInputReader.java   |  8 ---
 18 files changed, 41 insertions(+), 73 deletions(-)
[GitHub] [hudi] yihua merged pull request #6523: [HUDI-4483] fix checkstyle in integ-test module
yihua merged PR #6523: URL: https://github.com/apache/hudi/pull/6523
[GitHub] [hudi] yihua commented on pull request #6523: [HUDI-4483] fix checkstyle in integ-test module
yihua commented on PR #6523: URL: https://github.com/apache/hudi/pull/6523#issuecomment-1231066073 CI is green. https://user-images.githubusercontent.com/2497195/187334475-1c57c3a4-a0ae-4e37-986c-7fba2a8e03a2.png
[GitHub] [hudi] hudi-bot commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime
hudi-bot commented on PR #6000: URL: https://github.com/apache/hudi/pull/6000#issuecomment-1231065908 ## CI report: * 6f8e83a20276203550589848ef38953ae3edd5f5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11009) * dab63726e5470be1315bb0194720def2a61ecc14 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Assigned] (HUDI-4730) FIX Batch job cannot clean old commits&data files in clean Function
[ https://issues.apache.org/jira/browse/HUDI-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian Feng reassigned HUDI-4730: --- Assignee: Jian Feng > FIX Batch job cannot clean old commits&data files in clean Function > --- > > Key: HUDI-4730 > URL: https://issues.apache.org/jira/browse/HUDI-4730 > Project: Apache Hudi > Issue Type: Bug > Components: flink >Reporter: Jian Feng >Assignee: Jian Feng >Priority: Major > Labels: pull-request-available > > FIX Batch job cannot clean old commits&data files in clean Function -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] yihua commented on issue #6540: [SUPPORT]KryoException when bulk insert into hudi with flink
yihua commented on issue #6540: URL: https://github.com/apache/hudi/issues/6540#issuecomment-1231065519 @danny0405 another KryoException issue in Flink
[GitHub] [hudi] Zhangshunyu commented on issue #6528: [SUPPORT]How to clean the compacted .log and .hfiles in metadata?
Zhangshunyu commented on issue #6528: URL: https://github.com/apache/hudi/issues/6528#issuecomment-1231065241 @yihua Ok, I see, thank you very much!
[GitHub] [hudi] hudi-bot commented on pull request #6534: [HUDI-4695] Fixing flaky TestInlineCompaction#testCompactionRetryOnFailureBasedOnTime
hudi-bot commented on PR #6534: URL: https://github.com/apache/hudi/pull/6534#issuecomment-1231063650 ## CI report: * 1a56cdc2bc53917efb33ff786ff14775dd2b526b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11026) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-4737) Fix flaky: TestHoodieSparkMergeOnReadTableRollback.testRollbackWithDeltaAndCompactionCommit
[ https://issues.apache.org/jira/browse/HUDI-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597468#comment-17597468 ] xi chaomin commented on HUDI-4737: -- Hi [~shivnarayan], this test may have been fixed by [#5874|https://github.com/apache/hudi/pull/5874]. Judging from the logging time and line numbers, this branch is not on the latest master; shall we merge master and re-run the test? > Fix flaky: > TestHoodieSparkMergeOnReadTableRollback.testRollbackWithDeltaAndCompactionCommit > --- > > Key: HUDI-4737 > URL: https://issues.apache.org/jira/browse/HUDI-4737 > Project: Apache Hudi > Issue Type: Bug >Reporter: sivabalan narayanan >Priority: Major > > [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/9088/logs/21] > > {code:java} > 2022-06-06T07:55:56.8610256Z [ERROR] Tests run: 298, Failures: 1, Errors: 0, > Skipped: 0, Time elapsed: 4,569.528 s <<< FAILURE! - in JUnit Vintage > 2022-06-06T07:55:56.8611489Z [ERROR] boolean).[1] > true(testRollbackWithDeltaAndCompactionCommit Time elapsed: 55.377 s <<< > FAILURE!
> 2022-06-06T07:55:56.8612231Z org.opentest4j.AssertionFailedError: expected: > <0> but was: <1> > 2022-06-06T07:55:56.8612919Z at > org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55) > 2022-06-06T07:55:56.8613677Z at > org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62) > 2022-06-06T07:55:56.8614730Z at > org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:166) > 2022-06-06T07:55:56.8615742Z at > org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:161) > 2022-06-06T07:55:56.8616614Z at > org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:611) > 2022-06-06T07:55:56.8617839Z at > org.apache.hudi.table.functional.TestHoodieSparkMergeOnReadTableRollback.testRollbackWithDeltaAndCompactionCommit(TestHoodieSparkMergeOnReadTableRollback.java:268) > 2022-06-06T07:55:56.8619135Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2022-06-06T07:55:56.8620057Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2022-06-06T07:55:56.8621014Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2022-06-06T07:55:56.8621778Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2022-06-06T07:55:56.8622518Z at > org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688) > 2022-06-06T07:55:56.8623350Z at > org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60) > 2022-06-06T07:55:56.8624441Z at > org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131) > 2022-06-06T07:55:56.8625493Z at > org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149) > 2022-06-06T07:55:56.8626499Z at > org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140) > 2022-06-06T07:55:56.8642788Z at > 
org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:92) > 2022-06-06T07:55:56.8644032Z at > org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115) > 2022-06-06T07:55:56.8645036Z at > org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105) > 2022-06-06T07:55:56.8646046Z at > org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106) > 2022-06-06T07:55:56.8648269Z at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64) > 2022-06-06T07:55:56.8649118Z at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45) > 2022-06-06T07:55:56.8650108Z at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37) > 2022-06-06T07:55:56.8651091Z at > org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104) > 2022-06-06T07:55:56.8651889Z at > org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98) > 2022-06-06T07:55:56.8652809Z at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:212) > 2022-06-06T07:55:56.8653936Z at > org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73) > 2022-06-06T07:55:56.8654845Z at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor
[GitHub] [hudi] yihua commented on issue #6528: [SUPPORT]How to clean the compacted .log and .hfiles in metadata?
yihua commented on issue #6528: URL: https://github.com/apache/hudi/issues/6528#issuecomment-1231062883 The reason you don't see any instantTime + '002' in the timeline is that the clean action does not happen in the metadata table.
[GitHub] [hudi] yihua commented on issue #6528: [SUPPORT]How to clean the compacted .log and .hfiles in metadata?
yihua commented on issue #6528: URL: https://github.com/apache/hudi/issues/6528#issuecomment-1231062546 > Hi @yihua, thanks for your reply, i will try once. BTW, what's the meaning of '002' here in writeClient.clean(instantTime + "002"); i didnt find any instantTime + '002' in timeline `002` is the suffix for the clean instant timestamp. The metadata table writer takes the same timestamp from the deltacommit and adds the suffix of `001` for compaction and `002` for clean to differentiate from the corresponding deltacommit. This is for easy debugging.
[GitHub] [hudi] hbgstc123 opened a new issue, #6540: [SUPPORT]KryoException when bulk insert into hudi with flink
hbgstc123 opened a new issue, #6540: URL: https://github.com/apache/hudi/issues/6540

When bulk insert into hudi with flink, flink job fail with Exception com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException

-- hudi table DDL
CREATE TEMPORARY TABLE table_one (
  imp_date string,
  id bigint,
  name string,
  ts timestamp(3)
) PARTITIONED BY (imp_date) WITH (
  'connector' = 'hudi',
  'path' = ${hdfs_path},
  'write.operation' = 'bulk_insert',
  'table.type' = 'MERGE_ON_READ',
  'hoodie.table.keygenerator.class' = 'org.apache.hudi.keygen.SimpleKeyGenerator',
  'hoodie.datasource.write.recordkey.field' = 'id',
  'write.precombine.field' = 'ts',
  'hive_sync.enable' = 'true',
  'hive_sync.mode' = 'hms',
  'hive_sync.metastore.uris' = 'thrift://...',
  'hive_sync.db' = 'hive_db',
  'hive_sync.table' = 'table_one',
  'hive_sync.partition_fields' = 'imp_date',
  'hive_sync.partition_extractor_class' = 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
  'hoodie.datasource.write.hive_style_partitioning' = 'true',
  'hoodie.metadata.enable' = 'true'
);

-- insert SQL
insert into table_one
select DATE_FORMAT(ts, 'MMdd') || cast(hour(ts) as string) as dt
  ,id
  ,`name`
  ,ts
from source_table;

**Environment Description**
* Hudi version : 0.11 & 0.12
* Flink version : 1.13
* Storage (HDFS/S3/GCS..)
: HDFS

**Stacktrace**

com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
cleaner (org.apache.flink.core.memory.MemorySegment)
segments (org.apache.flink.table.data.binary.BinaryRowData)
    at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:82)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:495)
    at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:577)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:320)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:289)
    at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:577)
    at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:68)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:495)
    at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:505)
    at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.copy(KryoSerializer.java:266)
    at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:69)
    at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:46)
    at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:26)
    at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:50)
    at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:28)
    at org.apache.flink.table.runtime.util.StreamRecordCollector.collect(StreamRecordCollector.java:44)
    at org.apache.hudi.sink.bulk.sort.SortOperator.endInput(SortOperator.java:113)
    at org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.endOperatorInput(StreamOperatorWrapper.java:91)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain.endInput(OperatorChain.java:441)
    at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:69)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:427)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:688)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:643)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:654)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:627)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:782)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
    at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:80)
    at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:488)
    at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:57)
    ... 28 more
[jira] [Updated] (HUDI-4739) Wrong value returned when length equals 1
[ https://issues.apache.org/jira/browse/HUDI-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4739: - Labels: pull-request-available (was: ) > Wrong value returned when length equals 1 > - > > Key: HUDI-4739 > URL: https://issues.apache.org/jira/browse/HUDI-4739 > Project: Apache Hudi > Issue Type: Bug >Reporter: wuwenchi >Priority: Major > Labels: pull-request-available > > In "KeyGenUtils#extractRecordKeys" function, it will return the value > corresponding to the key, but when the length is equal to 1, the key and > value are returned.
[GitHub] [hudi] wuwenchi opened a new pull request, #6539: [HUDI-4739] Wrong value returned when key's length equals 1
wuwenchi opened a new pull request, #6539: URL: https://github.com/apache/hudi/pull/6539 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ **Risk level: none | low | medium | high** _Choose one. If medium or high, explain what verification was done to mitigate the risks._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] microbearz commented on issue #5792: [SUPPORT] Update hudi table(using SparkSQL) failed when the column contains `null` value in other records
microbearz commented on issue #5792: URL: https://github.com/apache/hudi/issues/5792#issuecomment-1231055367 @a0x I tried to reproduce with master branch, and failed at step 3. `Cannot write 'note': NullType is incompatible with StringType;`
[jira] [Created] (HUDI-4739) Wrong value returned when length equals 1
wuwenchi created HUDI-4739: -- Summary: Wrong value returned when length equals 1 Key: HUDI-4739 URL: https://issues.apache.org/jira/browse/HUDI-4739 Project: Apache Hudi Issue Type: Bug Reporter: wuwenchi In "KeyGenUtils#extractRecordKeys" function, it will return the value corresponding to the key, but when the length is equal to 1, the key and value are returned.
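The bug described above can be sketched in isolation. Below is a hedged, standalone Java sketch of the fix proposed in PR #6539: every `key:value` field should yield only the value, even when the record key has a single field. The class name and placeholder constants here are illustrative; only the parsing shape follows the issue description.

```java
import java.util.Arrays;

// Illustrative sketch of the HUDI-4739 fix (not the actual Hudi source):
// a complex record key string like "id:1,name:foo" is split on "," and
// each "key:value" pair is mapped to its value. The bug was that a
// single-field key such as "id:1" was returned unsplit ("id:1" instead
// of "1"); applying the same mapping to every field fixes it.
public class ExtractRecordKeysSketch {
  // Placeholder values are assumptions for this sketch.
  static final String NULL_RECORDKEY_PLACEHOLDER = "__null__";
  static final String EMPTY_RECORDKEY_PLACEHOLDER = "__empty__";

  static String[] extractRecordKeys(String recordKey) {
    return Arrays.stream(recordKey.split(","))
        .map(kv -> {
          String value = kv.split(":")[1];
          if (value.equals(NULL_RECORDKEY_PLACEHOLDER)) {
            return null;
          } else if (value.equals(EMPTY_RECORDKEY_PLACEHOLDER)) {
            return "";
          }
          return value;
        })
        .toArray(String[]::new);
  }

  public static void main(String[] args) {
    // Single-field key now yields the value, not "id:1".
    System.out.println(Arrays.toString(extractRecordKeys("id:1")));          // [1]
    System.out.println(Arrays.toString(extractRecordKeys("id:1,name:foo"))); // [1, foo]
  }
}
```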
[GitHub] [hudi] Zhangshunyu commented on issue #6528: [SUPPORT]How to clean the compacted .log and .hfiles in metadata?
Zhangshunyu commented on issue #6528: URL: https://github.com/apache/hudi/issues/6528#issuecomment-1231054611 Hi @yihua, thanks for your reply, I will try it. BTW, what's the meaning of '002' here in writeClient.clean(instantTime + "002")? I didn't find any instantTime + '002' in the timeline.
[GitHub] [hudi] KnightChess commented on issue #6514: [SUPPORT] Creating Hudi table with SparkSQL fails with FileNotFoundException
KnightChess commented on issue #6514: URL: https://github.com/apache/hudi/issues/6514#issuecomment-1231052745 @functicons all versions need it, I think. The doc description is a bit confusing.
[GitHub] [hudi] yihua commented on issue #6528: [SUPPORT]How to clean the compacted .log and .hfiles in metadata?
yihua commented on issue #6528: URL: https://github.com/apache/hudi/issues/6528#issuecomment-1231051936

@Zhangshunyu as @nsivabalan mentioned, Hudi manages the compaction and cleaning for the metadata table internally, as shown below in the `HoodieBackedTableMetadataWriter` class:

```
protected void cleanIfNecessary(BaseHoodieWriteClient writeClient, String instantTime) {
  Option<HoodieInstant> lastCompletedCompactionInstant = metadataMetaClient.reloadActiveTimeline()
      .getCommitTimeline().filterCompletedInstants().lastInstant();
  if (lastCompletedCompactionInstant.isPresent()
      && metadataMetaClient.getActiveTimeline().filterCompletedInstants()
          .findInstantsAfter(lastCompletedCompactionInstant.get().getTimestamp()).countInstants() < 3) {
    // do not clean the log files immediately after compaction to give some buffer time for metadata table reader,
    // because there is case that the reader has prepared for the log file readers already before the compaction completes
    // while before/during the reading of the log files, the cleaning triggers and delete the reading files,
    // then a FileNotFoundException(for LogFormatReader) or NPE(for HFileReader) would throw.
    // 3 is a value that I think is enough for metadata table reader.
    return;
  }
  // Trigger cleaning with suffixes based on the same instant time. This ensures that any future
  // delta commits synced over will not have an instant time lesser than the last completed instant on the
  // metadata table.
  writeClient.clean(instantTime + "002");
}
```

Also as laid out above, the current logic prevents the cleaning from happening within 3 instants after the compaction in the metadata table. That could be the reason why you don't see cleaning, as `hoodie.metadata.compact.max.delta.commits` is set to 1. Could you try setting `hoodie.metadata.compact.max.delta.commits` to 5 and see if that solves your problem?
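For reference, the suggested change to `hoodie.metadata.compact.max.delta.commits` can be passed like any other writer property. A minimal, hedged Java sketch — the helper class and method are hypothetical; only the property keys come from the discussion above:

```java
import java.util.Properties;

// Hypothetical helper showing the two properties relevant to the
// discussion: metadata table enabled, and the metadata compaction
// threshold raised from 1 to 5 so that cleaning is not perpetually
// deferred by the "within 3 instants after compaction" guard.
public class MetadataCompactionConfigSketch {
  public static Properties buildWriterProps() {
    Properties props = new Properties();
    props.setProperty("hoodie.metadata.enable", "true");
    // Compact the metadata table every 5 delta commits instead of 1.
    props.setProperty("hoodie.metadata.compact.max.delta.commits", "5");
    return props;
  }

  public static void main(String[] args) {
    System.out.println(
        buildWriterProps().getProperty("hoodie.metadata.compact.max.delta.commits"));
  }
}
```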
[GitHub] [hudi] KnightChess opened a new pull request, #6538: [MINOR] improve spark quick start doc
KnightChess opened a new pull request, #6538: URL: https://github.com/apache/hudi/pull/6538 ### Change Logs #6405 #6514: users will not use the extended config when reading the current doc. ### Impact _Describe any public API or user-facing feature change or any performance impact._ **Risk level: none | low | medium | high** _Choose one. If medium or high, explain what verification was done to mitigate the risks._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] dik111 commented on issue #6430: [SUPPORT]Flink SQL can't read complex type data Java client write
dik111 commented on issue #6430: URL: https://github.com/apache/hudi/issues/6430#issuecomment-1231050940 I met the same error in flink-sql-client.