[GitHub] [hudi] danny0405 commented on a diff in pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.
danny0405 commented on code in PR #9223:
URL: https://github.com/apache/hudi/pull/9223#discussion_r1294257025

## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:

@@ -87,14 +88,15 @@ import java.util.LinkedList;
 import java.util.List;
 import java.util.Map;
-import java.util.Objects;
 import java.util.Set;
 import java.util.function.BiFunction;
 import java.util.function.Function;
 import java.util.stream.Collector;
 import java.util.stream.Collectors;
 import java.util.stream.Stream;
+import scala.Tuple3;
+

Review Comment:
   Can we use `org.apache.hudi.common.util.collection.Triple` instead?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
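[Editor's note: if the swap lands, usage would presumably look like this minimal sketch, assuming `Triple` exposes the commons-lang3-style `of`/`getLeft`/`getMiddle`/`getRight` API; the column-stats values are made up for illustration.]

```java
import org.apache.hudi.common.util.collection.Triple;

public class TripleExample {
  public static void main(String[] args) {
    // bundle (column name, min, max) without pulling scala.Tuple3 into hudi-common
    Triple<String, Integer, Integer> colStats = Triple.of("rider_id", 3, 42);
    System.out.println(colStats.getLeft()
        + " min=" + colStats.getMiddle()
        + " max=" + colStats.getRight()); // rider_id min=3 max=42
  }
}
```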
[GitHub] [hudi] danny0405 commented on a diff in pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.
danny0405 commented on code in PR #9223:
URL: https://github.com/apache/hudi/pull/9223#discussion_r1294256449

## hudi-common/pom.xml:

@@ -103,6 +103,13 @@
+
+    <dependency>
+      <groupId>org.scala-lang</groupId>
+      <artifactId>scala-library</artifactId>

Review Comment:
   I don't think we should introduce any scala dependency in the `hudi-common` module.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] codope commented on a diff in pull request #9433: [HUDI-6686] - Handling empty commits after s3 applyFilter api
codope commented on code in PR #9433:
URL: https://github.com/apache/hudi/pull/9433#discussion_r1294252246

## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java:

@@ -157,26 +157,24 @@ public Pair>, String> fetchNextBatch(Option lastChec
     }
     Dataset source = queryRunner.run(queryInfo);

-    if (source.isEmpty()) {
-      LOG.info("Source of file names is empty. Returning empty result and endInstant: "
-          + queryInfo.getEndInstant());
-      return Pair.of(Option.empty(), queryInfo.getEndInstant());
-    }
-
     Dataset filteredSourceData = applyFilter(source, fileFormat);

     LOG.info("Adjusting end checkpoint:" + queryInfo.getEndInstant() + " based on sourceLimit :" + sourceLimit);
-    Pair> checkPointAndDataset =
+    Pair>> checkPointAndDataset =
         IncrSourceHelper.filterAndGenerateCheckpointBasedOnSourceLimit(
             filteredSourceData, sourceLimit, queryInfo, cloudObjectIncrCheckpoint);
+    if (!checkPointAndDataset.getRight().isPresent()) {
+      LOG.info("Empty source, returning endpoint:" + queryInfo.getEndInstant());
+      return Pair.of(Option.empty(), queryInfo.getEndInstant());
+    }
     LOG.info("Adjusted end checkpoint :" + checkPointAndDataset.getLeft());

     String s3FS = getStringWithAltKeys(props, S3_FS_PREFIX, true).toLowerCase();
     String s3Prefix = s3FS + "://";

     // Create S3 paths
     SerializableConfiguration serializableHadoopConf = new SerializableConfiguration(sparkContext.hadoopConfiguration());
-    List cloudObjectMetadata = checkPointAndDataset.getRight()
+    List cloudObjectMetadata = checkPointAndDataset.getRight().get()

Review Comment:
   Can the Option be empty or nullable? Should we check before calling get() on Option?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
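[Editor's note: for context on the review question, the usual guard is shown in this minimal sketch; it assumes Hudi's `org.apache.hudi.common.util.Option`, which mirrors `java.util.Optional` with `isPresent()`/`get()`, and the checkpoint value here is made up.]

```java
import org.apache.hudi.common.util.Option;

public class OptionGuardExample {
  public static void main(String[] args) {
    Option<String> checkpoint = Option.empty(); // e.g. nothing left after filtering

    // guard before get(): get() on an empty Option throws instead of returning null
    if (checkpoint.isPresent()) {
      System.out.println("checkpoint: " + checkpoint.get());
    } else {
      System.out.println("empty source, returning the end instant instead");
    }
  }
}
```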
[GitHub] [hudi] codope opened a new pull request, #9448: [MINOR] Moving to 1.0.0-SNAPSHOT on master branch
codope opened a new pull request, #9448:
URL: https://github.com/apache/hudi/pull/9448

### Change Logs

Changed pom version to `1.0.0-SNAPSHOT`.

### Impact

none

### Risk level (write none, low medium or high below)

none

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_
- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9447: [MINOR] Infer the preCombine field only if the value is not null
hudi-bot commented on PR #9447:
URL: https://github.com/apache/hudi/pull/9447#issuecomment-1678480510

## CI report:

* c181bd4a3fa227cef4ab96457c38d9b207b6a981 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19298)

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9447: [MINOR] Infer the preCombine field only if the value is not null
hudi-bot commented on PR #9447:
URL: https://github.com/apache/hudi/pull/9447#issuecomment-1678475353

## CI report:

* c181bd4a3fa227cef4ab96457c38d9b207b6a981 UNKNOWN

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9437: [HUDI-6689] Add record index validation in MDT validator
hudi-bot commented on PR #9437:
URL: https://github.com/apache/hudi/pull/9437#issuecomment-1678469976

## CI report:

* 0cc0c34422625e63bf9e421d73c22959b7cc9916 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19296)

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] danny0405 opened a new pull request, #9447: [MINOR] Infer the preCombine field only if the value is not null
danny0405 opened a new pull request, #9447:
URL: https://github.com/apache/hudi/pull/9447

### Change Logs

Tables created by Spark may not have the preCombine field set up.

### Impact

none

### Risk level (write none, low medium or high below)

none

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_
- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] ksoullpwk commented on issue #9440: [SUPPORT] Trino cannot read when there is replacecommit metadata
ksoullpwk commented on issue #9440:
URL: https://github.com/apache/hudi/issues/9440#issuecomment-1678460113

Yes, it works. Thanks.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9434: Dummy commit to trigger CI
hudi-bot commented on PR #9434:
URL: https://github.com/apache/hudi/pull/9434#issuecomment-1678442027

## CI report:

* e895bfb27350f497100c3cd50246badcba99f27d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19272) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19273) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19289)
* 1728274eb5640204a88c8f8915fca62f58c1cb6a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19297)

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9434: Dummy commit to trigger CI
hudi-bot commented on PR #9434:
URL: https://github.com/apache/hudi/pull/9434#issuecomment-1678437811

## CI report:

* e895bfb27350f497100c3cd50246badcba99f27d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19272) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19273) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19289)
* 1728274eb5640204a88c8f8915fca62f58c1cb6a UNKNOWN

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #8683: [HUDI-5533] Support spark columns comments
danny0405 commented on code in PR #8683:
URL: https://github.com/apache/hudi/pull/8683#discussion_r1294178078

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/TableOptionProperties.java:

@@ -184,7 +184,9 @@ public static Map translateFlinkTableProperties2Spark(
         partitionKeys,
         sparkVersion,
         4000,
-        messageType);
+        messageType,
+        // flink does not support comment yet
+        Arrays.asList());

Review Comment:
   Collections.emptyList() ?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
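[Editor's note: the suggested change is a small idiom swap; the sketch below shows the difference — standard JDK behavior, nothing Hudi-specific.]

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class EmptyListExample {
  public static void main(String[] args) {
    List<String> viaArrays = Arrays.asList();              // allocates a fresh wrapper on every call
    List<String> viaCollections = Collections.emptyList(); // reuses one immutable, type-safe singleton

    System.out.println(viaArrays.isEmpty());      // true
    System.out.println(viaCollections.isEmpty()); // true
  }
}
```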
[GitHub] [hudi] Riddle4045 commented on issue #9435: [SUPPORT] Trino can't read tables created by Flink Hudi conector
Riddle4045 commented on issue #9435:
URL: https://github.com/apache/hudi/issues/9435#issuecomment-1678406201

> > HMS props for the Hudi table creating using Flink SQL
>
> You are using the Flink Hive catalog, so the tables are actually created by the Hive catalog. Actually, we have a separate Hudi Hive catalog instead; the syntax looks like:
>
> ```sql
> CREATE CATALOG hoodie_catalog
> WITH (
>   'type'='hudi',
>   'catalog.path' = '${catalog root path}',
>   'hive.conf.dir' = '${hive-site.xml dir}',
>   'mode'='hms'
> );
> ```
>
> The error log in JM indicates a missing calcite-core jar, you can fix it by adding it to the classpath.

Thanks, I'll give it a try! @danny0405, in the table definition I specified `connector=hudi`; is that not sufficient?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[hudi] branch asf-site updated: [DOCS] Updated image paths for blogs (#9446)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new b1b1b524bbd [DOCS] Updated image paths for blogs (#9446)
b1b1b524bbd is described below

commit b1b1b524bbde2423520d94c50d0c6a70d8a51e4c
Author: nadine farah
AuthorDate: Mon Aug 14 21:01:52 2023 -0700

    [DOCS] Updated image paths for blogs (#9446)
---
 ...ction-Techniques-and-Approaches-Using-AWS-Glue-by-Job-Target-LLC.mdx | 2 +-
 .../2023-07-21-AWS-Glue-Crawlers-now-supports-Apache-Hudi-Tables.mdx    | 2 +-
 ...Hudi-Revolutionizing-Big-Data-Management-for-Real-Time-Analytics.mdx | 2 +-
 .../blog/2023-08-03-Apache-Hudi-on-AWS-Glue-A-Step-by-Step-Guide.mdx    | 2 +-
 ...a-lake-Table-formats-Apache-Iceberg-vs-Apache-Hudi-vs-Delta-lake.mdx | 2 +-
 ...-09-Lakehouse-Trifecta-Delta-Lake-Apache-Iceberg-and-Apache-Hudi.mdx | 2 +-
 6 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/website/blog/2023-07-20-Backfilling-Apache-Hudi-Tables-in-Production-Techniques-and-Approaches-Using-AWS-Glue-by-Job-Target-LLC.mdx b/website/blog/2023-07-20-Backfilling-Apache-Hudi-Tables-in-Production-Techniques-and-Approaches-Using-AWS-Glue-by-Job-Target-LLC.mdx
index 683a4e22352..2f22e41379e 100644
--- a/website/blog/2023-07-20-Backfilling-Apache-Hudi-Tables-in-Production-Techniques-and-Approaches-Using-AWS-Glue-by-Job-Target-LLC.mdx
+++ b/website/blog/2023-07-20-Backfilling-Apache-Hudi-Tables-in-Production-Techniques-and-Approaches-Using-AWS-Glue-by-Job-Target-LLC.mdx
@@ -3,7 +3,7 @@ title: "Backfilling Apache Hudi Tables in Production: Techniques & Approaches Us
 authors:
 - name: Soumil Shah
 category: blog
-image: /assets/images/2023-07-20-Backfilling-Apache-Hudi-Tables-in-Production-Techniques-and-Approaches-Using-AWS-Glue-by-Job-Target-LLC.png
+image: /assets/images/blog/2023-07-20-Backfilling-Apache-Hudi-Tables-in-Production-Techniques-and-Approaches-Using-AWS-Glue-by-Job-Target-LLC.png
 tags:
 - blog
 - backfilling
diff --git a/website/blog/2023-07-21-AWS-Glue-Crawlers-now-supports-Apache-Hudi-Tables.mdx b/website/blog/2023-07-21-AWS-Glue-Crawlers-now-supports-Apache-Hudi-Tables.mdx
index 0d93d4be701..cb55c854070 100644
--- a/website/blog/2023-07-21-AWS-Glue-Crawlers-now-supports-Apache-Hudi-Tables.mdx
+++ b/website/blog/2023-07-21-AWS-Glue-Crawlers-now-supports-Apache-Hudi-Tables.mdx
@@ -3,7 +3,7 @@ title: "AWS Glue Crawlers now supports Apache Hudi Tables"
 authors:
 - name: AWS Team
 category: blog
-image: /assets/images/2023-07-21-AWS-Glue-Crawlers-now-supports-Apache-Hudi-Tables.png
+image: /assets/images/blog/2023-07-21-AWS-Glue-Crawlers-now-supports-Apache-Hudi-Tables.png
 tags:
 - blog
 - aws glue
diff --git a/website/blog/2023-07-27-Apache-Hudi-Revolutionizing-Big-Data-Management-for-Real-Time-Analytics.mdx b/website/blog/2023-07-27-Apache-Hudi-Revolutionizing-Big-Data-Management-for-Real-Time-Analytics.mdx
index 08224b604e9..1dff86efb9f 100644
--- a/website/blog/2023-07-27-Apache-Hudi-Revolutionizing-Big-Data-Management-for-Real-Time-Analytics.mdx
+++ b/website/blog/2023-07-27-Apache-Hudi-Revolutionizing-Big-Data-Management-for-Real-Time-Analytics.mdx
@@ -3,7 +3,7 @@ title: "Apache Hudi: Revolutionizing Big Data Management for Real-Time Analytics
 authors:
 - name: Dev Jain
 category: blog
-image: /assets/images/2023-07-27-Apache-Hudi-Revolutionizing-Big-Data-Management-for-Real-Time-Analytics.png
+image: /assets/images/blog/2023-07-27-Apache-Hudi-Revolutionizing-Big-Data-Management-for-Real-Time-Analytics.png
 tags:
 - blog
 - medium
diff --git a/website/blog/2023-08-03-Apache-Hudi-on-AWS-Glue-A-Step-by-Step-Guide.mdx b/website/blog/2023-08-03-Apache-Hudi-on-AWS-Glue-A-Step-by-Step-Guide.mdx
index 4f0eab9402d..3a4d895a929 100644
--- a/website/blog/2023-08-03-Apache-Hudi-on-AWS-Glue-A-Step-by-Step-Guide.mdx
+++ b/website/blog/2023-08-03-Apache-Hudi-on-AWS-Glue-A-Step-by-Step-Guide.mdx
@@ -3,7 +3,7 @@ title: "Apache Hudi on AWS Glue: A Step-by-Step Guide"
 authors:
 - name: Dev Jain
 category: blog
-image: /assets/images/2023-08-03-Apache-Hudi-on-AWS-Glue-A-Step-by-Step-Guide.png
+image: /assets/images/blog/2023-08-03-Apache-Hudi-on-AWS-Glue-A-Step-by-Step-Guide.png
 tags:
 - blog
 - medium
diff --git a/website/blog/2023-08-03-Data-lake-Table-formats-Apache-Iceberg-vs-Apache-Hudi-vs-Delta-lake.mdx b/website/blog/2023-08-03-Data-lake-Table-formats-Apache-Iceberg-vs-Apache-Hudi-vs-Delta-lake.mdx
index c7e017e6834..82c53b05179 100644
--- a/website/blog/2023-08-03-Data-lake-Table-formats-Apache-Iceberg-vs-Apache-Hudi-vs-Delta-lake.mdx
+++ b/website/blog/2023-08-03-Data-lake-Table-formats-Apache-Iceberg-vs-Apache-Hudi-vs-Delta-lake.mdx
@@ -3,7 +3,7 @@ title: "Data lake Table formats : Apache Iceberg vs Apache Hudi vs Delta lake"
 authors:
 - name: Shashwat Pandey
 category: blog
-image: /assets/images/2023-08-03-Data-lake-Tabl
[GitHub] [hudi] yihua merged pull request #9446: [DOCS] Updated image paths for blogs
yihua merged PR #9446: URL: https://github.com/apache/hudi/pull/9446 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6654) Add new log block header type to store record positions
[ https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-6654:
----------------------------
    Epic Link: HUDI-6242

> Add new log block header type to store record positions
> --------------------------------------------------------
>
>                 Key: HUDI-6654
>                 URL: https://issues.apache.org/jira/browse/HUDI-6654
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.14.0, 1.0.0
>
> To support position-based merging of base and log files, we need to encode positions in the log blocks so that the positions can be used directly, without having to deserialize records or delete keys for OverwriteWithLatest payload, or with ordering values required only for `DefaultHoodieRecordPayload` supporting event time based streaming. We add a new `HeaderMetadataType` to store the positions in the log block header.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
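[Editor's note: a hypothetical sketch of what "positions in the log block header" could look like; the `RECORD_POSITIONS` enum value and the comma-joined encoding are invented for illustration and are not Hudi's actual wire format.]

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LogBlockHeaderSketch {
  // hypothetical stand-in for Hudi's HeaderMetadataType enum
  enum HeaderMetadataType { INSTANT_TIME, SCHEMA, RECORD_POSITIONS }

  public static void main(String[] args) {
    List<Long> positions = List.of(0L, 3L, 7L); // row positions in the base file touched by this block

    Map<HeaderMetadataType, String> header = new EnumMap<>(HeaderMetadataType.class);
    // stash the positions in the header so readers can merge by position
    // without deserializing the records themselves
    header.put(HeaderMetadataType.RECORD_POSITIONS,
        positions.stream().map(String::valueOf).collect(Collectors.joining(",")));

    System.out.println(header.get(HeaderMetadataType.RECORD_POSITIONS)); // 0,3,7
  }
}
```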
[jira] [Updated] (HUDI-6654) Add new log block header type to store record positions
[ https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-6654:
----------------------------
    Fix Version/s: 1.0.0

> Add new log block header type to store record positions
> --------------------------------------------------------
>
>                 Key: HUDI-6654
>                 URL: https://issues.apache.org/jira/browse/HUDI-6654
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.14.0, 1.0.0
>
> To support position-based merging of base and log files, we need to encode positions in the log blocks so that the positions can be used directly, without having to deserialize records or delete keys for OverwriteWithLatest payload, or with ordering values required only for `DefaultHoodieRecordPayload` supporting event time based streaming. We add a new `HeaderMetadataType` to store the positions in the log block header.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Closed] (HUDI-6654) Add new log block header type to store record positions
[ https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo closed HUDI-6654.
---------------------------
    Resolution: Fixed

> Add new log block header type to store record positions
> --------------------------------------------------------
>
>                 Key: HUDI-6654
>                 URL: https://issues.apache.org/jira/browse/HUDI-6654
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
> To support position-based merging of base and log files, we need to encode positions in the log blocks so that the positions can be used directly, without having to deserialize records or delete keys for OverwriteWithLatest payload, or with ordering values required only for `DefaultHoodieRecordPayload` supporting event time based streaming. We add a new `HeaderMetadataType` to store the positions in the log block header.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[GitHub] [hudi] danny0405 commented on issue #9435: [SUPPORT] Trino can't read tables created by Flink Hudi conector
danny0405 commented on issue #9435:
URL: https://github.com/apache/hudi/issues/9435#issuecomment-1678355424

> HMS props for the Hudi table creating using Flink SQL

You are using the Flink Hive catalog, so the tables are actually created by the Hive catalog. Actually, we have a separate Hudi Hive catalog instead; the syntax looks like:

```sql
CREATE CATALOG hoodie_catalog
WITH (
  'type'='hudi',
  'catalog.path' = '${catalog root path}',
  'hive.conf.dir' = '${hive-site.xml dir}',
  'mode'='hms'
);
```

The error log in JM indicates a missing calcite-core jar, you can fix it by adding it to the classpath.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on issue #8848: [SUPPORT] Hive Sync tool fails to sync Hoodi table written using Flink 1.16 to HMS
danny0405 commented on issue #8848:
URL: https://github.com/apache/hudi/issues/8848#issuecomment-1678352784

In principle, we do not package any Hadoop-related jars into the bundle jar; the classpath of the runtime env should include them.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[hudi] branch asf-site updated: [HUDI-6685] Fix code typo in pyspark 'Insert Overwrite' section of Quick Start Guide. (#9432)
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new b0e57453d3a [HUDI-6685] Fix code typo in pyspark 'Insert Overwrite' section of Quick Start Guide. (#9432)
b0e57453d3a is described below

commit b0e57453d3aa1393838e177cfa15a18217da9629
Author: Amrish Lal
AuthorDate: Mon Aug 14 19:45:54 2023 -0700

    [HUDI-6685] Fix code typo in pyspark 'Insert Overwrite' section of Quick Start Guide. (#9432)
---
 website/docs/quick-start-guide.md                          | 4 ++--
 website/versioned_docs/version-0.12.0/quick-start-guide.md | 4 ++--
 website/versioned_docs/version-0.12.1/quick-start-guide.md | 4 ++--
 website/versioned_docs/version-0.12.2/quick-start-guide.md | 4 ++--
 website/versioned_docs/version-0.12.3/quick-start-guide.md | 4 ++--
 website/versioned_docs/version-0.13.0/quick-start-guide.md | 4 ++--
 website/versioned_docs/version-0.13.1/quick-start-guide.md | 4 ++--
 7 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md
index 4e6a6e55e5c..3cad1cadc3e 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -1573,11 +1573,11 @@ spark.
 ```python
 # pyspark
-self.spark.read.format("hudi"). \
+spark.read.format("hudi"). \
   load(basePath). \
   select(["uuid", "partitionpath"]). \
   sort(["partitionpath", "uuid"]). \
-  show(n=100, truncate=False) \
+  show(n=100, truncate=False)

 inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
 df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)). \
diff --git a/website/versioned_docs/version-0.12.0/quick-start-guide.md b/website/versioned_docs/version-0.12.0/quick-start-guide.md
index 9a18bcf358e..73df9aac567 100644
--- a/website/versioned_docs/version-0.12.0/quick-start-guide.md
+++ b/website/versioned_docs/version-0.12.0/quick-start-guide.md
@@ -1443,11 +1443,11 @@ spark.
 ```python
 # pyspark
-self.spark.read.format("hudi"). \
+spark.read.format("hudi"). \
   load(basePath). \
   select(["uuid", "partitionpath"]). \
   sort(["partitionpath", "uuid"]). \
-  show(n=100, truncate=False) \
+  show(n=100, truncate=False)

 inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
 df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)). \
diff --git a/website/versioned_docs/version-0.12.1/quick-start-guide.md b/website/versioned_docs/version-0.12.1/quick-start-guide.md
index 8f5fc45cd3d..60658958a60 100644
--- a/website/versioned_docs/version-0.12.1/quick-start-guide.md
+++ b/website/versioned_docs/version-0.12.1/quick-start-guide.md
@@ -1443,11 +1443,11 @@ spark.
 ```python
 # pyspark
-self.spark.read.format("hudi"). \
+spark.read.format("hudi"). \
   load(basePath). \
   select(["uuid", "partitionpath"]). \
   sort(["partitionpath", "uuid"]). \
-  show(n=100, truncate=False) \
+  show(n=100, truncate=False)

 inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
 df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)). \
diff --git a/website/versioned_docs/version-0.12.2/quick-start-guide.md b/website/versioned_docs/version-0.12.2/quick-start-guide.md
index e0f3e60554d..0a4eda6cbe0 100644
--- a/website/versioned_docs/version-0.12.2/quick-start-guide.md
+++ b/website/versioned_docs/version-0.12.2/quick-start-guide.md
@@ -1475,11 +1475,11 @@ spark.
 ```python
 # pyspark
-self.spark.read.format("hudi"). \
+spark.read.format("hudi"). \
   load(basePath). \
   select(["uuid", "partitionpath"]). \
   sort(["partitionpath", "uuid"]). \
-  show(n=100, truncate=False) \
+  show(n=100, truncate=False)

 inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
 df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)). \
diff --git a/website/versioned_docs/version-0.12.3/quick-start-guide.md b/website/versioned_docs/version-0.12.3/quick-start-guide.md
index f21a01bd8ac..0df6150d905 100644
--- a/website/versioned_docs/version-0.12.3/quick-start-guide.md
+++ b/website/versioned_docs/version-0.12.3/quick-start-guide.md
@@ -1475,11 +1475,11 @@ spark.
 ```python
 # pyspark
-self.spark.read.format("hudi"). \
+spark.read.format("hudi"). \
   load(basePath). \
   select(["uuid", "partitionpath"]). \
   sort(["partitionpath", "uuid"]). \
-  show(n=100, truncate=False) \
+  show(n=100, truncate=False)

 inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
 df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)). \
diff --git a/website/versioned_docs/version-0.13.0/quick-start-guide.md b/website/versi
[GitHub] [hudi] nsivabalan merged pull request #9432: [HUDI-6685] Fix code typo in pyspark 'Insert Overwrite' section of Quick Start Guide.
nsivabalan merged PR #9432:
URL: https://github.com/apache/hudi/pull/9432

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9437: [HUDI-6689] Add record index validation in MDT validator
hudi-bot commented on PR #9437:
URL: https://github.com/apache/hudi/pull/9437#issuecomment-1678342431

## CI report:

* b25b5402c1e3e14264c6bbfd38910f4b93b8a871 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19291)
* 0cc0c34422625e63bf9e421d73c22959b7cc9916 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19296)

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] yihua commented on a diff in pull request #9437: [HUDI-6689] Add record index validation in MDT validator
yihua commented on code in PR #9437:
URL: https://github.com/apache/hudi/pull/9437#discussion_r1294125375

## hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java:

@@ -741,6 +791,116 @@ private void validateBloomFilters(
     validate(metadataBasedBloomFilters, fsBasedBloomFilters, partitionPath, "bloom filters");
   }

+  private void validateRecordIndex(HoodieSparkEngineContext sparkEngineContext,
+                                   HoodieTableMetaClient metaClient,
+                                   HoodieTableMetadata tableMetadata) {
+    if (cfg.validateRecordIndexContent) {
+      validateRecordIndexContent(sparkEngineContext, metaClient, tableMetadata);
+    } else if (cfg.validateRecordIndexCount) {
+      validateRecordIndexCount(sparkEngineContext, metaClient);
+    }
+  }
+
+  private void validateRecordIndexCount(HoodieSparkEngineContext sparkEngineContext,
+                                        HoodieTableMetaClient metaClient) {
+    String basePath = metaClient.getBasePathV2().toString();
+    long countKeyFromTable = sparkEngineContext.getSqlContext().read().format("hudi")
+        .load(basePath)
+        .select(RECORD_KEY_METADATA_FIELD)
+        .distinct()
+        .count();
+    long countKeyFromRecordIndex = sparkEngineContext.getSqlContext().read().format("hudi")
+        .load(getMetadataTableBasePath(basePath))
+        .select("key")
+        .filter("type = 5")
+        .distinct()

Review Comment:
   The `distinct()` operation is removed.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] boneanxs commented on a diff in pull request #9408: [HUDI-6671] Support 'alter table add partition' sql
boneanxs commented on code in PR #9408:
URL: https://github.com/apache/hudi/pull/9408#discussion_r1294125769

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/AlterHoodieTableAddPartitionCommand.scala:

@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command
+
+import org.apache.hadoop.fs.Path
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.HoodiePartitionMetadata
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline
+import org.apache.spark.sql.{AnalysisException, Row, SparkSession}
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec
+import org.apache.spark.sql.catalyst.catalog.{CatalogTablePartition, HoodieCatalogTable}
+import org.apache.spark.sql.execution.command.DDLUtils
+import org.apache.spark.sql.hudi.HoodieSqlCommonUtils.{makePartitionPath, normalizePartitionSpec}
+
+case class AlterHoodieTableAddPartitionCommand(
+    tableIdentifier: TableIdentifier,
+    partitionSpecsAndLocs: Seq[(TablePartitionSpec, Option[String])],
+    ifNotExists: Boolean)
+  extends HoodieLeafRunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+    val fullTableName = s"${tableIdentifier.database}.${tableIdentifier.table}"
+    logInfo(s"start execute alter table add partition command for $fullTableName")
+
+    val hoodieCatalogTable = HoodieCatalogTable(sparkSession, tableIdentifier)
+
+    if (!hoodieCatalogTable.isPartitionedTable) {
+      throw new AnalysisException(s"$fullTableName is a non-partitioned table that is not allowed to add partition")
+    }
+
+    val catalog = sparkSession.sessionState.catalog
+    val table = hoodieCatalogTable.table
+    DDLUtils.verifyAlterTableType(catalog, table, isView = false)
+
+    val normalizedSpecs: Seq[Map[String, String]] = partitionSpecsAndLocs.map { case (spec, location) =>
+      if (location.isDefined) {
+        throw new AnalysisException(s"Hoodie table does not support specify partition location explicitly")
+      }
+      normalizePartitionSpec(
+        spec,
+        hoodieCatalogTable.partitionFields,
+        hoodieCatalogTable.tableName,
+        sparkSession.sessionState.conf.resolver)
+    }
+
+    val basePath = new Path(hoodieCatalogTable.tableLocation)
+    val fileSystem = hoodieCatalogTable.metaClient.getFs
+    val instantTime = HoodieActiveTimeline.createNewInstantTime
+    val format = hoodieCatalogTable.tableConfig.getPartitionMetafileFormat
+    val (partitionMetadata, parts) = normalizedSpecs.map { spec =>
+      val partitionPath = makePartitionPath(hoodieCatalogTable, spec)
+      val fullPartitionPath: Path = FSUtils.getPartitionPath(basePath, partitionPath)
+      val metadata = if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, fullPartitionPath)) {
+        if (!ifNotExists) {
+          throw new AnalysisException(s"Partition metadata already exists for path: $fullPartitionPath")
+        }
+        None
+      } else Some(new HoodiePartitionMetadata(fileSystem, instantTime, basePath, fullPartitionPath, format))
+      (metadata, CatalogTablePartition(spec, table.storage.copy(locationUri = Some(fullPartitionPath.toUri))))
+    }.unzip
+    partitionMetadata.flatten.foreach(_.trySave(0))
+
+    // Sync new partitions in batch, enable ignoreIfExists to avoid sync failure.
+    val batchSize = sparkSession.sparkContext.conf.getInt("spark.sql.addPartitionInBatch.size", 100)
+    parts.toIterator.grouped(batchSize).foreach { batch =>

Review Comment:
   ping @danny0405, any thoughts on this? I see some commands catch the exception and only print a warning (like `CreateHoodieTableCommand`), and some commands throw the exception out (like `DropHoodieTableCommand`, `CreateHoodieTableAsSelectCommand`). It looks like we don't have a standard for whether to throw exceptions when syncing to HMS fails.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...
[GitHub] [hudi] yihua commented on a diff in pull request #9437: [HUDI-6689] Add record index validation in MDT validator
yihua commented on code in PR #9437:
URL: https://github.com/apache/hudi/pull/9437#discussion_r1294125576

## hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java:

@@ -741,6 +791,116 @@ private void validateBloomFilters(
     validate(metadataBasedBloomFilters, fsBasedBloomFilters, partitionPath, "bloom filters");
   }

+  private void validateRecordIndex(HoodieSparkEngineContext sparkEngineContext,
+                                   HoodieTableMetaClient metaClient,
+                                   HoodieTableMetadata tableMetadata) {
+    if (cfg.validateRecordIndexContent) {
+      validateRecordIndexContent(sparkEngineContext, metaClient, tableMetadata);
+    } else if (cfg.validateRecordIndexCount) {
+      validateRecordIndexCount(sparkEngineContext, metaClient);
+    }
+  }
+
+  private void validateRecordIndexCount(HoodieSparkEngineContext sparkEngineContext,
+                                        HoodieTableMetaClient metaClient) {
+    String basePath = metaClient.getBasePathV2().toString();
+    long countKeyFromTable = sparkEngineContext.getSqlContext().read().format("hudi")
+        .load(basePath)
+        .select(RECORD_KEY_METADATA_FIELD)
+        .distinct()
+        .count();
+    long countKeyFromRecordIndex = sparkEngineContext.getSqlContext().read().format("hudi")
+        .load(getMetadataTableBasePath(basePath))
+        .select("key")
+        .filter("type = 5")
+        .distinct()
+        .count();
+
+    if (countKeyFromTable != countKeyFromRecordIndex) {
+      String message = String.format("Validation of record index count failed: "
+          + "%s entries from record index metadata, %s keys from the data table.",
+          countKeyFromRecordIndex, countKeyFromTable);
+      LOG.error(message);
+      throw new HoodieValidationException(message);
+    } else {
+      LOG.info(String.format(
+          "Validation of record index count succeeded: %s entries.", countKeyFromRecordIndex));
+    }
+  }
+
+  private void validateRecordIndexContent(HoodieSparkEngineContext sparkEngineContext,
+                                          HoodieTableMetaClient metaClient,
+                                          HoodieTableMetadata tableMetadata) {
+    String basePath = metaClient.getBasePathV2().toString();
+    JavaPairRDD<String, Pair<String, String>> keyToLocationOnFsRdd =
+        sparkEngineContext.getSqlContext().read().format("hudi").load(basePath)
+            .select(RECORD_KEY_METADATA_FIELD, PARTITION_PATH_METADATA_FIELD, FILENAME_METADATA_FIELD)
+            .toJavaRDD()
+            .mapToPair(row -> new Tuple2<>(row.getString(row.fieldIndex(RECORD_KEY_METADATA_FIELD)),
+                Pair.of(row.getString(row.fieldIndex(PARTITION_PATH_METADATA_FIELD)),
+                    FSUtils.getFileId(row.getString(row.fieldIndex(FILENAME_METADATA_FIELD))))))
+            .cache();
+
+    JavaPairRDD<String, Pair<String, String>> keyToLocationFromRecordIndexRdd =
+        sparkEngineContext.getSqlContext().read().format("hudi")
+            .load(getMetadataTableBasePath(basePath))
+            .filter("type = 5")
+            .select(functions.col("key"),
+                functions.col("recordIndexMetadata.partitionName").as("partitionName"),
+                functions.col("recordIndexMetadata.fileIdHighBits").as("fileIdHighBits"),
+                functions.col("recordIndexMetadata.fileIdLowBits").as("fileIdLowBits"),
+                functions.col("recordIndexMetadata.fileIndex").as("fileIndex"),
+                functions.col("recordIndexMetadata.fileId").as("fileId"),
+                functions.col("recordIndexMetadata.instantTime").as("instantTime"),
+                functions.col("recordIndexMetadata.fileIdEncoding").as("fileIdEncoding"))
+            .toJavaRDD()
+            .mapToPair(row -> {
+              HoodieRecordGlobalLocation location = HoodieTableMetadataUtil.getLocationFromRecordIndexInfo(
+                  row.getString(row.fieldIndex("partitionName")),
+                  row.getInt(row.fieldIndex("fileIdEncoding")),
+                  row.getLong(row.fieldIndex("fileIdHighBits")),
+                  row.getLong(row.fieldIndex("fileIdLowBits")),
+                  row.getInt(row.fieldIndex("fileIndex")),
+                  row.getString(row.fieldIndex("fileId")),
+                  row.getLong(row.fieldIndex("instantTime")));
+              return new Tuple2<>(row.getString(row.fieldIndex("key")),
+                  Pair.of(location.getPartitionPath(), location.getFileId()));
+            });
+
+    long diffCount = keyToLocationOnFsRdd.fullOuterJoin(keyToLocationFromRecordIndexRdd, cfg.recordIndexParallelism)
+        .map(e -> {
+          Optional<Pair<String, String>> locationOnFs = e._2._1;
+          Optional<Pair<String, String>> locationFromRecordIndex = e._2._2;
+          if (locationOnFs.isPresent() && locationFromRecordIndex.isPresent()) {
+            if (locationOnFs.get().getLeft().equals(locationFromRecordIndex.get().getLeft())
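[Editor's note: the snippet above (truncated in the archive) does a full-outer-join diff between the file-system view and the record index. The sketch below illustrates that technique in plain Spark, with simplified String locations instead of the validator's Pair type and made-up values.]

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.Optional;

import scala.Tuple2;

public class FullOuterJoinDiffExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("diff").setMaster("local[*]");
    try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
      // key -> location as seen on the file system
      JavaPairRDD<String, String> onFs = jsc.parallelizePairs(Arrays.asList(
          new Tuple2<>("k1", "p1/f1"), new Tuple2<>("k2", "p1/f2")));
      // key -> location as recorded in the index
      JavaPairRDD<String, String> fromIndex = jsc.parallelizePairs(Arrays.asList(
          new Tuple2<>("k1", "p1/f1"), new Tuple2<>("k3", "p2/f9")));

      long diffCount = onFs.fullOuterJoin(fromIndex)
          .filter(e -> {
            Optional<String> left = e._2()._1();
            Optional<String> right = e._2()._2();
            // a mismatch is: present on only one side, or locations that disagree
            return !(left.isPresent() && right.isPresent()
                && left.get().equals(right.get()));
          })
          .count();

      System.out.println("entries that disagree: " + diffCount); // prints 2 (k2 and k3)
    }
  }
}
```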
[GitHub] [hudi] codope commented on issue #9440: [SUPPORT] Trino cannot read when there is replacecommit metadata
codope commented on issue #9440:
URL: https://github.com/apache/hudi/issues/9440#issuecomment-1678338350

@ksoullpwk Thanks for the diagnosis. Could you check if this fix helps you? https://github.com/trinodb/trino/pull/18213

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9437: [HUDI-6689] Add record index validation in MDT validator
hudi-bot commented on PR #9437:
URL: https://github.com/apache/hudi/pull/9437#issuecomment-1678336975

## CI report:

* b25b5402c1e3e14264c6bbfd38910f4b93b8a871 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19291)
* 0cc0c34422625e63bf9e421d73c22959b7cc9916 UNKNOWN

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on pull request #9444: [HUDI-6692] Do not allow switching from Primary keyed table to primary key less table
danny0405 commented on PR #9444:
URL: https://github.com/apache/hudi/pull/9444#issuecomment-1678331086

> If a write to a table with a pk was missing the recordkey field in options it would think it was a pkless write. now it fails

I'm confused: if we already know it is a table with a pk, can we just use the field from the table config as the record key by default? Then we should not treat it as a pk-less table.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
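[Editor's note: a minimal sketch of the fallback being proposed — prefer the explicit write option, else the persisted table config. The option key, the `fromTableConfig` stub, and all values are hypothetical stand-ins, not Hudi's actual API.]

```java
import java.util.Map;
import java.util.Optional;

public class RecordKeyFallback {
  // hypothetical stand-in for reading the record key from the write options
  static Optional<String> fromOptions(Map<String, String> opts) {
    return Optional.ofNullable(opts.get("hoodie.datasource.write.recordkey.field"));
  }

  // hypothetical stand-in for the persisted table config of a pk table
  static Optional<String> fromTableConfig() {
    return Optional.of("uuid"); // pretend the table was created with pk = uuid
  }

  public static void main(String[] args) {
    Map<String, String> writeOptions = Map.of(); // caller forgot the record key option

    // prefer the explicit option, else fall back to the table config
    String recordKey = fromOptions(writeOptions)
        .or(RecordKeyFallback::fromTableConfig)
        .orElseThrow(() -> new IllegalArgumentException("pk-less write into a pk table"));

    System.out.println("resolved record key: " + recordKey); // resolved record key: uuid
  }
}
```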
[hudi] branch asf-site updated: [HUDI-6676][DOCS] Add command for CreateHoodieTableLike (#9441)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new f138422fcb9 [HUDI-6676][DOCS] Add command for CreateHoodieTableLike (#9441)
f138422fcb9 is described below

commit f138422fcb94f54d0d0431f81766b64af5a9d519
Author: Rex(Hui) An
AuthorDate: Tue Aug 15 10:08:54 2023 +0800

    [HUDI-6676][DOCS] Add command for CreateHoodieTableLike (#9441)

    Co-authored-by: Hussein Awala
---
 website/docs/quick-start-guide.md | 62 +++
 1 file changed, 62 insertions(+)

diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md
index a23ce275394..4e6a6e55e5c 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -384,6 +384,68 @@ create table hudi_ctas_cow_pt_tbl2 using hudi location 'file:/tmp/hudi/hudi_tbl/
 partitioned by (datestr) as select * from parquet_mngd;
 ```

+**CREATE TABLE LIKE**
+
+The `CREATE TABLE LIKE` statement allows you to create a new Hudi table with the same schema and properties from an existing Hudi/hive table.
+
+:::note
+This feature is available in Apache Hudi for Spark 3 and later versions.
+:::
+
+Examples Create a HUDI table from an existing HUDI table
+
+```sql
+# create a source hudi table
+create table source_hudi (
+  id int,
+  name string,
+  price double,
+  ts long
+) using hudi
+tblproperties (
+  primaryKey = 'id,name',
+  type = 'cow'
+ );
+
+# create a new hudi table based on the source table
+create table target_hudi1
+like source_hudi
+using hudi;
+
+# create a new hudi table based on the source table with override options
+create table target_hudi2
+like source_hudi
+using hudi
+tblproperties (primaryKey = 'id');
+
+# create a new external hudi table based on the source table with location
+create table target_hudi3
+like source_hudi
+using hudi
+location 'file:/tmp/hudi/target_hudi3/';
+```
+
+Examples Create a HUDI table from an existing parquet table
+
+```sql
+# create a source parquet table
+create table source_parquet (
+  id int,
+  name string,
+  price double,
+  ts long
+) using parquet;
+
+# create a new hudi table based on the source table
+create table target_hudi1
+like source_parquet
+using hudi
+tblproperties (
+  primaryKey = 'id,name',
+  type = 'cow'
+);
+```
+
 **Create Table Properties**

 Users can set table properties while creating a hudi table. Critical options are listed here.
[GitHub] [hudi] danny0405 merged pull request #9441: [HUDI-6676][DOCS] Add command for CreateHoodieTableLike
danny0405 merged PR #9441:
URL: https://github.com/apache/hudi/pull/9441

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Closed] (HUDI-6683) Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource
[ https://issues.apache.org/jira/browse/HUDI-6683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen closed HUDI-6683.
----------------------------
    Resolution: Fixed

Fixed via master branch: 4099e1d18b78583d739fdb252f85b58d991d2fb0

> Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource
> -----------------------------------------------------------------------------
>
>                 Key: HUDI-6683
>                 URL: https://issues.apache.org/jira/browse/HUDI-6683
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: deltastreamer
>            Reporter: Danny Chen
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.1.0
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[hudi] branch master updated: [HUDI-6683] Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource (#9403)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 4099e1d18b7 [HUDI-6683] Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource (#9403)
4099e1d18b7 is described below

commit 4099e1d18b78583d739fdb252f85b58d991d2fb0
Author: Prathit malik <53890994+prathi...@users.noreply.github.com>
AuthorDate: Tue Aug 15 07:37:26 2023 +0530

    [HUDI-6683] Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource (#9403)
---
 .../hudi/utilities/schema/KafkaOffsetPostProcessor.java | 6 +-
 .../org/apache/hudi/utilities/sources/JsonKafkaSource.java | 3 +++
 .../apache/hudi/utilities/sources/helpers/AvroConvertor.java | 3 +++
 .../apache/hudi/utilities/sources/TestAvroKafkaSource.java | 11 ++-
 .../apache/hudi/utilities/sources/TestJsonKafkaSource.java | 9 +
 5 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/KafkaOffsetPostProcessor.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/KafkaOffsetPostProcessor.java
index 63473c3bce8..500bb0c7f99 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/KafkaOffsetPostProcessor.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/KafkaOffsetPostProcessor.java
@@ -18,6 +18,7 @@
 package org.apache.hudi.utilities.schema;

+import org.apache.avro.JsonProperties;
 import org.apache.hudi.common.config.ConfigProperty;
 import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.internal.schema.HoodieSchemaException;
@@ -31,6 +32,7 @@ import org.slf4j.LoggerFactory;
 import java.util.List;
 import java.util.stream.Collectors;

+import static org.apache.hudi.avro.AvroSchemaUtils.createNullableSchema;
 import static org.apache.hudi.common.util.ConfigUtils.getBooleanWithAltKeys;

 /**
@@ -54,6 +56,7 @@ public class KafkaOffsetPostProcessor extends SchemaPostProcessor {
   public static final String KAFKA_SOURCE_OFFSET_COLUMN = "_hoodie_kafka_source_offset";
   public static final String KAFKA_SOURCE_PARTITION_COLUMN = "_hoodie_kafka_source_partition";
   public static final String KAFKA_SOURCE_TIMESTAMP_COLUMN = "_hoodie_kafka_source_timestamp";
+  public static final String KAFKA_SOURCE_KEY_COLUMN = "_hoodie_kafka_source_key";

   public KafkaOffsetPostProcessor(TypedProperties props, JavaSparkContext jssc) {
     super(props, jssc);
@@ -61,7 +64,7 @@ public class KafkaOffsetPostProcessor extends SchemaPostProcessor {

   @Override
   public Schema processSchema(Schema schema) {
-    // this method adds kafka offset fields namely source offset, partition and timestamp to the schema of the batch.
+    // this method adds kafka offset fields namely source offset, partition, timestamp and kafka message key to the schema of the batch.
     try {
       List fieldList = schema.getFields();
       List newFieldList = fieldList.stream()
@@ -69,6 +72,7 @@ public class KafkaOffsetPostProcessor extends SchemaPostProcessor {
       newFieldList.add(new Schema.Field(KAFKA_SOURCE_OFFSET_COLUMN, Schema.create(Schema.Type.LONG), "offset column", 0));
       newFieldList.add(new Schema.Field(KAFKA_SOURCE_PARTITION_COLUMN, Schema.create(Schema.Type.INT), "partition column", 0));
       newFieldList.add(new Schema.Field(KAFKA_SOURCE_TIMESTAMP_COLUMN, Schema.create(Schema.Type.LONG), "timestamp column", 0));
+      newFieldList.add(new Schema.Field(KAFKA_SOURCE_KEY_COLUMN, createNullableSchema(Schema.Type.STRING), "kafka key column", JsonProperties.NULL_VALUE));
       Schema newSchema = Schema.createRecord(schema.getName() + "_processed", schema.getDoc(), schema.getNamespace(), false, newFieldList);
       return newSchema;
     } catch (Exception e) {
diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java
index 775bd095fe0..de67dc171a9 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java
@@ -47,6 +47,7 @@ import static org.apache.hudi.common.util.ConfigUtils.getStringWithAltKeys;
 import static org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SOURCE_OFFSET_COLUMN;
 import static org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SOURCE_PARTITION_COLUMN;
 import static org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SOURCE_TIMESTAMP_COLUMN;
+import static org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SOURCE_KEY_COLUMN;

 /**
  * Read json kafka data.
@@ -80,11 +81,13 @@ public class JsonKafkaSource extends KafkaSource {
[GitHub] [hudi] danny0405 merged pull request #9403: [HUDI-6683] Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource
danny0405 merged PR #9403:
URL: https://github.com/apache/hudi/pull/9403

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on pull request #9403: [HUDI-6683] Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource
danny0405 commented on PR #9403:
URL: https://github.com/apache/hudi/pull/9403#issuecomment-1678327812

Thanks for the nice feedback @hussein-awala; maybe you can file a separate PR to address it.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Updated] (HUDI-6683) Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource
[ https://issues.apache.org/jira/browse/HUDI-6683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6683:
---------------------------------
    Fix Version/s: 1.1.0
                       (was: 1.0.0)

> Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource
> -----------------------------------------------------------------------------
>
>                 Key: HUDI-6683
>                 URL: https://issues.apache.org/jira/browse/HUDI-6683
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: deltastreamer
>            Reporter: Danny Chen
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.1.0
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (HUDI-6585) Certify DedupeSparkJob for both table types
[ https://issues.apache.org/jira/browse/HUDI-6585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6585:
---------------------------------
    Fix Version/s: 1.1.0
                   0.15.0
                       (was: 1.0.0)

> Certify DedupeSparkJob for both table types
> --------------------------------------------
>
>                 Key: HUDI-6585
>                 URL: https://issues.apache.org/jira/browse/HUDI-6585
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Sagar Sumit
>            Priority: Major
>             Fix For: 1.1.0, 0.15.0
>
> Hudi has a utility `DedupeSparkJob` which can deduplicate data present in a partition. Need to check if it can dedupe across table for both table types.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (HUDI-6586) Add Incremental scan support to dbt
[ https://issues.apache.org/jira/browse/HUDI-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6586:
---------------------------------
    Fix Version/s: 0.15.0
                   0.14.1

> Add Incremental scan support to dbt
> ------------------------------------
>
>                 Key: HUDI-6586
>                 URL: https://issues.apache.org/jira/browse/HUDI-6586
>             Project: Apache Hudi
>          Issue Type: Epic
>          Components: connectors
>            Reporter: Vinoth Govindarajan
>            Assignee: Vinoth Govindarajan
>            Priority: Major
>             Fix For: 1.0.0, 0.15.0, 0.14.1
>
> The current dbt support adds only the basic hudi primitives, but with deeper integration we could enable faster ETL queries using the incremental read primitive similar to the deltastreamer support.
>
> The goal of this epic is to enable incremental data processing for dbt.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (HUDI-6490) Implement support for applying updates as deletes + inserts
[ https://issues.apache.org/jira/browse/HUDI-6490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754362#comment-17754362 ]

Vinoth Chandar commented on HUDI-6490:
--------------------------------------

[~tim.brown] Do you want to take this work up? This can be done even on the 0.X code line.

> Implement support for applying updates as deletes + inserts
> -------------------------------------------------------------
>
>                 Key: HUDI-6490
>                 URL: https://issues.apache.org/jira/browse/HUDI-6490
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: performance
>            Reporter: Vinoth Chandar
>            Assignee: Timothy Brown
>            Priority: Major
>             Fix For: 1.0.0, 0.15.0, 0.14.1
>
> This needs to happen at the higher layer of writing from Spark/Flink etc. Hudi can already support this, by
> - Logging delete blocks to the old file group.
> - Writing new data blocks/base files to the new file group.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (HUDI-6490) Implement support for applying updates as deletes + inserts
[ https://issues.apache.org/jira/browse/HUDI-6490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-6490: - Fix Version/s: 0.15.0 0.14.1 > Implement support for applying updates as deletes + inserts > --- > > Key: HUDI-6490 > URL: https://issues.apache.org/jira/browse/HUDI-6490 > Project: Apache Hudi > Issue Type: New Feature > Components: performance >Reporter: Vinoth Chandar >Priority: Major > Fix For: 1.0.0, 0.15.0, 0.14.1 > > > This needs to happen at the higher layer of writing from Spark/Flink etc. > Hudi can already support this, by > - Logging delete blocks to the old file group. > - Writing new data blocks/base files to the new file group. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6490) Implement support for applying updates as deletes + inserts
[ https://issues.apache.org/jira/browse/HUDI-6490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar reassigned HUDI-6490: Assignee: Timothy Brown > Implement support for applying updates as deletes + inserts > --- > > Key: HUDI-6490 > URL: https://issues.apache.org/jira/browse/HUDI-6490 > Project: Apache Hudi > Issue Type: New Feature > Components: performance >Reporter: Vinoth Chandar >Assignee: Timothy Brown >Priority: Major > Fix For: 1.0.0, 0.15.0, 0.14.1 > > > This needs to happen at the higher layer of writing from Spark/Flink etc. > Hudi can already support this, by > - Logging delete blocks to the old file group. > - Writing new data blocks/base files to the new file group. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6296) Add Scala 2.13 build profile to support scala 2.13
[ https://issues.apache.org/jira/browse/HUDI-6296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-6296: - Fix Version/s: 1.1.0 (was: 1.0.0) > Add Scala 2.13 build profile to support scala 2.13 > -- > > Key: HUDI-6296 > URL: https://issues.apache.org/jira/browse/HUDI-6296 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Aditya Goenka >Priority: Minor > Fix For: 1.1.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6640) Non-blocking concurrency control
[ https://issues.apache.org/jira/browse/HUDI-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754361#comment-17754361 ] Vinoth Chandar commented on HUDI-6640: -- This is a duplicate of HUDI-5672 > Non-blocking concurrency control > > > Key: HUDI-6640 > URL: https://issues.apache.org/jira/browse/HUDI-6640 > Project: Apache Hudi > Issue Type: Epic > Components: core >Reporter: Danny Chen >Assignee: Jing Zhang >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-1045) Support updates during clustering
[ https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-1045: - Epic Link: HUDI-5672 (was: HUDI-1042) > Support updates during clustering > - > > Key: HUDI-1045 > URL: https://issues.apache.org/jira/browse/HUDI-1045 > Project: Apache Hudi > Issue Type: Task >Reporter: leesf >Assignee: leesf >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5672) Non-blocking multi writer support
[ https://issues.apache.org/jira/browse/HUDI-5672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-5672: - Summary: Non-blocking multi writer support (was: Lockless multi writer support) > Non-blocking multi writer support > - > > Key: HUDI-5672 > URL: https://issues.apache.org/jira/browse/HUDI-5672 > Project: Apache Hudi > Issue Type: Epic >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-1238) [UMBRELLA] Perf test env
[ https://issues.apache.org/jira/browse/HUDI-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-1238: - Fix Version/s: 1.1.0 (was: 1.0.0) > [UMBRELLA] Perf test env > > > Key: HUDI-1238 > URL: https://issues.apache.org/jira/browse/HUDI-1238 > Project: Apache Hudi > Issue Type: Epic > Components: performance, Testing >Reporter: sivabalan narayanan >Assignee: Rajesh Mahindra >Priority: Blocker > Labels: hudi-umbrellas > Fix For: 1.1.0 > > > We need to build a perf test environment which monitors metrics from a long > running test suite and displays them via dashboards. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2519) [UMBRELLA] Seamless meta sync
[ https://issues.apache.org/jira/browse/HUDI-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-2519: - Fix Version/s: 1.0.0 (was: 1.1.0) > [UMBRELLA] Seamless meta sync > - > > Key: HUDI-2519 > URL: https://issues.apache.org/jira/browse/HUDI-2519 > Project: Apache Hudi > Issue Type: Epic > Components: hive >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Labels: hive, hive3, hudi-umbrellas > Fix For: 1.0.0 > > > Hudi to Hive sync is a common use case which enables querying Hudi tables > through other query engines that support the Hive connector, such as Presto and > Trino. Currently, Hudi supports syncing to Hive asynchronously using > run_sync_tool or synchronously through deltastreamer. > The goal of this umbrella JIRA is to improve the current sync mechanism and > support Hive3. Additionally, we need to improve the documentation around > different configs and sync modes. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2519) [UMBRELLA] Seamless meta sync
[ https://issues.apache.org/jira/browse/HUDI-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-2519: - Fix Version/s: 1.1.0 (was: 1.0.0) > [UMBRELLA] Seamless meta sync > - > > Key: HUDI-2519 > URL: https://issues.apache.org/jira/browse/HUDI-2519 > Project: Apache Hudi > Issue Type: Epic > Components: hive >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Labels: hive, hive3, hudi-umbrellas > Fix For: 1.1.0 > > > Hudi to Hive sync is a common use case which enables querying Hudi tables > through other query engines that support the Hive connector, such as Presto and > Trino. Currently, Hudi supports syncing to Hive asynchronously using > run_sync_tool or synchronously through deltastreamer. > The goal of this umbrella JIRA is to improve the current sync mechanism and > support Hive3. Additionally, we need to improve the documentation around > different configs and sync modes. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #9444: [HUDI-6692] Do not allow switching from Primary keyed table to primary key less table
hudi-bot commented on PR #9444: URL: https://github.com/apache/hudi/pull/9444#issuecomment-167822 ## CI report: * c7e99fd19a00469c0e181b6c64b63aa9cfb7ed4e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19292) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9425: When invalidate the table in the spark sql query cache, verify if the…
danny0405 commented on code in PR #9425: URL: https://github.com/apache/hudi/pull/9425#discussion_r1294098528 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala: ## @@ -965,8 +965,9 @@ object HoodieSparkSqlWriter { // we must invalidate this table in the cache so writes are reflected in later queries if (metaSyncEnabled) { getHiveTableNames(hoodieConfig).foreach(name => { -val qualifiedTableName = String.join(".", hoodieConfig.getStringOrDefault(HIVE_DATABASE), name) -if (spark.catalog.tableExists(qualifiedTableName)) { +val syncDb = hoodieConfig.getStringOrDefault(HIVE_DATABASE) +val qualifiedTableName = String.join(".", syncDb, name) Review Comment: Reasonable, should we also take the default database name into consideration? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
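To make the question above concrete: when `HIVE_DATABASE` is unset, one natural fallback is the Spark session's current database. A minimal sketch of that resolution using Spark's catalog API (the object and method names here are illustrative, not the PR's actual change):

```scala
import org.apache.spark.sql.SparkSession

object SyncTargetResolution {
  // Sketch: fall back to the session's current database when no sync database is configured.
  def qualifiedTableName(spark: SparkSession, configuredDb: Option[String], tableName: String): String = {
    val db = configuredDb.filter(_.nonEmpty).getOrElse(spark.catalog.currentDatabase)
    s"$db.$tableName"
  }
}

// Usage: invalidate only when the resolved table is actually registered.
// val qualified = SyncTargetResolution.qualifiedTableName(spark, Option(syncDb), name)
// if (spark.catalog.tableExists(qualified)) spark.catalog.refreshTable(qualified)
```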
[jira] [Updated] (HUDI-2638) Rewrite tests around Hudi index
[ https://issues.apache.org/jira/browse/HUDI-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-2638: - Fix Version/s: (was: 1.0.0) > Rewrite tests around Hudi index > --- > > Key: HUDI-2638 > URL: https://issues.apache.org/jira/browse/HUDI-2638 > Project: Apache Hudi > Issue Type: Task >Reporter: Ethan Guo >Assignee: Raymond Xu >Priority: Major > Fix For: 1.1.0 > > > There is duplicate code between `TestFlinkHoodieBloomIndex` and > `TestHoodieBloomIndex`, among other test classes. We should do one pass to > clean the test code once the refactoring is done. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3121) Spark datasource with bucket index unit test reuse
[ https://issues.apache.org/jira/browse/HUDI-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-3121: - Fix Version/s: 1.1.0 (was: 1.0.0) > Spark datasource with bucket index unit test reuse > -- > > Key: HUDI-3121 > URL: https://issues.apache.org/jira/browse/HUDI-3121 > Project: Apache Hudi > Issue Type: Test > Components: index, tests-ci >Reporter: XiaoyuGeng >Priority: Major > Fix For: 1.1.0 > > > Let `TestMORDataSourceWithBucket` reuse existing unit tests by parameterizing them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2638) Rewrite tests around Hudi index
[ https://issues.apache.org/jira/browse/HUDI-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-2638: - Fix Version/s: 1.1.0 > Rewrite tests around Hudi index > --- > > Key: HUDI-2638 > URL: https://issues.apache.org/jira/browse/HUDI-2638 > Project: Apache Hudi > Issue Type: Task >Reporter: Ethan Guo >Assignee: Raymond Xu >Priority: Major > Fix For: 1.0.0, 1.1.0 > > > There is duplicate code between `TestFlinkHoodieBloomIndex` and > `TestHoodieBloomIndex`, among other test classes. We should do one pass to > clean the test code once the refactoring is done. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-1916) Create a matrix of datatypes across spark, hive, presto, Avro, parquet.
[ https://issues.apache.org/jira/browse/HUDI-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-1916: - Fix Version/s: 1.1.0 (was: 1.0.0) > Create a matrix of datatypes across spark, hive, presto, Avro, parquet. > > > Key: HUDI-1916 > URL: https://issues.apache.org/jira/browse/HUDI-1916 > Project: Apache Hudi > Issue Type: Task > Components: docs >Reporter: sivabalan narayanan >Assignee: Nishith Agarwal >Priority: Major > Fix For: 1.1.0 > > > Create a matrix of datatypes across spark, hive, presto, Avro, parquet. > Follow up with Flink. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2375) Create common SchemaProvider and RecordPayloads for spark, flink etc.
[ https://issues.apache.org/jira/browse/HUDI-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-2375: - Fix Version/s: 1.1.0 (was: 1.0.0) > Create common SchemaProvider and RecordPayloads for spark, flink etc. > - > > Key: HUDI-2375 > URL: https://issues.apache.org/jira/browse/HUDI-2375 > Project: Apache Hudi > Issue Type: Improvement > Components: kafka-connect, writer-core >Reporter: Rajesh Mahindra >Priority: Blocker > Fix For: 1.1.0 > > > Create common SchemaProvider and RecordPayloads for spark, flink etc. > - Currently the class org.apache.hudi.utilities.schema.SchemaProvider takes > a JavaSparkContext as input, and is specific to the Spark engine. So we have > created a separate SchemaProvider for flink. Now for Kafka Connect, we can > use neither, since it's neither Spark nor Flink. Implement a common class that > uses HoodieEngineContext. -- This message was sent by Atlassian Jira (v8.20.10#820010)
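A minimal sketch of the engine-agnostic shape this ticket asks for, assuming only that the provider's constructor takes `TypedProperties` and a `HoodieEngineContext` (class and method names are illustrative, not Hudi's final API):

```scala
import org.apache.avro.Schema
import org.apache.hudi.common.config.TypedProperties
import org.apache.hudi.common.engine.HoodieEngineContext

// No JavaSparkContext in the signature, so Spark, Flink and Kafka Connect can all extend it.
abstract class EngineAgnosticSchemaProvider(val props: TypedProperties,
                                            val context: HoodieEngineContext) {
  def getSourceSchema: Schema
  def getTargetSchema: Schema = getSourceSchema // many providers share one schema for source and target
}
```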
[jira] [Assigned] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management
[ https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar reassigned HUDI-309: --- Assignee: Danny Chen (was: Balaji Varadarajan) > General Redesign of Archived Timeline for efficient scan and management > --- > > Key: HUDI-309 > URL: https://issues.apache.org/jira/browse/HUDI-309 > Project: Apache Hudi > Issue Type: New Feature > Components: Common Core >Reporter: Balaji Varadarajan >Assignee: Danny Chen >Priority: Major > Fix For: 1.0.0 > > Attachments: Archive TImeline Notes by Vinoth 1.jpg, Archived > Timeline Notes by Vinoth 2.jpg > > > As designed by Vinoth: > Goals > # Archived Metadata should be scannable in the same way as data > # Provides more safety by always serving committed data independent of > timeframe when the corresponding commit action was tried. Currently, we > implicitly assume a data file to be valid if its commit time is older than > the earliest time in the active timeline. While this works ok, any inherent > bugs in rollback could inadvertently expose a possibly duplicate file when > its commit timestamp becomes older than that of any commits in the timeline. > # We had to deal with a lot of corner cases because of the way we treat a > "commit" as special after it gets archived. Examples also include Savepoint > handling logic by the cleaner. > # Small Files : For Cloud stores, archiving simply moves files from one > directory to another, causing the archive folder to grow. We need a way to > efficiently compact these files and at the same time be friendly to scans > Design: > The basic file-group abstraction for managing file versions for data files > can be extended to managing archived commit metadata. The idea is to use an > optimal format (like HFile) for storing a compacted version of instant-to-metadata pairs. Every archiving run will read these pairs > from the active timeline and append to indexable log files. We will run periodic > minor compactions to merge multiple log files to a compacted HFile storing > metadata for a time-range. It should also be noted that we will partition by > the action types (commit/clean). This design would allow for the archived > timeline to be queryable for determining whether a given instant is valid or not. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] danny0405 commented on pull request #9199: [HUDI-6534]Support consistent hashing row writer
danny0405 commented on PR #9199: URL: https://github.com/apache/hudi/pull/9199#issuecomment-1678284808 @prashantwason You can cherry pick https://github.com/apache/hudi/pull/9401 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6676) Add command for CreateHoodieTableLike
[ https://issues.apache.org/jira/browse/HUDI-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-6676: - Fix Version/s: 1.0.0 > Add command for CreateHoodieTableLike > - > > Key: HUDI-6676 > URL: https://issues.apache.org/jira/browse/HUDI-6676 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Hui An >Assignee: Hui An >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > 1. Create table from non-hudi table > 2. Create table from hudi table (the properties related to Hudi in the source > Hudi table will be carried over) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6676) Add command for CreateHoodieTableLike
[ https://issues.apache.org/jira/browse/HUDI-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6676. Resolution: Fixed Fixed via master branch: 8220d23be19af4783a9a776dfffa48167975a6a2 > Add command for CreateHoodieTableLike > - > > Key: HUDI-6676 > URL: https://issues.apache.org/jira/browse/HUDI-6676 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Hui An >Assignee: Hui An >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > 1. Create table from non-hudi table > 2. Create table from hudi table (the properties related to Hudi in the source > Hudi table will be carried over) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-3529) Improve dependency management and bundling
[ https://issues.apache.org/jira/browse/HUDI-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-3529: - Fix Version/s: 1.1.0 (was: 1.0.0) > Improve dependency management and bundling > -- > > Key: HUDI-3529 > URL: https://issues.apache.org/jira/browse/HUDI-3529 > Project: Apache Hudi > Issue Type: Epic > Components: dependencies >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Critical > Labels: pull-request-available > Fix For: 1.1.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] danny0405 merged pull request #9412: [HUDI-6676] Add command for CreateHoodieTableLike
danny0405 merged PR #9412: URL: https://github.com/apache/hudi/pull/9412 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated: [HUDI-6676] Add command for CreateHoodieTableLike (#9412)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 8220d23be19 [HUDI-6676] Add command for CreateHoodieTableLike (#9412) 8220d23be19 is described below commit 8220d23be19af4783a9a776dfffa48167975a6a2 Author: Rex(Hui) An AuthorDate: Tue Aug 15 09:02:04 2023 +0800 [HUDI-6676] Add command for CreateHoodieTableLike (#9412) * add command for CreateHoodieTableLike * don't support spark2 --- .../spark/sql/HoodieCatalystPlansUtils.scala | 7 ++ .../org/apache/spark/sql/hudi/SparkAdapter.scala | 8 +- .../apache/spark/sql/hudi/HoodieOptionConfig.scala | 8 ++ .../command/CreateHoodieTableLikeCommand.scala | 110 .../spark/sql/hudi/analysis/HoodieAnalysis.scala | 13 +- .../apache/spark/sql/hudi/TestCreateTable.scala| 139 + .../spark/sql/HoodieSpark2CatalystPlanUtils.scala | 9 ++ .../spark/sql/HoodieSpark3CatalystPlanUtils.scala | 13 +- 8 files changed, 302 insertions(+), 5 deletions(-) diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala index 58789681c54..9cfe23f86cc 100644 --- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala +++ b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala @@ -18,6 +18,7 @@ package org.apache.spark.sql import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.catalyst.catalog.CatalogStorageFormat import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression} import org.apache.spark.sql.catalyst.plans.JoinType import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan} @@ -93,6 +94,12 @@ trait HoodieCatalystPlansUtils { */ def unapplyInsertIntoStatement(plan: LogicalPlan): Option[(LogicalPlan, Map[String, Option[String]], LogicalPlan, Boolean, Boolean)] + /** + * Decomposes [[CreateTableLikeCommand]] into its arguments allowing to accommodate for API + * changes in Spark 3 + */ + def unapplyCreateTableLikeCommand(plan: LogicalPlan): Option[(TableIdentifier, TableIdentifier, CatalogStorageFormat, Option[String], Map[String, String], Boolean)] + /** * Rebases instance of {@code InsertIntoStatement} onto provided instance of {@code targetTable} and {@code query} */ diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala index 041beba95df..1c6111afe47 100644 --- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala +++ b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala @@ -150,11 +150,11 @@ trait SparkAdapter extends Serializable { } def isHoodieTable(map: java.util.Map[String, String]): Boolean = { -map.getOrDefault("provider", "").equals("hudi") +isHoodieTable(map.getOrDefault("provider", "")) } def isHoodieTable(table: CatalogTable): Boolean = { -table.provider.map(_.toLowerCase(Locale.ROOT)).orNull == "hudi" +isHoodieTable(table.provider.map(_.toLowerCase(Locale.ROOT)).orNull) } def isHoodieTable(tableId: TableIdentifier, spark: SparkSession): Boolean = { @@ -162,6 +162,10 @@ trait SparkAdapter extends Serializable { isHoodieTable(table) } + def isHoodieTable(provider: String): 
Boolean = { +"hudi".equalsIgnoreCase(provider) + } + /** * Create instance of [[ParquetFileFormat]] */ diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieOptionConfig.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieOptionConfig.scala index d715a108d62..abe98bb46cf 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieOptionConfig.scala +++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieOptionConfig.scala @@ -182,6 +182,14 @@ object HoodieOptionConfig { options.filterNot(_._1.startsWith("hoodie.")).filterNot(kv => sqlOptionKeyToWriteConfigKey.contains(kv._1)) } + /** + * The opposite of `deleteHoodieOptions`, this method extract all hoodie related + * options(start with `hoodie.` and all sql options) + */ + def extractHoodieOptions(options: Map[String, String]): Map[String, String] = { +options.filter(_._1.startsWith("hoodie.")) ++ extractSqlOptions(options) + } + // extract primaryKey, preCombineField, type options def extractSqlOptions(options: Map[String, String]): Map[String, String] = {
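To illustrate the `extractHoodieOptions` helper introduced by this commit, here is a hypothetical input and the result implied by its definition in the diff (`extractSqlOptions` keeps the `primaryKey`, `preCombineField` and `type` options):

```scala
// Hypothetical inputs: extractHoodieOptions keeps "hoodie."-prefixed keys plus the sql options.
val options = Map(
  "hoodie.table.name" -> "t1",     // kept: "hoodie." prefix
  "primaryKey"        -> "id",     // kept: sql option, per extractSqlOptions
  "path"              -> "/tmp/t1" // dropped: neither
)
// HoodieOptionConfig.extractHoodieOptions(options)
//   == Map("hoodie.table.name" -> "t1", "primaryKey" -> "id")
```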
[jira] [Updated] (HUDI-2871) Decouple metrics dependencies from hudi-client-common
[ https://issues.apache.org/jira/browse/HUDI-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-2871: - Fix Version/s: 1.1.0 (was: 1.0.0) > Decouple metrics dependencies from hudi-client-common > - > > Key: HUDI-2871 > URL: https://issues.apache.org/jira/browse/HUDI-2871 > Project: Apache Hudi > Issue Type: Improvement > Components: code-quality, dependencies, metrics, writer-core >Reporter: Vinoth Chandar >Assignee: Sagar Sumit >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.0, 1.1.0 > > > Metrics dependencies (Cloudwatch, Graphite, Prometheus, etc.) are all > pulled in. > It might be good to break these out into their own modules and include them during > packaging. This needs some way of reflection-based instantiation of the > metrics reporter. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6483) MERGE INTO should support schema evolution for partial updates.
[ https://issues.apache.org/jira/browse/HUDI-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-6483: - Fix Version/s: 1.1.0 > MERGE INTO should support schema evolution for partial updates. > --- > > Key: HUDI-6483 > URL: https://issues.apache.org/jira/browse/HUDI-6483 > Project: Apache Hudi > Issue Type: Improvement > Components: spark-sql >Reporter: Aditya Goenka >Priority: Major > Fix For: 1.1.0, 0.15.0 > > > The following code is an example of doing MERGE INTO along with schema evolution, > which is not yet supported by Hudi. Currently, Hudi tries to use the target table > schema during MERGE INTO. > The following code should be supported: > ``` > create table test_insert3 ( > id int, > name string, > updated_at timestamp > ) using hudi > options ( > type = 'cow', > primaryKey = 'id', > preCombineField = 'updated_at' > ) location 'file:///tmp/test_insert3'; > merge into test_insert3 as target > using ( > select 1 as id, 'c' as name, 1 as new_col, current_timestamp as updated_at > ) source > on target.id = source.id > when matched then update set target.new_col = source.new_col > when not matched then insert *; > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2687) [UMBRELLA] A new Trino connector for Hudi
[ https://issues.apache.org/jira/browse/HUDI-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-2687: - Fix Version/s: 1.0.0 (was: 1.1.0) > [UMBRELLA] A new Trino connector for Hudi > - > > Key: HUDI-2687 > URL: https://issues.apache.org/jira/browse/HUDI-2687 > Project: Apache Hudi > Issue Type: Epic > Components: trino-presto >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Critical > Labels: hudi-umbrellas > Fix For: 0.14.0, 1.0.0, 0.15.0 > > Attachments: image-2021-11-05-14-16-57-324.png, > image-2021-11-05-14-17-03-211.png > > > This JIRA tracks all the tasks related to building a new Hudi connector in > Trino. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-1574) Trim existing unit tests to finish in much shorter amount of time
[ https://issues.apache.org/jira/browse/HUDI-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-1574: - Fix Version/s: 1.1.0 (was: 1.0.0) > Trim existing unit tests to finish in much shorter amount of time > - > > Key: HUDI-1574 > URL: https://issues.apache.org/jira/browse/HUDI-1574 > Project: Apache Hudi > Issue Type: Epic > Components: Testing, tests-ci >Affects Versions: 0.9.0 >Reporter: Vinoth Chandar >Priority: Critical > Fix For: 1.1.0, 0.15.0 > > > spark-client-tests > 278.165 s - in org.apache.hudi.table.TestHoodieMergeOnReadTable > 201.628 s - in org.apache.hudi.metadata.TestHoodieBackedMetadata > 185.716 s - in org.apache.hudi.client.TestHoodieClientOnCopyOnWriteStorage > 158.361 s - in org.apache.hudi.index.TestHoodieIndex > 156.196 s - in org.apache.hudi.table.TestCleaner > 132.369 s - in > org.apache.hudi.table.action.commit.TestCopyOnWriteActionExecutor > 93.307 s - in org.apache.hudi.table.action.compact.TestAsyncCompaction > 67.301 s - in org.apache.hudi.table.upgrade.TestUpgradeDowngrade > 45.794 s - in org.apache.hudi.client.TestHoodieReadClient > 38.615 s - in org.apache.hudi.index.bloom.TestHoodieBloomIndex > 31.181 s - in org.apache.hudi.client.TestTableSchemaEvolution > 20.072 s - in org.apache.hudi.table.action.compact.TestInlineCompaction > grep " Time elapsed" hudi-client/hudi-spark-client/target/surefire-reports/* > | awk -F',' ' { print $5 } ' | awk -F':' ' { print $2 } ' | sort -nr | less > hudi-utilities > 209.936 s - in org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer > 204.653 s - in > org.apache.hudi.utilities.functional.TestHoodieMultiTableDeltaStreamer > 34.116 s - in org.apache.hudi.utilities.sources.TestKafkaSource > 29.865 s - in org.apache.hudi.utilities.sources.TestParquetDFSSource > 26.189 s - in > org.apache.hudi.utilities.sources.helpers.TestDatePartitionPathSelector > Other Tests > 42.595 s - in org.apache.hudi.common.functional.TestHoodieLogFormat > 38.918 s - in org.apache.hudi.common.bootstrap.TestBootstrapIndex > 22.046 s - in > org.apache.hudi.common.functional.TestHoodieLogFormatAppendFailure -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-1457) Add multi writing to Hudi tables using DFS based locking (only HDFS atomic renames)
[ https://issues.apache.org/jira/browse/HUDI-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754349#comment-17754349 ] Vinoth Chandar commented on HUDI-1457: -- this does not work on cloud storage, since we cannot rely just on atomic puts. Just a note for anyone who is picking this up. > Add multi writing to Hudi tables using DFS based locking (only HDFS atomic > renames) > --- > > Key: HUDI-1457 > URL: https://issues.apache.org/jira/browse/HUDI-1457 > Project: Apache Hudi > Issue Type: New Feature > Components: writer-core >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Fix For: 1.1.0, 0.15.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
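The comment above is easiest to see in code: on HDFS a lock can be acquired with an atomic create (or rename), while on object stores a plain PUT silently overwrites, giving no mutual exclusion. A minimal sketch of the HDFS-only scheme, not the ticket's actual implementation:

```scala
import java.io.IOException
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// createNewFile is atomic on HDFS, so at most one writer wins; on S3 a plain PUT
// silently overwrites, which is exactly why this scheme is HDFS-only.
class DfsLockSketch(lockPath: Path, conf: Configuration) {
  private val fs: FileSystem = lockPath.getFileSystem(conf)

  def tryLock(): Boolean =
    try fs.createNewFile(lockPath) // returns false if the lock file already exists
    catch { case _: IOException => false }

  def unlock(): Unit = fs.delete(lockPath, false)
}
```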
[jira] [Updated] (HUDI-3057) Instants should be generated strictly under locks
[ https://issues.apache.org/jira/browse/HUDI-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-3057: - Fix Version/s: 1.1.0 > Instants should be generated strictly under locks > - > > Key: HUDI-3057 > URL: https://issues.apache.org/jira/browse/HUDI-3057 > Project: Apache Hudi > Issue Type: Bug > Components: multi-writer, writer-core >Reporter: Alexey Kudinkin >Assignee: sivabalan narayanan >Priority: Major > Labels: sev:high > Fix For: 0.14.0, 1.1.0 > > Attachments: logs.txt > > > While looking into the flakiness of the tests outlined here: > https://issues.apache.org/jira/browse/HUDI-3043 > > I've stumbled upon following failure where one of the writers tries to > complete the Commit but it couldn't b/c such file does already exist: > {code:java} > java.util.concurrent.ExecutionException: java.lang.RuntimeException: > org.apache.hudi.exception.HoodieIOException: Failed to create file > /var/folders/kb/cnff55vj041g2nnlzs5ylqk0gn/T/junit5142536255031969586/testtable_MERGE_ON_READ/.hoodie/20211217150157632.commit > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:192) > at > org.apache.hudi.utilities.functional.TestHoodieDeltaStreamerWithMultiWriter.runJobsInParallel(TestHoodieDeltaStreamerWithMultiWriter.java:336) > at > org.apache.hudi.utilities.functional.TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWriters(TestHoodieDeltaStreamerWithMultiWriter.java:150) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688) > at > org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131) > at > org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149) > at > org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140) > at > org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:92) > at > org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115) > at > org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37) > at > org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104) > at > org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:212) > at > 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:208) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:137) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:71) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$5(NodeTestTask.java:139) > at > org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$7(NodeTestTask.java:129) > at > org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:127) > at > org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.
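The invariant in the ticket title can be sketched as follows: minting the instant time and publishing the requested instant must happen in one critical section, so two writers can never produce the same timestamp. This is schematic only, assuming a `createNewInstantTime`-style generator on the active timeline; it is not the eventual fix:

```scala
import java.util.concurrent.locks.Lock
import org.apache.hudi.common.table.timeline.HoodieActiveTimeline

object InstantLockSketch {
  // Schematic: mint the instant time and publish the requested instant in one critical section.
  def createInstantUnderLock(lock: Lock)(publish: String => Unit): String = {
    lock.lock()
    try {
      val instantTime = HoodieActiveTimeline.createNewInstantTime()
      publish(instantTime) // e.g. write the .requested file before releasing the lock
      instantTime
    } finally lock.unlock()
  }
}
```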
[jira] [Updated] (HUDI-1457) Add multi writing to Hudi tables using DFS based locking (only HDFS atomic renames)
[ https://issues.apache.org/jira/browse/HUDI-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-1457: - Fix Version/s: (was: 0.15.0) > Add multi writing to Hudi tables using DFS based locking (only HDFS atomic > renames) > --- > > Key: HUDI-1457 > URL: https://issues.apache.org/jira/browse/HUDI-1457 > Project: Apache Hudi > Issue Type: New Feature > Components: writer-core >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Fix For: 1.1.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4068) Add Cosmos based lock provider for Azure
[ https://issues.apache.org/jira/browse/HUDI-4068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-4068: - Fix Version/s: 1.1.0 (was: 1.0.0) > Add Cosmos based lock provider for Azure > > > Key: HUDI-4068 > URL: https://issues.apache.org/jira/browse/HUDI-4068 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Priority: Major > Fix For: 1.1.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4067) Add Spanner based lock provider for GCP
[ https://issues.apache.org/jira/browse/HUDI-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-4067: - Fix Version/s: 1.1.0 (was: 1.0.0) > Add Spanner based lock provider for GCP > --- > > Key: HUDI-4067 > URL: https://issues.apache.org/jira/browse/HUDI-4067 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Priority: Major > Labels: concurrency, multi-writer > Fix For: 1.1.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-2173) Enhancing DynamoDB based LockProvider
[ https://issues.apache.org/jira/browse/HUDI-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-2173: - Fix Version/s: 1.1.0 > Enhancing DynamoDB based LockProvider > - > > Key: HUDI-2173 > URL: https://issues.apache.org/jira/browse/HUDI-2173 > Project: Apache Hudi > Issue Type: New Feature > Components: writer-core >Reporter: Vinoth Chandar >Assignee: Dave Hagman >Priority: Major > Fix For: 0.14.0, 1.1.0 > > > Currently, we have ZK and HMS based Lock providers, which are limited to > coordinating within a single EMR or Hadoop cluster. > For AWS users, DynamoDB is a readily available, fully managed, geo-replicated > datastore that can be used to hold locks spanning EMR/Hadoop clusters. > This effort involves supporting a new `DynamoDB` lock provider that > implements org.apache.hudi.common.lock.LockProvider. We can place the > implementation itself in hudi-client-common, so it can be used across Spark, > Flink, Deltastreamer etc. -- This message was sent by Atlassian Jira (v8.20.10#820010)
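The reason DynamoDB fits this role is its conditional writes: a `PutItem` guarded by `attribute_not_exists` succeeds for exactly one caller. A bare-bones sketch of that primitive with the AWS SDK v2 follows; the table name, key attribute, and lack of lease/TTL handling are all simplifications, and a real provider would implement `org.apache.hudi.common.lock.LockProvider`:

```scala
import scala.collection.JavaConverters._
import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.{AttributeValue, ConditionalCheckFailedException, DeleteItemRequest, PutItemRequest}

// The conditional put is the whole trick: only one caller's putItem can succeed.
class DynamoDbLockSketch(client: DynamoDbClient, table: String, lockKey: String) {
  private def attr(s: String) = AttributeValue.builder().s(s).build()
  private val key = Map("lockKey" -> attr(lockKey)).asJava

  def tryLock(owner: String): Boolean =
    try {
      client.putItem(PutItemRequest.builder()
        .tableName(table)
        .item(Map("lockKey" -> attr(lockKey), "owner" -> attr(owner)).asJava)
        .conditionExpression("attribute_not_exists(lockKey)")
        .build())
      true
    } catch { case _: ConditionalCheckFailedException => false }

  def unlock(): Unit =
    client.deleteItem(DeleteItemRequest.builder().tableName(table).key(key).build())
}
```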
[jira] [Updated] (HUDI-2687) [UMBRELLA] A new Trino connector for Hudi
[ https://issues.apache.org/jira/browse/HUDI-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-2687: - Fix Version/s: 1.1.0 (was: 1.0.0) > [UMBRELLA] A new Trino connector for Hudi > - > > Key: HUDI-2687 > URL: https://issues.apache.org/jira/browse/HUDI-2687 > Project: Apache Hudi > Issue Type: Epic > Components: trino-presto >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Critical > Labels: hudi-umbrellas > Fix For: 0.14.0, 1.1.0, 0.15.0 > > Attachments: image-2021-11-05-14-16-57-324.png, > image-2021-11-05-14-17-03-211.png > > > This JIRA tracks all the tasks related to building a new Hudi connector in > Trino. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hussein-awala commented on a diff in pull request #9441: [HUDI-6676][DOCS] Add command for CreateHoodieTableLike
hussein-awala commented on code in PR #9441: URL: https://github.com/apache/hudi/pull/9441#discussion_r1294085403 ## website/docs/quick-start-guide.md: ## @@ -384,6 +384,68 @@ create table hudi_ctas_cow_pt_tbl2 using hudi location 'file:/tmp/hudi/hudi_tbl/ partitioned by (datestr) as select * from parquet_mngd; ``` +**CREATE TABLE LIKE** + +The "CREATE TABLE LIKE" statement allows you to create a new Hudi table with the same schema and properties from an existing Hudi/hive table. Review Comment: Nit ```suggestion The `CREATE TABLE LIKE` statement allows you to create a new Hudi table with the same schema and properties from an existing Hudi/hive table. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-4141) [RFC-64] Table Format APIs
[ https://issues.apache.org/jira/browse/HUDI-4141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-4141: - Start Date: 4/Sep/23 Due Date: 4/Oct/23 > [RFC-64] Table Format APIs > -- > > Key: HUDI-4141 > URL: https://issues.apache.org/jira/browse/HUDI-4141 > Project: Apache Hudi > Issue Type: Epic > Components: reader-core, writer-core >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Critical > Fix For: 1.0.0 > > > RFC: [https://github.com/apache/hudi/pull/7080] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hussein-awala commented on a diff in pull request #9444: [HUDI-6692] Do not allow switching from Primary keyed table to primary key less table
hussein-awala commented on code in PR #9444: URL: https://github.com/apache/hudi/pull/9444#discussion_r1294078912 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala: ## @@ -179,9 +179,11 @@ object HoodieWriterUtils { if (null != tableConfig) { val datasourceRecordKey = params.getOrElse(RECORDKEY_FIELD.key(), null) val tableConfigRecordKey = tableConfig.getString(HoodieTableConfig.RECORDKEY_FIELDS) -if ((null != datasourceRecordKey && null != tableConfigRecordKey - && datasourceRecordKey != tableConfigRecordKey) || (null != datasourceRecordKey && datasourceRecordKey.nonEmpty - && tableConfigRecordKey == null)) { +val dsnull = datasourceRecordKey == null +val tcnull = tableConfigRecordKey == null +if ((!dsnull && !tcnull && datasourceRecordKey != tableConfigRecordKey) + || (!dsnull && datasourceRecordKey.nonEmpty + && tcnull) || ((dsnull || datasourceRecordKey.isEmpty) && !tcnull)) { Review Comment: I'm not sure, but I wonder if tableConfigRecordKey could be empty string ```suggestion && tcnull) || ((dsnull || datasourceRecordKey.isEmpty) && !tcnull && tableConfigRecordKey.nonEmpty)) { ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
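The three-way null/empty check above, including the suggested extra guard, collapses if both sides are normalized first. A compact sketch of the equivalent predicate (names illustrative, not the PR's code):

```scala
object RecordKeyCheckSketch {
  // Treat null and "" both as "no record key", then compare.
  def normalize(key: String): Option[String] = Option(key).filter(_.nonEmpty)

  def keyConfigChanged(datasourceRecordKey: String, tableConfigRecordKey: String): Boolean =
    normalize(datasourceRecordKey) != normalize(tableConfigRecordKey)
}

// RecordKeyCheckSketch.keyConfigChanged("id", null) == true  -> adding a key to a keyless table: rejected
// RecordKeyCheckSketch.keyConfigChanged(null, "id") == true  -> dropping the key from a keyed table: rejected
// RecordKeyCheckSketch.keyConfigChanged("", "")     == false -> both keyless, covering the empty-string case
```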
[GitHub] [hudi] hudi-bot commented on pull request #9445: [HUDI-6694] Fix log file CLI around command blocks
hudi-bot commented on PR #9445: URL: https://github.com/apache/hudi/pull/9445#issuecomment-1678258723 ## CI report: * 06d72d5563b9cd26e131c3907dcc653e59a2b8be Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19293) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9445: [HUDI-6694] Fix log file CLI around command blocks
hudi-bot commented on PR #9445: URL: https://github.com/apache/hudi/pull/9445#issuecomment-1678253584 ## CI report: * 06d72d5563b9cd26e131c3907dcc653e59a2b8be UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9437: [HUDI-6689] Add record index validation in MDT validator
hudi-bot commented on PR #9437: URL: https://github.com/apache/hudi/pull/9437#issuecomment-1678253520 ## CI report: * b25b5402c1e3e14264c6bbfd38910f4b93b8a871 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19291) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hussein-awala commented on a diff in pull request #9403: [HUDI-6683] Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource
hussein-awala commented on code in PR #9403: URL: https://github.com/apache/hudi/pull/9403#discussion_r1294069377 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/AvroConvertor.java: ## @@ -175,9 +176,11 @@ public GenericRecord withKafkaFieldsAppended(ConsumerRecord consumerRecord) { for (Schema.Field field : record.getSchema().getFields()) { recordBuilder.set(field, record.get(field.name())); } + recordBuilder.set(KAFKA_SOURCE_OFFSET_COLUMN, consumerRecord.offset()); recordBuilder.set(KAFKA_SOURCE_PARTITION_COLUMN, consumerRecord.partition()); recordBuilder.set(KAFKA_SOURCE_TIMESTAMP_COLUMN, consumerRecord.timestamp()); +recordBuilder.set(KAFKA_SOURCE_KEY_COLUMN, String.valueOf(consumerRecord.key())); Review Comment: ```suggestion recordBuilder.set(KAFKA_SOURCE_KEY_COLUMN, consumerRecord.key().toString()); ``` ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java: ## @@ -80,11 +81,13 @@ protected JavaRDD maybeAppendKafkaOffsets(JavaRDD { String record = consumerRecord.value().toString(); Review Comment: I think renaming this variable to `recordValue` might make the code more readable: ```suggestion String recordValue = consumerRecord.value().toString(); ``` ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java: ## @@ -80,11 +81,13 @@ protected JavaRDD maybeAppendKafkaOffsets(JavaRDD { String record = consumerRecord.value().toString(); + String recordKey = (String) consumerRecord.key(); Review Comment: ```suggestion String recordKey = consumerRecord.key().toString(); ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
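One subtlety worth noting about the suggestions above: Kafka keys may legitimately be null, in which case `consumerRecord.key().toString()` throws an NPE while `String.valueOf(...)` yields the literal string "null". A null-safe sketch that keeps a missing key as null instead (helper name is illustrative):

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord

object KafkaKeySketch {
  // Propagate a missing Kafka key as null rather than the string "null" or an NPE.
  def recordKeyOrNull(record: ConsumerRecord[AnyRef, AnyRef]): String =
    Option(record.key()).map(_.toString).orNull
}
```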
[GitHub] [hudi] yihua opened a new pull request, #5269: [HUDI-3636] Create new write clients for async table services in DeltaStreamer and Spark streaming sink
yihua opened a new pull request, #5269: URL: https://github.com/apache/hudi/pull/5269 ## What is the purpose of the pull request - In Deltastreamer, we re-instantiate WriteClient whenever the schema changes. The same write client is used by all async table services as well. This poses an issue: the re-instantiated write client is communicated to the async table service, but if the async table service is in the middle of compaction, it uses a local copy of the write client and hence may not be able to reach the timeline server and will run into connection issues. We are fixing this in this patch. - We have a singleton instance of the embedded timeline service which regular writers and all table services will use. And within async table services, we will listen to write config changes and re-instantiate the write client before any new compaction execution. - Even between multiple re-instantiations of write clients for the regular writer (due to schema change), the same singleton embedded timeline server is used. - Previously the embedded timeline server was shut down when the write client was shut down. Fixed that in this patch, so that a single instantiation and tear-down of the embedded timeline server spans the entire process start and stop. - This also fixes a long-standing issue w/ Spark structured streaming. Apparently, this is what is happening in the Spark structured streaming flow. We start a new write client during the first batch and close it at the end, but keep re-using the same instance of writeClient for subsequent batches. The only core entity impacted here was the embedded timeline server, since we were closing it when the write client was closed. So, after batch1, if the timeline server was enabled, the pipeline would fail since the timeline server was shut down. So, in this patch we are fixing that as well. The embedded timeline server is externally instantiated, so writeClient.close() will not shut down the timeline server. We have a singleton instance of the timeline server through the entire pipeline. Previously we hard-coded DIRECT style markers for Spark streaming, but after this patch, we should be able to relax that. ## Brief change log - Fixed Deltastreamer and Spark streaming sink to ensure the timeline server sustains multiple instantiations of write clients by different writers. ## Verify this pull request This change added tests and can be verified as follows: - *Manually verified the change by running a job locally.* - For structured streaming, existing tests cover all flows. ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
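The lifecycle change described in this PR is essentially a process-lifetime singleton that individual write clients no longer own. A generic sketch of that shape (all names hypothetical; Hudi's actual embedded timeline service wiring differs):

```scala
// One service instance spans the whole process; writeClient.close() must not stop it.
object TimelineServiceHolder {
  private var service: Option[AutoCloseable] = None

  def getOrStart(start: () => AutoCloseable): AutoCloseable = synchronized {
    service.getOrElse { val s = start(); service = Some(s); s }
  }

  // Called once at process shutdown, never per write client.
  def stopAll(): Unit = synchronized {
    service.foreach(_.close())
    service = None
  }
}
```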
[GitHub] [hudi] nfarah86 opened a new pull request, #9446: updated image path for /blog
nfarah86 opened a new pull request, #9446: URL: https://github.com/apache/hudi/pull/9446 ### Change Logs fixed broken images: https://github.com/apache/hudi/assets/5392555/055efb07-c4bc-4727-a4e4-bdb81fdbf546 @nsivabalan please review -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6694) Fix log file CLI around command blocks
[ https://issues.apache.org/jira/browse/HUDI-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6694: - Labels: pull-request-available (was: ) > Fix log file CLI around command blocks > -- > > Key: HUDI-6694 > URL: https://issues.apache.org/jira/browse/HUDI-6694 > Project: Apache Hudi > Issue Type: Bug >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > > When there are rollback command blocks in the log files, the log file command > throws NPE: > {code:java} > hudi:hoodie_table->show logfile metadata --logFilePathPattern > file:/.1414abd2-346b-4c84-b380-c6ea6ec0863a-0_20230813220941456.log* > java.lang.NullPointerException > at java.util.Objects.requireNonNull(Objects.java:203) > at > org.apache.hudi.cli.commands.HoodieLogFileCommand.showLogFileCommits(HoodieLogFileCommand.java:102) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.springframework.shell.command.invocation.InvocableShellMethod.doInvoke(InvocableShellMethod.java:306) > at > org.springframework.shell.command.invocation.InvocableShellMethod.invoke(InvocableShellMethod.java:232) > at > org.springframework.shell.command.CommandExecution$DefaultCommandExecution.evaluate(CommandExecution.java:158) > at org.springframework.shell.Shell.evaluate(Shell.java:208) > at org.springframework.shell.Shell.run(Shell.java:140) > at > org.springframework.shell.jline.InteractiveShellRunner.run(InteractiveShellRunner.java:73) > at > org.springframework.shell.DefaultShellApplicationRunner.run(DefaultShellApplicationRunner.java:65) > at > org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:762) > at > org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:752) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:315) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1306) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1295) > at org.apache.hudi.cli.Main.main(Main.java:34) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] yihua opened a new pull request, #9445: [HUDI-6694] Fix log file CLI around command blocks
yihua opened a new pull request, #9445: URL: https://github.com/apache/hudi/pull/9445 ### Change Logs This PR fixes the log file CLI commands when the log file contains command blocks like rollback commands. The tests are adjusted to consider such a scenario. Without the fix, the new tests fail. Before the fix, when there are rollback command blocks in the log files, the log file command throws NPE: ``` hudi:hoodie_table->show logfile metadata --logFilePathPattern file:/.1414abd2-346b-4c84-b380-c6ea6ec0863a-0_20230813220941456.log* java.lang.NullPointerException at java.util.Objects.requireNonNull(Objects.java:203) at org.apache.hudi.cli.commands.HoodieLogFileCommand.showLogFileCommits(HoodieLogFileCommand.java:102) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.springframework.shell.command.invocation.InvocableShellMethod.doInvoke(InvocableShellMethod.java:306) at org.springframework.shell.command.invocation.InvocableShellMethod.invoke(InvocableShellMethod.java:232) at org.springframework.shell.command.CommandExecution$DefaultCommandExecution.evaluate(CommandExecution.java:158) at org.springframework.shell.Shell.evaluate(Shell.java:208) at org.springframework.shell.Shell.run(Shell.java:140) at org.springframework.shell.jline.InteractiveShellRunner.run(InteractiveShellRunner.java:73) at org.springframework.shell.DefaultShellApplicationRunner.run(DefaultShellApplicationRunner.java:65) at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:762) at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:752) at org.springframework.boot.SpringApplication.run(SpringApplication.java:315) at org.springframework.boot.SpringApplication.run(SpringApplication.java:1306) at org.springframework.boot.SpringApplication.run(SpringApplication.java:1295) at org.apache.hudi.cli.Main.main(Main.java:34) ``` ### Impact Bug fix on log file CLI. ### Risk level none ### Documentation Update N/A ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
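The stack trace points at an unconditional `Objects.requireNonNull` being hit while scanning a command block. One plausible shape of a defensive read is sketched below, assuming the log block header API exposes `getLogBlockHeader`; this is an illustration, not the PR's actual diff:

```scala
import org.apache.hudi.common.table.log.block.HoodieLogBlock
import org.apache.hudi.common.table.log.block.HoodieLogBlock.HeaderMetadataType

object LogBlockSketch {
  // Resolve a block's instant time defensively instead of requireNonNull-ing it.
  def instantTimeOf(block: HoodieLogBlock): Option[String] =
    Option(block.getLogBlockHeader.get(HeaderMetadataType.INSTANT_TIME))
}
```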
[jira] [Updated] (HUDI-6694) Fix log file CLI around command blocks
[ https://issues.apache.org/jira/browse/HUDI-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-6694:
    Description:
When there are rollback command blocks in the log files, the log file command throws NPE:
{code:java}
hudi:hoodie_table->show logfile metadata --logFilePathPattern file:/.1414abd2-346b-4c84-b380-c6ea6ec0863a-0_20230813220941456.log*
java.lang.NullPointerException
	at java.util.Objects.requireNonNull(Objects.java:203)
	at org.apache.hudi.cli.commands.HoodieLogFileCommand.showLogFileCommits(HoodieLogFileCommand.java:102)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.springframework.shell.command.invocation.InvocableShellMethod.doInvoke(InvocableShellMethod.java:306)
	at org.springframework.shell.command.invocation.InvocableShellMethod.invoke(InvocableShellMethod.java:232)
	at org.springframework.shell.command.CommandExecution$DefaultCommandExecution.evaluate(CommandExecution.java:158)
	at org.springframework.shell.Shell.evaluate(Shell.java:208)
	at org.springframework.shell.Shell.run(Shell.java:140)
	at org.springframework.shell.jline.InteractiveShellRunner.run(InteractiveShellRunner.java:73)
	at org.springframework.shell.DefaultShellApplicationRunner.run(DefaultShellApplicationRunner.java:65)
	at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:762)
	at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:752)
	at org.springframework.boot.SpringApplication.run(SpringApplication.java:315)
	at org.springframework.boot.SpringApplication.run(SpringApplication.java:1306)
	at org.springframework.boot.SpringApplication.run(SpringApplication.java:1295)
	at org.apache.hudi.cli.Main.main(Main.java:34)
{code}

  was: When there are rollback command blocks in the log files, the

> Fix log file CLI around command blocks
> ---------------------------------------
>
>                 Key: HUDI-6694
>                 URL: https://issues.apache.org/jira/browse/HUDI-6694
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
>
> When there are rollback command blocks in the log files, the log file command throws NPE:
> {code:java}
> hudi:hoodie_table->show logfile metadata --logFilePathPattern file:/.1414abd2-346b-4c84-b380-c6ea6ec0863a-0_20230813220941456.log*
> java.lang.NullPointerException
> 	at java.util.Objects.requireNonNull(Objects.java:203)
> 	at org.apache.hudi.cli.commands.HoodieLogFileCommand.showLogFileCommits(HoodieLogFileCommand.java:102)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.springframework.shell.command.invocation.InvocableShellMethod.doInvoke(InvocableShellMethod.java:306)
> 	at org.springframework.shell.command.invocation.InvocableShellMethod.invoke(InvocableShellMethod.java:232)
> 	at org.springframework.shell.command.CommandExecution$DefaultCommandExecution.evaluate(CommandExecution.java:158)
> 	at org.springframework.shell.Shell.evaluate(Shell.java:208)
> 	at org.springframework.shell.Shell.run(Shell.java:140)
> 	at org.springframework.shell.jline.InteractiveShellRunner.run(InteractiveShellRunner.java:73)
> 	at org.springframework.shell.DefaultShellApplicationRunner.run(DefaultShellApplicationRunner.java:65)
> 	at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:762)
> 	at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:752)
> 	at org.springframework.boot.SpringApplication.run(SpringApplication.java:315)
> 	at org.springframework.boot.SpringApplication.run(SpringApplication.java:1306)
> 	at org.springframework.boot.SpringApplication.run(SpringApplication.java:1295)
> 	at org.apache.hudi.cli.Main.main(Main.java:34)
> {code}

-- This message was sent by Atlassian Jira (v8.20.10#820010)
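The `requireNonNull` frame suggests a value derived from a log block is null here: rollback command blocks carry no records, so any per-block payload needs a default before use. A minimal sketch of that defensive pattern follows; the names are hypothetical and this is not necessarily the actual fix.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not Hudi's API: command blocks (e.g. rollback blocks)
// carry no records, so a per-block record list can be null and must be
// defaulted before any requireNonNull-style use.
public class CommandBlockGuard {
  static List<String> recordsOrEmpty(List<String> maybeNullRecords) {
    return maybeNullRecords == null ? Collections.emptyList() : maybeNullRecords;
  }

  public static void main(String[] args) {
    List<String> fromRollbackBlock = null;                        // what a command block yields
    System.out.println(recordsOrEmpty(fromRollbackBlock).size()); // prints 0, no NPE
  }
}
```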
[jira] [Updated] (HUDI-6694) Fix log file CLI around command blocks
[ https://issues.apache.org/jira/browse/HUDI-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6694: Description: When there are rollback command blocks in the log files, the > Fix log file CLI around command blocks > -- > > Key: HUDI-6694 > URL: https://issues.apache.org/jira/browse/HUDI-6694 > Project: Apache Hudi > Issue Type: Bug >Reporter: Ethan Guo >Priority: Major > > When there are rollback command blocks in the log files, the -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6694) Fix log file CLI around command blocks
Ethan Guo created HUDI-6694: --- Summary: Fix log file CLI around command blocks Key: HUDI-6694 URL: https://issues.apache.org/jira/browse/HUDI-6694 Project: Apache Hudi Issue Type: Bug Reporter: Ethan Guo -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6694) Fix log file CLI around command blocks
[ https://issues.apache.org/jira/browse/HUDI-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-6694: --- Assignee: Ethan Guo > Fix log file CLI around command blocks > -- > > Key: HUDI-6694 > URL: https://issues.apache.org/jira/browse/HUDI-6694 > Project: Apache Hudi > Issue Type: Bug >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > > When there are rollback command blocks in the log files, the -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] yihua commented on a diff in pull request #9437: [HUDI-6689] Add record index validation in MDT validator
yihua commented on code in PR #9437: URL: https://github.com/apache/hudi/pull/9437#discussion_r1294049901

## hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java:
## @@ -741,6 +791,116 @@ private void validateBloomFilters(
     validate(metadataBasedBloomFilters, fsBasedBloomFilters, partitionPath, "bloom filters");
   }

+  private void validateRecordIndex(HoodieSparkEngineContext sparkEngineContext,
+                                   HoodieTableMetaClient metaClient,
+                                   HoodieTableMetadata tableMetadata) {
+    if (cfg.validateRecordIndexContent) {
+      validateRecordIndexContent(sparkEngineContext, metaClient, tableMetadata);
+    } else if (cfg.validateRecordIndexCount) {
+      validateRecordIndexCount(sparkEngineContext, metaClient);
+    }
+  }
+
+  private void validateRecordIndexCount(HoodieSparkEngineContext sparkEngineContext,
+                                        HoodieTableMetaClient metaClient) {
+    String basePath = metaClient.getBasePathV2().toString();
+    long countKeyFromTable = sparkEngineContext.getSqlContext().read().format("hudi")
+        .load(basePath)
+        .select(RECORD_KEY_METADATA_FIELD)
+        .distinct()
+        .count();
+    long countKeyFromRecordIndex = sparkEngineContext.getSqlContext().read().format("hudi")
+        .load(getMetadataTableBasePath(basePath))
+        .select("key")
+        .filter("type = 5")
+        .distinct()
+        .count();
+
+    if (countKeyFromTable != countKeyFromRecordIndex) {
+      String message = String.format("Validation of record index count failed: "
+          + "%s entries from record index metadata, %s keys from the data table.",
+          countKeyFromRecordIndex, countKeyFromTable);
+      LOG.error(message);
+      throw new HoodieValidationException(message);
+    } else {
+      LOG.info(String.format(
+          "Validation of record index count succeeded: %s entries.", countKeyFromRecordIndex));
+    }
+  }
+
+  private void validateRecordIndexContent(HoodieSparkEngineContext sparkEngineContext,
+                                          HoodieTableMetaClient metaClient,
+                                          HoodieTableMetadata tableMetadata) {
+    String basePath = metaClient.getBasePathV2().toString();
+    JavaPairRDD<String, Pair<String, String>> keyToLocationOnFsRdd =
+        sparkEngineContext.getSqlContext().read().format("hudi").load(basePath)
+            .select(RECORD_KEY_METADATA_FIELD, PARTITION_PATH_METADATA_FIELD, FILENAME_METADATA_FIELD)
+            .toJavaRDD()
+            .mapToPair(row -> new Tuple2<>(row.getString(row.fieldIndex(RECORD_KEY_METADATA_FIELD)),
+                Pair.of(row.getString(row.fieldIndex(PARTITION_PATH_METADATA_FIELD)),
+                    FSUtils.getFileId(row.getString(row.fieldIndex(FILENAME_METADATA_FIELD))))))
+            .cache();
+
+    JavaPairRDD<String, Pair<String, String>> keyToLocationFromRecordIndexRdd =
+        sparkEngineContext.getSqlContext().read().format("hudi")
+            .load(getMetadataTableBasePath(basePath))
+            .filter("type = 5")
+            .select(functions.col("key"),
+                functions.col("recordIndexMetadata.partitionName").as("partitionName"),
+                functions.col("recordIndexMetadata.fileIdHighBits").as("fileIdHighBits"),
+                functions.col("recordIndexMetadata.fileIdLowBits").as("fileIdLowBits"),
+                functions.col("recordIndexMetadata.fileIndex").as("fileIndex"),
+                functions.col("recordIndexMetadata.fileId").as("fileId"),
+                functions.col("recordIndexMetadata.instantTime").as("instantTime"),
+                functions.col("recordIndexMetadata.fileIdEncoding").as("fileIdEncoding"))
+            .toJavaRDD()
+            .mapToPair(row -> {
+              HoodieRecordGlobalLocation location = HoodieTableMetadataUtil.getLocationFromRecordIndexInfo(
+                  row.getString(row.fieldIndex("partitionName")),
+                  row.getInt(row.fieldIndex("fileIdEncoding")),
+                  row.getLong(row.fieldIndex("fileIdHighBits")),
+                  row.getLong(row.fieldIndex("fileIdLowBits")),
+                  row.getInt(row.fieldIndex("fileIndex")),
+                  row.getString(row.fieldIndex("fileId")),
+                  row.getLong(row.fieldIndex("instantTime")));
+              return new Tuple2<>(row.getString(row.fieldIndex("key")),
+                  Pair.of(location.getPartitionPath(), location.getFileId()));
+            });
+
+    long diffCount = keyToLocationOnFsRdd.fullOuterJoin(keyToLocationFromRecordIndexRdd, cfg.recordIndexParallelism)
+        .map(e -> {
+          Optional<Pair<String, String>> locationOnFs = e._2._1;
+          Optional<Pair<String, String>> locationFromRecordIndex = e._2._2;
+          if (locationOnFs.isPresent() && locationFromRecordIndex.isPresent()) {
+            if (locationOnFs.get().getLeft().equals(locationFromRecordIndex.get().getLeft())
[GitHub] [hudi] hudi-bot commented on pull request #9444: [HUDI-6692] Do not allow switching from Primary keyed table to primary key less table
hudi-bot commented on PR #9444: URL: https://github.com/apache/hudi/pull/9444#issuecomment-1678162678 ## CI report: * c7e99fd19a00469c0e181b6c64b63aa9cfb7ed4e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19292) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9444: [HUDI-6692] Do not allow switching from Primary keyed table to primary key less table
hudi-bot commented on PR #9444: URL: https://github.com/apache/hudi/pull/9444#issuecomment-1678153770 ## CI report: * c7e99fd19a00469c0e181b6c64b63aa9cfb7ed4e UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9437: [HUDI-6689] Add record index validation in MDT validator
nsivabalan commented on code in PR #9437: URL: https://github.com/apache/hudi/pull/9437#discussion_r1294013795

## hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java:
## @@ -741,6 +791,116 @@ private void validateBloomFilters(
     validate(metadataBasedBloomFilters, fsBasedBloomFilters, partitionPath, "bloom filters");
   }

+  private void validateRecordIndex(HoodieSparkEngineContext sparkEngineContext,
+                                   HoodieTableMetaClient metaClient,
+                                   HoodieTableMetadata tableMetadata) {
+    if (cfg.validateRecordIndexContent) {
+      validateRecordIndexContent(sparkEngineContext, metaClient, tableMetadata);
+    } else if (cfg.validateRecordIndexCount) {
+      validateRecordIndexCount(sparkEngineContext, metaClient);
+    }
+  }
+
+  private void validateRecordIndexCount(HoodieSparkEngineContext sparkEngineContext,
+                                        HoodieTableMetaClient metaClient) {
+    String basePath = metaClient.getBasePathV2().toString();
+    long countKeyFromTable = sparkEngineContext.getSqlContext().read().format("hudi")
+        .load(basePath)
+        .select(RECORD_KEY_METADATA_FIELD)
+        .distinct()
+        .count();
+    long countKeyFromRecordIndex = sparkEngineContext.getSqlContext().read().format("hudi")
+        .load(getMetadataTableBasePath(basePath))
+        .select("key")
+        .filter("type = 5")
+        .distinct()

Review Comment: A snapshot read by itself should return unique values; if there are dups, it's a bug. Can we remove distinct() here?

## hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java:
## @@ -741,6 +791,116 @@ private void validateBloomFilters(
     validate(metadataBasedBloomFilters, fsBasedBloomFilters, partitionPath, "bloom filters");
   }

+  private void validateRecordIndex(HoodieSparkEngineContext sparkEngineContext,
+                                   HoodieTableMetaClient metaClient,
+                                   HoodieTableMetadata tableMetadata) {
+    if (cfg.validateRecordIndexContent) {
+      validateRecordIndexContent(sparkEngineContext, metaClient, tableMetadata);
+    } else if (cfg.validateRecordIndexCount) {
+      validateRecordIndexCount(sparkEngineContext, metaClient);
+    }
+  }
+
+  private void validateRecordIndexCount(HoodieSparkEngineContext sparkEngineContext,
+                                        HoodieTableMetaClient metaClient) {
+    String basePath = metaClient.getBasePathV2().toString();
+    long countKeyFromTable = sparkEngineContext.getSqlContext().read().format("hudi")
+        .load(basePath)
+        .select(RECORD_KEY_METADATA_FIELD)
+        .distinct()
+        .count();
+    long countKeyFromRecordIndex = sparkEngineContext.getSqlContext().read().format("hudi")
+        .load(getMetadataTableBasePath(basePath))
+        .select("key")
+        .filter("type = 5")
+        .distinct()
+        .count();
+
+    if (countKeyFromTable != countKeyFromRecordIndex) {
+      String message = String.format("Validation of record index count failed: "
+          + "%s entries from record index metadata, %s keys from the data table.",
+          countKeyFromRecordIndex, countKeyFromTable);
+      LOG.error(message);
+      throw new HoodieValidationException(message);
+    } else {
+      LOG.info(String.format(
+          "Validation of record index count succeeded: %s entries.", countKeyFromRecordIndex));
+    }
+  }
+
+  private void validateRecordIndexContent(HoodieSparkEngineContext sparkEngineContext,
+                                          HoodieTableMetaClient metaClient,
+                                          HoodieTableMetadata tableMetadata) {
+    String basePath = metaClient.getBasePathV2().toString();
+    JavaPairRDD<String, Pair<String, String>> keyToLocationOnFsRdd =
+        sparkEngineContext.getSqlContext().read().format("hudi").load(basePath)
+            .select(RECORD_KEY_METADATA_FIELD, PARTITION_PATH_METADATA_FIELD, FILENAME_METADATA_FIELD)
+            .toJavaRDD()
+            .mapToPair(row -> new Tuple2<>(row.getString(row.fieldIndex(RECORD_KEY_METADATA_FIELD)),
+                Pair.of(row.getString(row.fieldIndex(PARTITION_PATH_METADATA_FIELD)),
+                    FSUtils.getFileId(row.getString(row.fieldIndex(FILENAME_METADATA_FIELD))))))
+            .cache();
+
+    JavaPairRDD<String, Pair<String, String>> keyToLocationFromRecordIndexRdd =
+        sparkEngineContext.getSqlContext().read().format("hudi")
+            .load(getMetadataTableBasePath(basePath))
+            .filter("type = 5")
+            .select(functions.col("key"),
+                functions.col("recordIndexMetadata.partitionName").as("partitionName"),
+                functions.col("recordIndexMetadata.fileIdHighBits").as("fileIdHighBits"),
+                functions.col("recordIndexMetadata.fileIdLowBits").as("fileIdLowBits"),
+function
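If the reviewer's suggestion is taken, the record-index side of the count check reduces to something like this. A sketch only, not the PR's final code: it mirrors the query under review and assumes the snapshot read of the metadata table already yields one row per key.

```java
import org.apache.spark.sql.SQLContext;

// Sketch of the suggested simplification: no distinct() on the record-index
// read, since a snapshot read should already return unique keys. The
// "type = 5" filter mirrors the code under review; sqlContext and
// metadataTableBasePath are placeholders supplied by the caller.
public class RecordIndexKeyCount {
  static long countRecordIndexKeys(SQLContext sqlContext, String metadataTableBasePath) {
    return sqlContext.read().format("hudi")
        .load(metadataTableBasePath)
        .filter("type = 5")   // record index entries in the metadata table
        .select("key")
        .count();             // duplicates here would indicate a bug, per the review
  }
}
```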
[GitHub] [hudi] prashantwason commented on pull request #9199: [HUDI-6534]Support consistent hashing row writer
prashantwason commented on PR #9199: URL: https://github.com/apache/hudi/pull/9199#issuecomment-1678132749 @hudi-bot run azure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] prashantwason commented on pull request #9199: [HUDI-6534]Support consistent hashing row writer
prashantwason commented on PR #9199: URL: https://github.com/apache/hudi/pull/9199#issuecomment-1678132491 @stream2000 The build is failing due to a test failure caused by this commit. Can you please check? https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/19273/logs/25 This is blocking the 0.14.0 release, so please prioritize if possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6693) Streaming writes fail in quick start w/ 0.14.0
sivabalan narayanan created HUDI-6693:
             Summary: Streaming writes fail in quick start w/ 0.14.0
                 Key: HUDI-6693
                 URL: https://issues.apache.org/jira/browse/HUDI-6693
             Project: Apache Hudi
          Issue Type: Improvement
          Components: spark, writer-core
            Reporter: sivabalan narayanan

Quick start fails with streaming ingestion:
{code:java}
scala> df.writeStream.format("hudi").
     |   options(getQuickstartWriteConfigs).
     |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
     |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     |   option(TABLE_NAME, streamingTableName).
     |   outputMode("append").
     |   option("path", baseStreamingPath).
     |   option("checkpointLocation", checkpointLocation).
     |   trigger(Trigger.Once()).
     |   start()
warning: one deprecation; for details, enable `:setting -deprecation' or `:replay -deprecation'
23/08/10 14:31:09 WARN HoodieStreamingSink: Ignore TableNotFoundException as it is first microbatch.
23/08/10 14:31:09 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
res12: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@75143003

scala> 23/08/10 14:31:10 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
23/08/10 14:31:10 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
23/08/10 14:31:10 WARN AutoRecordKeyGenerationUtils$: Precombine field ts will be ignored with auto record key generation enabled
23/08/10 14:31:10 WARN HoodieWriteConfig: Embedded timeline server is disabled, fallback to use direct marker type for spark
23/08/10 14:31:10 ERROR HoodieStreamingSink: Micro batch id=0 threw following exception:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
LogicalRDD [_hoodie_commit_time#1063, _hoodie_commit_seqno#1064, _hoodie_record_key#1065, _hoodie_partition_path#1066, _hoodie_file_name#1067, begin_lat#1068, begin_lon#1069, driver#1070, end_lat#1071, end_lon#1072, fare#1073, partitionpath#1074, rider#1075, ts#1076L, uuid#1077], true
	at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.throwError(UnsupportedOperationChecker.scala:447)
	at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.$anonfun$checkForBatch$1(UnsupportedOperationChecker.scala:38)
	at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.$anonfun$checkForBatch$1$adapted(UnsupportedOperationChecker.scala:36)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:262)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:262)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:262)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:262)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:262)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:262)
	at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:36)
	at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:69)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$withCachedData$1(QueryExecution.scala:109)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:107)
	at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:107)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimize
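The AnalysisException itself is Spark's generic guard against executing a batch-style plan over a streaming source, independent of Hudi. A minimal standalone illustration, using the built-in rate source as an assumed stand-in stream:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Minimal illustration of the Spark-side check that produces
// "Queries with streaming sources must be executed with writeStream.start()".
// This is generic Spark behavior, not the Hudi sink code itself.
public class StreamingBatchMismatch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("repro").master("local[1]").getOrCreate();
    Dataset<Row> stream = spark.readStream().format("rate").load();
    stream.count(); // throws AnalysisException: batch action on a streaming Dataset
  }
}
```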
[jira] [Updated] (HUDI-6692) If table with recordkey doesn't have recordkey in spark ds write, it will bulk insert by default
[ https://issues.apache.org/jira/browse/HUDI-6692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6692: - Labels: pull-request-available (was: ) > If table with recordkey doesn't have recordkey in spark ds write, it will > bulk insert by default > > Key: HUDI-6692 > URL: https://issues.apache.org/jira/browse/HUDI-6692 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql > Reporter: Jonathan Vexler > Assignee: Jonathan Vexler > Priority: Blocker > Labels: pull-request-available > Fix For: 0.14.0 > > If an existing table has a record key and you write with the Spark datasource without including a record key, the write is treated as primary-key-less and defaults to bulk insert. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] jonvex opened a new pull request, #9444: [HUDI-6692] If pk table has no recordkey in write, it should fail
jonvex opened a new pull request, #9444: URL: https://github.com/apache/hudi/pull/9444

### Change Logs
If the write was missing the record key, it was treated as a primary-key-less write; now it fails.

### Impact
Prevents unexpected behavior.

### Risk level (write none, low medium or high below)
low

### Documentation Update
N/A

### Contributor's checklist
- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6692) If table with recordkey doesn't have recordkey in spark ds write, it will bulk insert by default
Jonathan Vexler created HUDI-6692: - Summary: If table with recordkey doesn't have recordkey in spark ds write, it will bulk insert by default Key: HUDI-6692 URL: https://issues.apache.org/jira/browse/HUDI-6692 Project: Apache Hudi Issue Type: Bug Components: spark-sql Reporter: Jonathan Vexler Assignee: Jonathan Vexler Fix For: 0.14.0 If an existing table has a record key and you write with the Spark datasource without including a record key, the write is treated as primary-key-less and defaults to bulk insert. -- This message was sent by Atlassian Jira (v8.20.10#820010)
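For illustration, the problematic write would look roughly like the following. A sketch only: the table at basePath is assumed to have been created earlier with hoodie.datasource.write.recordkey.field set, and the option keys shown are the standard Hudi datasource configs.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Sketch of the failure mode in HUDI-6692: the existing table was created with
// hoodie.datasource.write.recordkey.field configured, but this append omits it,
// so the writer treats the table as primary-key-less and defaults to bulk insert.
public class MissingRecordKeyWrite {
  static void append(Dataset<Row> df, String basePath) {
    df.write().format("hudi")
        .option("hoodie.table.name", "my_table")                       // assumed table name
        .option("hoodie.datasource.write.precombine.field", "ts")
        // note: no hoodie.datasource.write.recordkey.field here
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
```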
[GitHub] [hudi] Riddle4045 commented on issue #9435: [SUPPORT] Trino can't read tables created by Flink Hudi conector
Riddle4045 commented on issue #9435: URL: https://github.com/apache/hudi/issues/9435#issuecomment-1678089145

@danny0405 I checked the table props in the metastore for a table synced using the Hudi HMS sync tool vs. the Flink table I mentioned, and I see very different properties.

Table props for the table created using the Hudi HMS sync tool:
```
TBL_ID  PARAM_KEY                          PARAM_VALUE
250     EXTERNAL                           TRUE
250     last_commit_time_sync              20230601210025262
250     numFiles                           0
250     spark.sql.sources.provider         hudi
250     spark.sql.sources.schema.numParts  1
250     spark.sql.sources.schema.part.0    {"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},{"name":"rideId","type":"long","nullable":true,"metadata":{}},{"name":"driverId","type":"long","nullable":false,"metadata":{}},{"name":"taxiId","type":"long","nullable":true,"metadata":{}},{"name":"startTime","type":"long","nullable":true,"metadata":{}},{"name":"tip","type":"float","nullable":true,"metadata":{}},{"name":"tolls","type":"float","nullable":true,"metadata":{}},{"name":"totalFare","type":"float","nullable":true,"metadata":{}}]}
250     totalSize                          0
250     transient_lastDdlTime              1685653353
```

HMS props for the Hudi table created using Flink SQL:
```
TBL_ID  PARAM_KEY                       PARAM_VALUE
335     flink.comment
335     flink.connector                 hudi
335     flink.hive_sync.enable          true
335     flink.hive_sync.metastore.uris  thrift://hive-metastore:9083
335     flink.hive_sync.mode            hms
335     flink.partition.keys.0.name     partition
335     flink.path                      abfs://fl...@test.dfs.core.windows.net/hudi/t1hms4
335     flink.schema.0.data-type        VARCHAR(20)
335     flink.schema.0.name             uuid
335     flink.schema.1.data-type        VARCHAR(10)
335     flink.schema.1.name             name
335     flink.schema.2.data-type        INT
335     flink.schema.2.name             age
335     flink.schema.3.data-type        TIMESTAMP(3)
335     flink.schema.3.name             ts
335     flink.schema.4.data-type        VARCHAR(20)
335     flink.schema.4.name             partition
335     flink.table.type                COPY_ON_WRITE
335     transient_lastDdlTime           1691804292
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
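Reading the flink.* properties back, a DDL along the following lines would have produced them. A rough reconstruction only: the column types and sync options mirror the properties above, the table name is taken from the path, the abfs path is left elided exactly as in the comment, and everything else is assumed.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

// Sketch reconstructed from the flink.* HMS properties above; not the user's
// exact statement. The hive_sync.* options are the Hudi Flink connector's
// standard HMS sync settings.
public class FlinkHudiTableSketch {
  public static void main(String[] args) {
    TableEnvironment tEnv = TableEnvironment.create(
        EnvironmentSettings.newInstance().inStreamingMode().build());
    tEnv.executeSql(
        "CREATE TABLE t1hms4 ("
            + "  uuid VARCHAR(20), name VARCHAR(10), age INT, ts TIMESTAMP(3),"
            + "  `partition` VARCHAR(20)"                                  // reserved word, backticked
            + ") PARTITIONED BY (`partition`) WITH ("
            + "  'connector' = 'hudi',"
            + "  'path' = 'abfs://fl...@test.dfs.core.windows.net/hudi/t1hms4'," // elided as in the comment
            + "  'table.type' = 'COPY_ON_WRITE',"
            + "  'hive_sync.enable' = 'true',"
            + "  'hive_sync.mode' = 'hms',"
            + "  'hive_sync.metastore.uris' = 'thrift://hive-metastore:9083'"
            + ")");
  }
}
```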
[GitHub] [hudi] hudi-bot commented on pull request #9437: [HUDI-6689] Add record index validation in MDT validator
hudi-bot commented on PR #9437: URL: https://github.com/apache/hudi/pull/9437#issuecomment-1678003863 ## CI report: * 699793358327fe0caf4df52a0ee199a9c54ab58d Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19290) * b25b5402c1e3e14264c6bbfd38910f4b93b8a871 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19291) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org