[GitHub] [hudi] danny0405 commented on a diff in pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.

2023-08-14 Thread via GitHub


danny0405 commented on code in PR #9223:
URL: https://github.com/apache/hudi/pull/9223#discussion_r1294257025


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -87,14 +88,15 @@
 import java.util.LinkedList;
 import java.util.List;
 import java.util.Map;
-import java.util.Objects;
 import java.util.Set;
 import java.util.function.BiFunction;
 import java.util.function.Function;
 import java.util.stream.Collector;
 import java.util.stream.Collectors;
 import java.util.stream.Stream;
 
+import scala.Tuple3;
+

Review Comment:
   Can we use `org.apache.hudi.common.util.collection.Triple` instead?
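   For reference, a minimal sketch of the suggested swap, assuming Hudi's `Triple` mirrors the commons-lang3/`Pair`-style API (`of`, `getLeft`/`getMiddle`/`getRight`); using it keeps `hudi-common` free of the `scala-library` dependency that `scala.Tuple3` would pull in:

```java
import org.apache.hudi.common.util.collection.Triple;

// Before (requires scala-library on the classpath):
//   scala.Tuple3<String, Long, Integer> t = new scala.Tuple3<>("fileId", 42L, 7);
// After (plain Java, already shipped in hudi-common):
Triple<String, Long, Integer> t = Triple.of("fileId", 42L, 7);
String fileId = t.getLeft();   // was t._1()
Long size = t.getMiddle();     // was t._2()
Integer count = t.getRight();  // was t._3()
```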



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.

2023-08-14 Thread via GitHub


danny0405 commented on code in PR #9223:
URL: https://github.com/apache/hudi/pull/9223#discussion_r1294256449


##
hudi-common/pom.xml:
##
@@ -103,6 +103,13 @@
 
+    <dependency>
+      <groupId>org.scala-lang</groupId>
+      <artifactId>scala-library</artifactId>

Review Comment:
   I don't think we should introduce any scala dependency in the `hudi-common` module.






[GitHub] [hudi] codope commented on a diff in pull request #9433: [HUDI-6686] - Handling empty commits after s3 applyFilter api

2023-08-14 Thread via GitHub


codope commented on code in PR #9433:
URL: https://github.com/apache/hudi/pull/9433#discussion_r1294252246


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java:
##
@@ -157,26 +157,24 @@ public Pair<Option<Dataset<Row>>, String> fetchNextBatch(Option<String> lastChec
 }
 
 Dataset<Row> source = queryRunner.run(queryInfo);
-if (source.isEmpty()) {
-  LOG.info("Source of file names is empty. Returning empty result and endInstant: "
-      + queryInfo.getEndInstant());
-  return Pair.of(Option.empty(), queryInfo.getEndInstant());
-}
-
 Dataset<Row> filteredSourceData = applyFilter(source, fileFormat);
 
 LOG.info("Adjusting end checkpoint:" + queryInfo.getEndInstant() + " based on sourceLimit :" + sourceLimit);
-Pair<CloudObjectIncrCheckpoint, Dataset<Row>> checkPointAndDataset =
+Pair<CloudObjectIncrCheckpoint, Option<Dataset<Row>>> checkPointAndDataset =
     IncrSourceHelper.filterAndGenerateCheckpointBasedOnSourceLimit(
         filteredSourceData, sourceLimit, queryInfo, cloudObjectIncrCheckpoint);
+if (!checkPointAndDataset.getRight().isPresent()) {
+  LOG.info("Empty source, returning endpoint:" + queryInfo.getEndInstant());
+  return Pair.of(Option.empty(), queryInfo.getEndInstant());
+}
 LOG.info("Adjusted end checkpoint :" + checkPointAndDataset.getLeft());
 
 String s3FS = getStringWithAltKeys(props, S3_FS_PREFIX, true).toLowerCase();
 String s3Prefix = s3FS + "://";
 
 // Create S3 paths
 SerializableConfiguration serializableHadoopConf = new SerializableConfiguration(sparkContext.hadoopConfiguration());
-List<CloudObjectMetadata> cloudObjectMetadata = checkPointAndDataset.getRight()
+List<CloudObjectMetadata> cloudObjectMetadata = checkPointAndDataset.getRight().get()

Review Comment:
   Can the Option be empty or nullable? Should we check before calling get() on 
Option?
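   A minimal sketch of the guard being asked about, assuming the surrounding types from the diff (`extractMetadata` is a hypothetical helper standing in for the chained calls that follow `get()` in the patch):

```java
import org.apache.hudi.common.util.Option;
import org.apache.hudi.common.util.collection.Pair;

Option<Dataset<Row>> data = checkPointAndDataset.getRight();
if (!data.isPresent()) {
  // Empty source: short-circuit before any get() call can throw.
  return Pair.of(Option.empty(), queryInfo.getEndInstant());
}
List<CloudObjectMetadata> cloudObjectMetadata = extractMetadata(data.get());
```

   In the patch above this is what the early return achieves: `get()` is only reached on the code path where the `Option` was already checked.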






[GitHub] [hudi] codope opened a new pull request, #9448: [MINOR] Moving to 1.0.0-SNAPSHOT on master branch

2023-08-14 Thread via GitHub


codope opened a new pull request, #9448:
URL: https://github.com/apache/hudi/pull/9448

   ### Change Logs
   
   Changed pom version to `1.0.0-SNAPSHOT`.
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] hudi-bot commented on pull request #9447: [MINOR] Infer the preCombine field only if the value is not null

2023-08-14 Thread via GitHub


hudi-bot commented on PR #9447:
URL: https://github.com/apache/hudi/pull/9447#issuecomment-1678480510

   
   ## CI report:
   
   * c181bd4a3fa227cef4ab96457c38d9b207b6a981 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19298)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9447: [MINOR] Infer the preCombine field only if the value is not null

2023-08-14 Thread via GitHub


hudi-bot commented on PR #9447:
URL: https://github.com/apache/hudi/pull/9447#issuecomment-1678475353

   
   ## CI report:
   
   * c181bd4a3fa227cef4ab96457c38d9b207b6a981 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9437: [HUDI-6689] Add record index validation in MDT validator

2023-08-14 Thread via GitHub


hudi-bot commented on PR #9437:
URL: https://github.com/apache/hudi/pull/9437#issuecomment-1678469976

   
   ## CI report:
   
   * 0cc0c34422625e63bf9e421d73c22959b7cc9916 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19296)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 opened a new pull request, #9447: [MINOR] Infer the preCombine field only if the value is not null

2023-08-14 Thread via GitHub


danny0405 opened a new pull request, #9447:
URL: https://github.com/apache/hudi/pull/9447

   ### Change Logs
   
   Table created by Spark may not have the preCombine field set up.
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] ksoullpwk commented on issue #9440: [SUPPORT] Trino cannot read when there is replacecommit metadata

2023-08-14 Thread via GitHub


ksoullpwk commented on issue #9440:
URL: https://github.com/apache/hudi/issues/9440#issuecomment-1678460113

   Yes, it works. Thanks.





[GitHub] [hudi] hudi-bot commented on pull request #9434: Dummy commit to trigger CI

2023-08-14 Thread via GitHub


hudi-bot commented on PR #9434:
URL: https://github.com/apache/hudi/pull/9434#issuecomment-1678442027

   
   ## CI report:
   
   * e895bfb27350f497100c3cd50246badcba99f27d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19272)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19273)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19289)
 
   * 1728274eb5640204a88c8f8915fca62f58c1cb6a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19297)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9434: Dummy commit to trigger CI

2023-08-14 Thread via GitHub


hudi-bot commented on PR #9434:
URL: https://github.com/apache/hudi/pull/9434#issuecomment-1678437811

   
   ## CI report:
   
   * e895bfb27350f497100c3cd50246badcba99f27d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19272)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19273)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19289)
 
   * 1728274eb5640204a88c8f8915fca62f58c1cb6a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on a diff in pull request #8683: [HUDI-5533] Support spark columns comments

2023-08-14 Thread via GitHub


danny0405 commented on code in PR #8683:
URL: https://github.com/apache/hudi/pull/8683#discussion_r1294178078


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/TableOptionProperties.java:
##
@@ -184,7 +184,9 @@ public static Map<String, String> translateFlinkTableProperties2Spark(
         partitionKeys,
         sparkVersion,
         4000,
-        messageType);
+        messageType,
+        // flink does not support comment yet
+        Arrays.asList());

Review Comment:
   Collections.emptyList() ?
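   For context, a small sketch of the difference between the two, which is presumably the motivation for the suggestion:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Allocates a new varargs array and wrapper list on every call:
List<String> viaArrays = Arrays.asList();
// Returns a shared immutable singleton: no allocation, and type-safe
// for any element type:
List<String> viaCollections = Collections.emptyList();
```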






[GitHub] [hudi] Riddle4045 commented on issue #9435: [SUPPORT] Trino can't read tables created by Flink Hudi connector

2023-08-14 Thread via GitHub


Riddle4045 commented on issue #9435:
URL: https://github.com/apache/hudi/issues/9435#issuecomment-1678406201

   > > HMS props for the Hudi table creating using Flink SQL
   > 
   > You are using the Flink Hive catalog, so the tables are actually created by the Hive catalog. We have a separate Hudi Hive catalog instead; the syntax looks like:
   > 
   > ```sql
   >   CREATE CATALOG hoodie_catalog
   >   WITH (
   > 'type'='hudi',
   > 'catalog.path' = '${catalog root path}',
   > 'hive.conf.dir' = '${hive-site.xml dir}',
   > 'mode'='hms'
   >   );
   > ```
   > 
   > The error log in the JM indicates a missing calcite-core jar; you can fix it by adding it to the classpath.
   
   Thanks, I'll give it a try! 
   @danny0405, in the table definition I specified `connector=hudi`; is that not sufficient?





[hudi] branch asf-site updated: [DOCS] Updated image paths for blogs (#9446)

2023-08-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new b1b1b524bbd [DOCS] Updated image paths for blogs (#9446)
b1b1b524bbd is described below

commit b1b1b524bbde2423520d94c50d0c6a70d8a51e4c
Author: nadine farah 
AuthorDate: Mon Aug 14 21:01:52 2023 -0700

[DOCS] Updated image paths for blogs (#9446)
---
 ...ction-Techniques-and-Approaches-Using-AWS-Glue-by-Job-Target-LLC.mdx | 2 +-
 .../2023-07-21-AWS-Glue-Crawlers-now-supports-Apache-Hudi-Tables.mdx| 2 +-
 ...Hudi-Revolutionizing-Big-Data-Management-for-Real-Time-Analytics.mdx | 2 +-
 .../blog/2023-08-03-Apache-Hudi-on-AWS-Glue-A-Step-by-Step-Guide.mdx| 2 +-
 ...a-lake-Table-formats-Apache-Iceberg-vs-Apache-Hudi-vs-Delta-lake.mdx | 2 +-
 ...-09-Lakehouse-Trifecta-Delta-Lake-Apache-Iceberg-and-Apache-Hudi.mdx | 2 +-
 6 files changed, 6 insertions(+), 6 deletions(-)

diff --git 
a/website/blog/2023-07-20-Backfilling-Apache-Hudi-Tables-in-Production-Techniques-and-Approaches-Using-AWS-Glue-by-Job-Target-LLC.mdx
 
b/website/blog/2023-07-20-Backfilling-Apache-Hudi-Tables-in-Production-Techniques-and-Approaches-Using-AWS-Glue-by-Job-Target-LLC.mdx
index 683a4e22352..2f22e41379e 100644
--- 
a/website/blog/2023-07-20-Backfilling-Apache-Hudi-Tables-in-Production-Techniques-and-Approaches-Using-AWS-Glue-by-Job-Target-LLC.mdx
+++ 
b/website/blog/2023-07-20-Backfilling-Apache-Hudi-Tables-in-Production-Techniques-and-Approaches-Using-AWS-Glue-by-Job-Target-LLC.mdx
@@ -3,7 +3,7 @@ title: "Backfilling Apache Hudi Tables in Production: 
Techniques & Approaches Us
 authors:
 - name: Soumil Shah
 category: blog
-image: 
/assets/images/2023-07-20-Backfilling-Apache-Hudi-Tables-in-Production-Techniques-and-Approaches-Using-AWS-Glue-by-Job-Target-LLC.png
+image: 
/assets/images/blog/2023-07-20-Backfilling-Apache-Hudi-Tables-in-Production-Techniques-and-Approaches-Using-AWS-Glue-by-Job-Target-LLC.png
 tags:
 - blog
 - backfilling
diff --git 
a/website/blog/2023-07-21-AWS-Glue-Crawlers-now-supports-Apache-Hudi-Tables.mdx 
b/website/blog/2023-07-21-AWS-Glue-Crawlers-now-supports-Apache-Hudi-Tables.mdx
index 0d93d4be701..cb55c854070 100644
--- 
a/website/blog/2023-07-21-AWS-Glue-Crawlers-now-supports-Apache-Hudi-Tables.mdx
+++ 
b/website/blog/2023-07-21-AWS-Glue-Crawlers-now-supports-Apache-Hudi-Tables.mdx
@@ -3,7 +3,7 @@ title: "AWS Glue Crawlers now supports Apache Hudi Tables"
 authors:
 - name: AWS Team
 category: blog
-image: 
/assets/images/2023-07-21-AWS-Glue-Crawlers-now-supports-Apache-Hudi-Tables.png
+image: 
/assets/images/blog/2023-07-21-AWS-Glue-Crawlers-now-supports-Apache-Hudi-Tables.png
 tags:
 - blog
 - aws glue
diff --git 
a/website/blog/2023-07-27-Apache-Hudi-Revolutionizing-Big-Data-Management-for-Real-Time-Analytics.mdx
 
b/website/blog/2023-07-27-Apache-Hudi-Revolutionizing-Big-Data-Management-for-Real-Time-Analytics.mdx
index 08224b604e9..1dff86efb9f 100644
--- 
a/website/blog/2023-07-27-Apache-Hudi-Revolutionizing-Big-Data-Management-for-Real-Time-Analytics.mdx
+++ 
b/website/blog/2023-07-27-Apache-Hudi-Revolutionizing-Big-Data-Management-for-Real-Time-Analytics.mdx
@@ -3,7 +3,7 @@ title: "Apache Hudi: Revolutionizing Big Data Management for 
Real-Time Analytics
 authors:
 - name: Dev Jain
 category: blog
-image: 
/assets/images/2023-07-27-Apache-Hudi-Revolutionizing-Big-Data-Management-for-Real-Time-Analytics.png
+image: 
/assets/images/blog/2023-07-27-Apache-Hudi-Revolutionizing-Big-Data-Management-for-Real-Time-Analytics.png
 tags:
 - blog
 - medium
diff --git 
a/website/blog/2023-08-03-Apache-Hudi-on-AWS-Glue-A-Step-by-Step-Guide.mdx 
b/website/blog/2023-08-03-Apache-Hudi-on-AWS-Glue-A-Step-by-Step-Guide.mdx
index 4f0eab9402d..3a4d895a929 100644
--- a/website/blog/2023-08-03-Apache-Hudi-on-AWS-Glue-A-Step-by-Step-Guide.mdx
+++ b/website/blog/2023-08-03-Apache-Hudi-on-AWS-Glue-A-Step-by-Step-Guide.mdx
@@ -3,7 +3,7 @@ title: "Apache Hudi on AWS Glue: A Step-by-Step Guide"
 authors:
 - name: Dev Jain
 category: blog
-image: 
/assets/images/2023-08-03-Apache-Hudi-on-AWS-Glue-A-Step-by-Step-Guide.png
+image: 
/assets/images/blog/2023-08-03-Apache-Hudi-on-AWS-Glue-A-Step-by-Step-Guide.png
 tags:
 - blog
 - medium
diff --git 
a/website/blog/2023-08-03-Data-lake-Table-formats-Apache-Iceberg-vs-Apache-Hudi-vs-Delta-lake.mdx
 
b/website/blog/2023-08-03-Data-lake-Table-formats-Apache-Iceberg-vs-Apache-Hudi-vs-Delta-lake.mdx
index c7e017e6834..82c53b05179 100644
--- 
a/website/blog/2023-08-03-Data-lake-Table-formats-Apache-Iceberg-vs-Apache-Hudi-vs-Delta-lake.mdx
+++ 
b/website/blog/2023-08-03-Data-lake-Table-formats-Apache-Iceberg-vs-Apache-Hudi-vs-Delta-lake.mdx
@@ -3,7 +3,7 @@ title: "Data lake Table formats : Apache Iceberg vs Apache Hudi 
vs Delta lake"
 authors:
 - name: Shashwat Pandey
 category: blog
-image: 
/assets/images/2023-08-03-Data-lake-Tabl

[GitHub] [hudi] yihua merged pull request #9446: [DOCS] Updated image paths for blogs

2023-08-14 Thread via GitHub


yihua merged PR #9446:
URL: https://github.com/apache/hudi/pull/9446





[jira] [Updated] (HUDI-6654) Add new log block header type to store record positions

2023-08-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6654:

Epic Link: HUDI-6242

> Add new log block header type to store record positions
> ---
>
> Key: HUDI-6654
> URL: https://issues.apache.org/jira/browse/HUDI-6654
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0, 1.0.0
>
>
> To support position-based merging of base and log files, we need to encode 
> positions in the log blocks so that the positions can be used directly, 
> without having to deserialize records or delete keys for OverwriteWithLatest 
> payload, or with ordering values required only for 
> `DefaultHoodieRecordPayload` supporting event time based streaming.  We add a 
> new `HeaderMetadataType` to store the positions in the log block header.
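
For illustration, a sketch of how a writer might populate such a header entry; the `RECORD_POSITIONS` constant name and the CSV encoding below are assumptions, since the actual name and serialization are defined by the PR:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.hudi.common.table.log.block.HoodieLogBlock.HeaderMetadataType;

Map<HeaderMetadataType, String> header = new HashMap<>();
header.put(HeaderMetadataType.INSTANT_TIME, instantTime);
// Hypothetical new header type carrying the positions of the affected
// records in the base file, so a reader can merge by position without
// deserializing record keys:
header.put(HeaderMetadataType.RECORD_POSITIONS, "3,17,4096");
```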



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6654) Add new log block header type to store record positions

2023-08-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6654:

Fix Version/s: 1.0.0

> Add new log block header type to store record positions
> ---
>
> Key: HUDI-6654
> URL: https://issues.apache.org/jira/browse/HUDI-6654
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0, 1.0.0
>
>
> To support position-based merging of base and log files, we need to encode 
> positions in the log blocks so that the positions can be used directly, 
> without having to deserialize records or delete keys for OverwriteWithLatest 
> payload, or with ordering values required only for 
> `DefaultHoodieRecordPayload` supporting event time based streaming.  We add a 
> new `HeaderMetadataType` to store the positions in the log block header.





[jira] [Closed] (HUDI-6654) Add new log block header type to store record positions

2023-08-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-6654.
---
Resolution: Fixed

> Add new log block header type to store record positions
> ---
>
> Key: HUDI-6654
> URL: https://issues.apache.org/jira/browse/HUDI-6654
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> To support position-based merging of base and log files, we need to encode 
> positions in the log blocks so that the positions can be used directly, 
> without having to deserialize records or delete keys for OverwriteWithLatest 
> payload, or with ordering values required only for 
> `DefaultHoodieRecordPayload` supporting event time based streaming.  We add a 
> new `HeaderMetadataType` to store the positions in the log block header.





[GitHub] [hudi] danny0405 commented on issue #9435: [SUPPORT] Trino can't read tables created by Flink Hudi connector

2023-08-14 Thread via GitHub


danny0405 commented on issue #9435:
URL: https://github.com/apache/hudi/issues/9435#issuecomment-1678355424

   > HMS props for the Hudi table creating using Flink SQL
   
   You are using the Flink Hive catalog, so the tables are actually created by the Hive catalog. We have a separate Hudi Hive catalog instead; the syntax looks like:
   
   ```sql
     CREATE CATALOG hoodie_catalog
     WITH (
       'type'='hudi',
       'catalog.path' = '${catalog root path}',
       'hive.conf.dir' = '${hive-site.xml dir}',
       'mode'='hms'
     );
   ```
   
   The error log in the JM indicates a missing calcite-core jar; you can fix it by adding it to the classpath.





[GitHub] [hudi] danny0405 commented on issue #8848: [SUPPORT] Hive Sync tool fails to sync Hoodi table written using Flink 1.16 to HMS

2023-08-14 Thread via GitHub


danny0405 commented on issue #8848:
URL: https://github.com/apache/hudi/issues/8848#issuecomment-1678352784

   In principle, we do not package any hadoop-related jars into the bundle jar; the classpath of the runtime env should include them.





[hudi] branch asf-site updated: [HUDI-6685] Fix code typo in pyspark 'Insert Overwrite' section of Quick Start Guide. (#9432)

2023-08-14 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new b0e57453d3a [HUDI-6685] Fix code typo in pyspark 'Insert Overwrite' 
section of Quick Start Guide. (#9432)
b0e57453d3a is described below

commit b0e57453d3aa1393838e177cfa15a18217da9629
Author: Amrish Lal 
AuthorDate: Mon Aug 14 19:45:54 2023 -0700

[HUDI-6685] Fix code typo in pyspark 'Insert Overwrite' section of Quick 
Start Guide. (#9432)
---
 website/docs/quick-start-guide.md  | 4 ++--
 website/versioned_docs/version-0.12.0/quick-start-guide.md | 4 ++--
 website/versioned_docs/version-0.12.1/quick-start-guide.md | 4 ++--
 website/versioned_docs/version-0.12.2/quick-start-guide.md | 4 ++--
 website/versioned_docs/version-0.12.3/quick-start-guide.md | 4 ++--
 website/versioned_docs/version-0.13.0/quick-start-guide.md | 4 ++--
 website/versioned_docs/version-0.13.1/quick-start-guide.md | 4 ++--
 7 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/website/docs/quick-start-guide.md 
b/website/docs/quick-start-guide.md
index 4e6a6e55e5c..3cad1cadc3e 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -1573,11 +1573,11 @@ spark.
 
 ```python
 # pyspark
-self.spark.read.format("hudi"). \
+spark.read.format("hudi"). \
 load(basePath). \
 select(["uuid", "partitionpath"]). \
 sort(["partitionpath", "uuid"]). \
-show(n=100, truncate=False) \
+show(n=100, truncate=False)
 
 inserts = 
sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
 
 df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)). \
diff --git a/website/versioned_docs/version-0.12.0/quick-start-guide.md 
b/website/versioned_docs/version-0.12.0/quick-start-guide.md
index 9a18bcf358e..73df9aac567 100644
--- a/website/versioned_docs/version-0.12.0/quick-start-guide.md
+++ b/website/versioned_docs/version-0.12.0/quick-start-guide.md
@@ -1443,11 +1443,11 @@ spark.
 
 ```python
 # pyspark
-self.spark.read.format("hudi"). \
+spark.read.format("hudi"). \
 load(basePath). \
 select(["uuid", "partitionpath"]). \
 sort(["partitionpath", "uuid"]). \
-show(n=100, truncate=False) \
+show(n=100, truncate=False)
 
 inserts = 
sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
 
 df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)). \
diff --git a/website/versioned_docs/version-0.12.1/quick-start-guide.md 
b/website/versioned_docs/version-0.12.1/quick-start-guide.md
index 8f5fc45cd3d..60658958a60 100644
--- a/website/versioned_docs/version-0.12.1/quick-start-guide.md
+++ b/website/versioned_docs/version-0.12.1/quick-start-guide.md
@@ -1443,11 +1443,11 @@ spark.
 
 ```python
 # pyspark
-self.spark.read.format("hudi"). \
+spark.read.format("hudi"). \
 load(basePath). \
 select(["uuid", "partitionpath"]). \
 sort(["partitionpath", "uuid"]). \
-show(n=100, truncate=False) \
+show(n=100, truncate=False)
 
 inserts = 
sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
 
 df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)). \
diff --git a/website/versioned_docs/version-0.12.2/quick-start-guide.md 
b/website/versioned_docs/version-0.12.2/quick-start-guide.md
index e0f3e60554d..0a4eda6cbe0 100644
--- a/website/versioned_docs/version-0.12.2/quick-start-guide.md
+++ b/website/versioned_docs/version-0.12.2/quick-start-guide.md
@@ -1475,11 +1475,11 @@ spark.
 
 ```python
 # pyspark
-self.spark.read.format("hudi"). \
+spark.read.format("hudi"). \
 load(basePath). \
 select(["uuid", "partitionpath"]). \
 sort(["partitionpath", "uuid"]). \
-show(n=100, truncate=False) \
+show(n=100, truncate=False)
 
 inserts = 
sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
 
 df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)). \
diff --git a/website/versioned_docs/version-0.12.3/quick-start-guide.md 
b/website/versioned_docs/version-0.12.3/quick-start-guide.md
index f21a01bd8ac..0df6150d905 100644
--- a/website/versioned_docs/version-0.12.3/quick-start-guide.md
+++ b/website/versioned_docs/version-0.12.3/quick-start-guide.md
@@ -1475,11 +1475,11 @@ spark.
 
 ```python
 # pyspark
-self.spark.read.format("hudi"). \
+spark.read.format("hudi"). \
 load(basePath). \
 select(["uuid", "partitionpath"]). \
 sort(["partitionpath", "uuid"]). \
-show(n=100, truncate=False) \
+show(n=100, truncate=False)
 
 inserts = 
sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
 
 df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)). \
diff --git a/website/versioned_docs/version-0.13.0/quick-start-guide.md 
b/website/versi

[GitHub] [hudi] nsivabalan merged pull request #9432: [HUDI-6685] Fix code typo in pyspark 'Insert Overwrite' section of Quick Start Guide.

2023-08-14 Thread via GitHub


nsivabalan merged PR #9432:
URL: https://github.com/apache/hudi/pull/9432





[GitHub] [hudi] hudi-bot commented on pull request #9437: [HUDI-6689] Add record index validation in MDT validator

2023-08-14 Thread via GitHub


hudi-bot commented on PR #9437:
URL: https://github.com/apache/hudi/pull/9437#issuecomment-1678342431

   
   ## CI report:
   
   * b25b5402c1e3e14264c6bbfd38910f4b93b8a871 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19291)
 
   * 0cc0c34422625e63bf9e421d73c22959b7cc9916 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19296)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] yihua commented on a diff in pull request #9437: [HUDI-6689] Add record index validation in MDT validator

2023-08-14 Thread via GitHub


yihua commented on code in PR #9437:
URL: https://github.com/apache/hudi/pull/9437#discussion_r1294125375


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java:
##
@@ -741,6 +791,116 @@ private void validateBloomFilters(
     validate(metadataBasedBloomFilters, fsBasedBloomFilters, partitionPath, "bloom filters");
   }
 
+  private void validateRecordIndex(HoodieSparkEngineContext sparkEngineContext,
+                                   HoodieTableMetaClient metaClient,
+                                   HoodieTableMetadata tableMetadata) {
+    if (cfg.validateRecordIndexContent) {
+      validateRecordIndexContent(sparkEngineContext, metaClient, tableMetadata);
+    } else if (cfg.validateRecordIndexCount) {
+      validateRecordIndexCount(sparkEngineContext, metaClient);
+    }
+  }
+
+  private void validateRecordIndexCount(HoodieSparkEngineContext sparkEngineContext,
+                                        HoodieTableMetaClient metaClient) {
+    String basePath = metaClient.getBasePathV2().toString();
+    long countKeyFromTable = sparkEngineContext.getSqlContext().read().format("hudi")
+        .load(basePath)
+        .select(RECORD_KEY_METADATA_FIELD)
+        .distinct()
+        .count();
+    long countKeyFromRecordIndex = sparkEngineContext.getSqlContext().read().format("hudi")
+        .load(getMetadataTableBasePath(basePath))
+        .select("key")
+        .filter("type = 5")
+        .distinct()
Review Comment:
   The `distinct()` operation is removed.






[GitHub] [hudi] boneanxs commented on a diff in pull request #9408: [HUDI-6671] Support 'alter table add partition' sql

2023-08-14 Thread via GitHub


boneanxs commented on code in PR #9408:
URL: https://github.com/apache/hudi/pull/9408#discussion_r1294125769


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/AlterHoodieTableAddPartitionCommand.scala:
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command
+
+import org.apache.hadoop.fs.Path
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.HoodiePartitionMetadata
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline
+import org.apache.spark.sql.{AnalysisException, Row, SparkSession}
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec
+import org.apache.spark.sql.catalyst.catalog.{CatalogTablePartition, 
HoodieCatalogTable}
+import org.apache.spark.sql.execution.command.DDLUtils
+import org.apache.spark.sql.hudi.HoodieSqlCommonUtils.{makePartitionPath, 
normalizePartitionSpec}
+
+case class AlterHoodieTableAddPartitionCommand(
+   tableIdentifier: TableIdentifier,
+   partitionSpecsAndLocs: Seq[(TablePartitionSpec, Option[String])],
+   ifNotExists: Boolean)
+  extends HoodieLeafRunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+val fullTableName = s"${tableIdentifier.database}.${tableIdentifier.table}"
+logInfo(s"start execute alter table add partition command for 
$fullTableName")
+
+val hoodieCatalogTable = HoodieCatalogTable(sparkSession, tableIdentifier)
+
+if (!hoodieCatalogTable.isPartitionedTable) {
+  throw new AnalysisException(s"$fullTableName is a non-partitioned table 
that is not allowed to add partition")
+}
+
+val catalog = sparkSession.sessionState.catalog
+val table = hoodieCatalogTable.table
+DDLUtils.verifyAlterTableType(catalog, table, isView = false)
+
+val normalizedSpecs: Seq[Map[String, String]] = partitionSpecsAndLocs.map 
{ case (spec, location) =>
+  if (location.isDefined) {
+throw new AnalysisException(s"Hoodie table does not support specify 
partition location explicitly")
+  }
+  normalizePartitionSpec(
+spec,
+hoodieCatalogTable.partitionFields,
+hoodieCatalogTable.tableName,
+sparkSession.sessionState.conf.resolver)
+}
+
+val basePath = new Path(hoodieCatalogTable.tableLocation)
+val fileSystem = hoodieCatalogTable.metaClient.getFs
+val instantTime = HoodieActiveTimeline.createNewInstantTime
+val format = hoodieCatalogTable.tableConfig.getPartitionMetafileFormat
+val (partitionMetadata, parts) = normalizedSpecs.map { spec =>
+  val partitionPath = makePartitionPath(hoodieCatalogTable, spec)
+  val fullPartitionPath: Path = FSUtils.getPartitionPath(basePath, 
partitionPath)
+  val metadata = if 
(HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, fullPartitionPath)) {
+if (!ifNotExists) {
+  throw new AnalysisException(s"Partition metadata already exists for 
path: $fullPartitionPath")
+}
+None
+  } else Some(new HoodiePartitionMetadata(fileSystem, instantTime, 
basePath, fullPartitionPath, format))
+  (metadata, CatalogTablePartition(spec, table.storage.copy(locationUri = 
Some(fullPartitionPath.toUri
+}.unzip
+partitionMetadata.flatten.foreach(_.trySave(0))
+
+// Sync new partitions in batch, enable ignoreIfExists to avoid sync 
failure.
+val batchSize = 
sparkSession.sparkContext.conf.getInt("spark.sql.addPartitionInBatch.size", 100)
+parts.toIterator.grouped(batchSize).foreach { batch =>

Review Comment:
   ping @danny0405 any thoughts on this? I see some commands catch the exception and only print a warning (like `CreateHoodieTableCommand`), and some commands throw the exception out (like `DropHoodieTableCommand`, `CreateHoodieTableAsSelectCommand`).
   It looks like we don't have any standard for whether to throw an exception when syncing to HMS fails.
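   For concreteness, the two patterns being contrasted, as a hedged Java-style sketch (`syncToHms` and `failOnSyncError` are illustrative names, not the actual command code):

```java
try {
  syncToHms(partitions);
} catch (Exception e) {
  if (failOnSyncError) {
    // DropHoodieTableCommand / CreateHoodieTableAsSelectCommand style:
    // surface the failure to the caller.
    throw new RuntimeException("Failed to sync partitions to HMS", e);
  }
  // CreateHoodieTableCommand style: log and continue, leaving the table
  // itself intact even if the catalog is momentarily stale.
  LOG.warn("Failed to sync partitions to HMS, continuing", e);
}
```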




[GitHub] [hudi] yihua commented on a diff in pull request #9437: [HUDI-6689] Add record index validation in MDT validator

2023-08-14 Thread via GitHub


yihua commented on code in PR #9437:
URL: https://github.com/apache/hudi/pull/9437#discussion_r1294125576


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java:
##
@@ -741,6 +791,116 @@ private void validateBloomFilters(
     validate(metadataBasedBloomFilters, fsBasedBloomFilters, partitionPath, "bloom filters");
   }
 
+  private void validateRecordIndex(HoodieSparkEngineContext sparkEngineContext,
+                                   HoodieTableMetaClient metaClient,
+                                   HoodieTableMetadata tableMetadata) {
+    if (cfg.validateRecordIndexContent) {
+      validateRecordIndexContent(sparkEngineContext, metaClient, tableMetadata);
+    } else if (cfg.validateRecordIndexCount) {
+      validateRecordIndexCount(sparkEngineContext, metaClient);
+    }
+  }
+
+  private void validateRecordIndexCount(HoodieSparkEngineContext sparkEngineContext,
+                                        HoodieTableMetaClient metaClient) {
+    String basePath = metaClient.getBasePathV2().toString();
+    long countKeyFromTable = sparkEngineContext.getSqlContext().read().format("hudi")
+        .load(basePath)
+        .select(RECORD_KEY_METADATA_FIELD)
+        .distinct()
+        .count();
+    long countKeyFromRecordIndex = sparkEngineContext.getSqlContext().read().format("hudi")
+        .load(getMetadataTableBasePath(basePath))
+        .select("key")
+        .filter("type = 5")
+        .distinct()
+        .count();
+
+    if (countKeyFromTable != countKeyFromRecordIndex) {
+      String message = String.format("Validation of record index count failed: "
+          + "%s entries from record index metadata, %s keys from the data table.",
+          countKeyFromRecordIndex, countKeyFromTable);
+      LOG.error(message);
+      throw new HoodieValidationException(message);
+    } else {
+      LOG.info(String.format(
+          "Validation of record index count succeeded: %s entries.", countKeyFromRecordIndex));
+    }
+  }
+
+  private void validateRecordIndexContent(HoodieSparkEngineContext sparkEngineContext,
+                                          HoodieTableMetaClient metaClient,
+                                          HoodieTableMetadata tableMetadata) {
+    String basePath = metaClient.getBasePathV2().toString();
+    JavaPairRDD<String, Pair<String, String>> keyToLocationOnFsRdd =
+        sparkEngineContext.getSqlContext().read().format("hudi").load(basePath)
+            .select(RECORD_KEY_METADATA_FIELD, PARTITION_PATH_METADATA_FIELD, FILENAME_METADATA_FIELD)
+            .toJavaRDD()
+            .mapToPair(row -> new Tuple2<>(row.getString(row.fieldIndex(RECORD_KEY_METADATA_FIELD)),
+                Pair.of(row.getString(row.fieldIndex(PARTITION_PATH_METADATA_FIELD)),
+                    FSUtils.getFileId(row.getString(row.fieldIndex(FILENAME_METADATA_FIELD))))))
+            .cache();
+
+    JavaPairRDD<String, Pair<String, String>> keyToLocationFromRecordIndexRdd =
+        sparkEngineContext.getSqlContext().read().format("hudi")
+            .load(getMetadataTableBasePath(basePath))
+            .filter("type = 5")
+            .select(functions.col("key"),
+                functions.col("recordIndexMetadata.partitionName").as("partitionName"),
+                functions.col("recordIndexMetadata.fileIdHighBits").as("fileIdHighBits"),
+                functions.col("recordIndexMetadata.fileIdLowBits").as("fileIdLowBits"),
+                functions.col("recordIndexMetadata.fileIndex").as("fileIndex"),
+                functions.col("recordIndexMetadata.fileId").as("fileId"),
+                functions.col("recordIndexMetadata.instantTime").as("instantTime"),
+                functions.col("recordIndexMetadata.fileIdEncoding").as("fileIdEncoding"))
+            .toJavaRDD()
+            .mapToPair(row -> {
+              HoodieRecordGlobalLocation location = HoodieTableMetadataUtil.getLocationFromRecordIndexInfo(
+                  row.getString(row.fieldIndex("partitionName")),
+                  row.getInt(row.fieldIndex("fileIdEncoding")),
+                  row.getLong(row.fieldIndex("fileIdHighBits")),
+                  row.getLong(row.fieldIndex("fileIdLowBits")),
+                  row.getInt(row.fieldIndex("fileIndex")),
+                  row.getString(row.fieldIndex("fileId")),
+                  row.getLong(row.fieldIndex("instantTime")));
+              return new Tuple2<>(row.getString(row.fieldIndex("key")),
+                  Pair.of(location.getPartitionPath(), location.getFileId()));
+            });
+
+    long diffCount = keyToLocationOnFsRdd.fullOuterJoin(keyToLocationFromRecordIndexRdd, cfg.recordIndexParallelism)
+        .map(e -> {
+          Optional<Pair<String, String>> locationOnFs = e._2._1;
+          Optional<Pair<String, String>> locationFromRecordIndex = e._2._2;
+          if (locationOnFs.isPresent() && locationFromRecordIndex.isPresent()) {
+            if (locationOnFs.get().getLeft().equals(locationFromRecordIndex.get().getLeft())
+   

[GitHub] [hudi] codope commented on issue #9440: [SUPPORT] Trino cannot read when there is replacecommit metadata

2023-08-14 Thread via GitHub


codope commented on issue #9440:
URL: https://github.com/apache/hudi/issues/9440#issuecomment-1678338350

   @ksoullpwk Thanks for the diagnosis. Could you check if this fix helps you? 
https://github.com/trinodb/trino/pull/18213





[GitHub] [hudi] hudi-bot commented on pull request #9437: [HUDI-6689] Add record index validation in MDT validator

2023-08-14 Thread via GitHub


hudi-bot commented on PR #9437:
URL: https://github.com/apache/hudi/pull/9437#issuecomment-1678336975

   
   ## CI report:
   
   * b25b5402c1e3e14264c6bbfd38910f4b93b8a871 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19291)
 
   * 0cc0c34422625e63bf9e421d73c22959b7cc9916 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on pull request #9444: [HUDI-6692] Do not allow switching from Primary keyed table to primary key less table

2023-08-14 Thread via GitHub


danny0405 commented on PR #9444:
URL: https://github.com/apache/hudi/pull/9444#issuecomment-1678331086

   > If a write to a table with a pk was missing the recordkey field in options it would think it was a pkless write. Now it fails.
   
   I'm confused: if we already know it is a table with a pk, can we just use the field from the table config as the record key by default? Then we would not treat it as a pk-less table.
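   A sketch of the fallback being suggested, with illustrative names (`RECORDKEY_FIELD` and `getRecordKeyFieldProp` are stand-ins for the real write-option key and table-config accessor):

```java
// If the write options omit the record key but the table config has one,
// reuse the table config value instead of treating the write as pk-less.
String recordKey = writeOptions.containsKey(RECORDKEY_FIELD)
    ? writeOptions.get(RECORDKEY_FIELD)
    : tableConfig.getRecordKeyFieldProp();
boolean pkless = recordKey == null || recordKey.isEmpty();
```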





[hudi] branch asf-site updated: [HUDI-6676][DOCS] Add command for CreateHoodieTableLike (#9441)

2023-08-14 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new f138422fcb9 [HUDI-6676][DOCS] Add command for CreateHoodieTableLike 
(#9441)
f138422fcb9 is described below

commit f138422fcb94f54d0d0431f81766b64af5a9d519
Author: Rex(Hui) An 
AuthorDate: Tue Aug 15 10:08:54 2023 +0800

[HUDI-6676][DOCS] Add command for CreateHoodieTableLike (#9441)


Co-authored-by: Hussein Awala 
---
 website/docs/quick-start-guide.md | 62 +++
 1 file changed, 62 insertions(+)

diff --git a/website/docs/quick-start-guide.md 
b/website/docs/quick-start-guide.md
index a23ce275394..4e6a6e55e5c 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -384,6 +384,68 @@ create table hudi_ctas_cow_pt_tbl2 using hudi location 
'file:/tmp/hudi/hudi_tbl/
 partitioned by (datestr) as select * from parquet_mngd;
 ```
 
+**CREATE TABLE LIKE**
+
+The `CREATE TABLE LIKE` statement allows you to create a new Hudi table with 
the same schema and properties from an existing Hudi/hive table.
+
+:::note
+This feature is available in Apache Hudi for Spark 3 and later versions.
+:::
+
+Examples Create a HUDI table from an existing HUDI table
+
+```sql
+# create a source hudi table
+create table source_hudi (
+  id int,
+  name string,
+  price double,
+  ts long
+) using hudi
+tblproperties (
+  primaryKey = 'id,name',
+  type = 'cow'
+ );
+
+# create a new hudi table based on the source table
+create table target_hudi1
+like source_hudi
+using hudi;
+
+# create a new hudi table based on the source table with override options
+create table target_hudi2
+like source_hudi
+using hudi
+tblproperties (primaryKey = 'id');
+
+# create a new external hudi table based on the source table with location
+create table target_hudi3
+like source_hudi
+using hudi
+location 'file:/tmp/hudi/target_hudi3/';
+```
+
+Examples Create a HUDI table from an existing parquet table
+
+```sql
+# create a source parquet table
+create table source_parquet (
+  id int,
+  name string,
+  price double,
+  ts long
+) using parquet;
+
+# create a new hudi table based on the source table
+create table target_hudi1
+like source_parquet
+using hudi
+tblproperties (
+ primaryKey = 'id,name',
+ type = 'cow'
+);
+```
+
 **Create Table Properties**
 
 Users can set table properties while creating a hudi table. Critical options 
are listed here.



[GitHub] [hudi] danny0405 merged pull request #9441: [HUDI-6676][DOCS] Add command for CreateHoodieTableLike

2023-08-14 Thread via GitHub


danny0405 merged PR #9441:
URL: https://github.com/apache/hudi/pull/9441





[jira] [Closed] (HUDI-6683) Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource

2023-08-14 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6683.

Resolution: Fixed

Fixed via master branch: 4099e1d18b78583d739fdb252f85b58d991d2fb0

> Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource
> 
>
> Key: HUDI-6683
> URL: https://issues.apache.org/jira/browse/HUDI-6683
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: deltastreamer
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>






[hudi] branch master updated: [HUDI-6683] Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource (#9403)

2023-08-14 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 4099e1d18b7 [HUDI-6683] Added kafka key as part of hudi metadata 
columns for Json & Avro KafkaSource (#9403)
4099e1d18b7 is described below

commit 4099e1d18b78583d739fdb252f85b58d991d2fb0
Author: Prathit malik <53890994+prathi...@users.noreply.github.com>
AuthorDate: Tue Aug 15 07:37:26 2023 +0530

[HUDI-6683] Added kafka key as part of hudi metadata columns for Json & 
Avro KafkaSource (#9403)
---
 .../hudi/utilities/schema/KafkaOffsetPostProcessor.java   |  6 +-
 .../org/apache/hudi/utilities/sources/JsonKafkaSource.java|  3 +++
 .../apache/hudi/utilities/sources/helpers/AvroConvertor.java  |  3 +++
 .../apache/hudi/utilities/sources/TestAvroKafkaSource.java| 11 ++-
 .../apache/hudi/utilities/sources/TestJsonKafkaSource.java|  9 +
 5 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/KafkaOffsetPostProcessor.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/KafkaOffsetPostProcessor.java
index 63473c3bce8..500bb0c7f99 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/KafkaOffsetPostProcessor.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/KafkaOffsetPostProcessor.java
@@ -18,6 +18,7 @@
 
 package org.apache.hudi.utilities.schema;
 
+import org.apache.avro.JsonProperties;
 import org.apache.hudi.common.config.ConfigProperty;
 import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.internal.schema.HoodieSchemaException;
@@ -31,6 +32,7 @@ import org.slf4j.LoggerFactory;
 import java.util.List;
 import java.util.stream.Collectors;
 
+import static org.apache.hudi.avro.AvroSchemaUtils.createNullableSchema;
 import static org.apache.hudi.common.util.ConfigUtils.getBooleanWithAltKeys;
 
 /**
@@ -54,6 +56,7 @@ public class KafkaOffsetPostProcessor extends SchemaPostProcessor {
   public static final String KAFKA_SOURCE_OFFSET_COLUMN = "_hoodie_kafka_source_offset";
   public static final String KAFKA_SOURCE_PARTITION_COLUMN = "_hoodie_kafka_source_partition";
   public static final String KAFKA_SOURCE_TIMESTAMP_COLUMN = "_hoodie_kafka_source_timestamp";
+  public static final String KAFKA_SOURCE_KEY_COLUMN = "_hoodie_kafka_source_key";
 
   public KafkaOffsetPostProcessor(TypedProperties props, JavaSparkContext jssc) {
     super(props, jssc);
@@ -61,7 +64,7 @@ public class KafkaOffsetPostProcessor extends SchemaPostProcessor {
 
   @Override
   public Schema processSchema(Schema schema) {
-    // this method adds kafka offset fields namely source offset, partition and timestamp to the schema of the batch.
+    // this method adds kafka offset fields namely source offset, partition, timestamp and kafka message key to the schema of the batch.
     try {
       List<Schema.Field> fieldList = schema.getFields();
       List<Schema.Field> newFieldList = fieldList.stream()
@@ -69,6 +72,7 @@ public class KafkaOffsetPostProcessor extends SchemaPostProcessor {
       newFieldList.add(new Schema.Field(KAFKA_SOURCE_OFFSET_COLUMN, Schema.create(Schema.Type.LONG), "offset column", 0));
       newFieldList.add(new Schema.Field(KAFKA_SOURCE_PARTITION_COLUMN, Schema.create(Schema.Type.INT), "partition column", 0));
       newFieldList.add(new Schema.Field(KAFKA_SOURCE_TIMESTAMP_COLUMN, Schema.create(Schema.Type.LONG), "timestamp column", 0));
+      newFieldList.add(new Schema.Field(KAFKA_SOURCE_KEY_COLUMN, createNullableSchema(Schema.Type.STRING), "kafka key column", JsonProperties.NULL_VALUE));
       Schema newSchema = Schema.createRecord(schema.getName() + "_processed", schema.getDoc(), schema.getNamespace(), false, newFieldList);
       return newSchema;
     } catch (Exception e) {
diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java
index 775bd095fe0..de67dc171a9 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java
@@ -47,6 +47,7 @@ import static org.apache.hudi.common.util.ConfigUtils.getStringWithAltKeys;
 import static org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SOURCE_OFFSET_COLUMN;
 import static org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SOURCE_PARTITION_COLUMN;
 import static org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SOURCE_TIMESTAMP_COLUMN;
+import static org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SOURCE_KEY_COLUMN;
 
 /**
  * Read json kafka data.
@@ -80,11 +81,13 @@ public class JsonKafkaSource extends KafkaSource {

[GitHub] [hudi] danny0405 merged pull request #9403: [HUDI-6683] Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource

2023-08-14 Thread via GitHub


danny0405 merged PR #9403:
URL: https://github.com/apache/hudi/pull/9403





[GitHub] [hudi] danny0405 commented on pull request #9403: [HUDI-6683] Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource

2023-08-14 Thread via GitHub


danny0405 commented on PR #9403:
URL: https://github.com/apache/hudi/pull/9403#issuecomment-1678327812

   Thanks for the nice feedback @hussein-awala; maybe you can file a separate PR to address it.





[jira] [Updated] (HUDI-6683) Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6683:
-
Fix Version/s: 1.1.0
   (was: 1.0.0)

> Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource
> 
>
> Key: HUDI-6683
> URL: https://issues.apache.org/jira/browse/HUDI-6683
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: deltastreamer
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6585) Certify DedupeSparkJob for both table types

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6585:
-
Fix Version/s: 1.1.0
   0.15.0
   (was: 1.0.0)

> Certify DedupeSparkJob for both table types
> ---
>
> Key: HUDI-6585
> URL: https://issues.apache.org/jira/browse/HUDI-6585
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
> Fix For: 1.1.0, 0.15.0
>
>
> Hudi has a utility `DedupeSparkJob` which can deduplicate data present in a 
> partition. Need to check if it can dedupe across table for both table types.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6586) Add Incremental scan support to dbt

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6586:
-
Fix Version/s: 0.15.0
   0.14.1

> Add Incremental scan support to dbt
> ---
>
> Key: HUDI-6586
> URL: https://issues.apache.org/jira/browse/HUDI-6586
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: connectors
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
> Fix For: 1.0.0, 0.15.0, 0.14.1
>
>
> The current dbt support adds only the basic hudi primitives, but with deeper 
> integration we could enable faster ETL queries using the incremental read 
> primitive similar to the deltastreamer support.
>  
> The goal of this epic is to enable incremental data processing for dbt.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-6490) Implement support for applying updates as deletes + inserts

2023-08-14 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754362#comment-17754362
 ] 

Vinoth Chandar commented on HUDI-6490:
--

[~tim.brown] Do you want to take this work up? This can be done even on the 0.X 
code line. 

> Implement support for applying updates as deletes + inserts
> ---
>
> Key: HUDI-6490
> URL: https://issues.apache.org/jira/browse/HUDI-6490
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: performance
>Reporter: Vinoth Chandar
>Assignee: Timothy Brown
>Priority: Major
> Fix For: 1.0.0, 0.15.0, 0.14.1
>
>
> This needs to happen at the higher layer of writing from Spark/Flink etc. 
> Hudi can already support this, by 
> - Logging delete blocks to the old file group. 
> - Writing new data blocks/base files to the new file group.
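
To make the routing described above concrete, here is a minimal Scala sketch. `RecordStub`, the file-group ids, and the return shape are illustrative stand-ins for this digest, not Hudi APIs.

{code:scala}
// Sketch only: route an update as a delete against the record's current file
// group plus an insert of the full new record into a target file group.
case class RecordStub(key: String, payload: Map[String, String])

def updateAsDeletePlusInsert(
    incoming: RecordStub,
    currentFileGroup: Option[String],
    targetFileGroup: String): (Option[(String, String)], (String, RecordStub)) = {
  // A delete block is logged to the old file group only if the key already exists there.
  val delete = currentFileGroup.map(fg => (fg, incoming.key))
  // The new version of the record lands in the target file group.
  val insert = (targetFileGroup, incoming)
  (delete, insert)
}
{code}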



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6490) Implement support for applying updates as deletes + inserts

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6490:
-
Fix Version/s: 0.15.0
   0.14.1

> Implement support for applying updates as deletes + inserts
> ---
>
> Key: HUDI-6490
> URL: https://issues.apache.org/jira/browse/HUDI-6490
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: performance
>Reporter: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0, 0.15.0, 0.14.1
>
>
> This needs to happen at the higher layer of writing from Spark/Flink etc. 
> Hudi can already support this, by 
> - Logging delete blocks to the old file group. 
> - Writing new data blocks/base files to the new file group.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-6490) Implement support for applying updates as deletes + inserts

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-6490:


Assignee: Timothy Brown

> Implement support for applying updates as deletes + inserts
> ---
>
> Key: HUDI-6490
> URL: https://issues.apache.org/jira/browse/HUDI-6490
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: performance
>Reporter: Vinoth Chandar
>Assignee: Timothy Brown
>Priority: Major
> Fix For: 1.0.0, 0.15.0, 0.14.1
>
>
> This needs to happen at the higher layer of writing from Spark/Flink etc. 
> Hudi can already support this, by 
> - Logging delete blocks to the old file group. 
> - Writing new data blocks/base files to the new file group.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6296) Add Scala 2.13 build profile to support scala 2.13

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6296:
-
Fix Version/s: 1.1.0
   (was: 1.0.0)

> Add Scala 2.13 build profile to support scala 2.13
> --
>
> Key: HUDI-6296
> URL: https://issues.apache.org/jira/browse/HUDI-6296
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Aditya Goenka
>Priority: Minor
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-6640) Non-blocking concurrency control

2023-08-14 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754361#comment-17754361
 ] 

Vinoth Chandar commented on HUDI-6640:
--

This is a duplicate of HUDI-5672 

> Non-blocking concurrency control
> 
>
> Key: HUDI-6640
> URL: https://issues.apache.org/jira/browse/HUDI-6640
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: core
>Reporter: Danny Chen
>Assignee: Jing Zhang
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1045) Support updates during clustering

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1045:
-
Epic Link: HUDI-5672  (was: HUDI-1042)

> Support updates during clustering
> -
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: leesf
>Assignee: leesf
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5672) Non-blocking multi writer support

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-5672:
-
Summary: Non-blocking multi writer support  (was: Lockless multi writer 
support)

> Non-blocking multi writer support
> -
>
> Key: HUDI-5672
> URL: https://issues.apache.org/jira/browse/HUDI-5672
> Project: Apache Hudi
>  Issue Type: Epic
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1238) [UMBRELLA] Perf test env

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1238:
-
Fix Version/s: 1.1.0
   (was: 1.0.0)

> [UMBRELLA] Perf test env
> 
>
> Key: HUDI-1238
> URL: https://issues.apache.org/jira/browse/HUDI-1238
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: performance, Testing
>Reporter: sivabalan narayanan
>Assignee: Rajesh Mahindra
>Priority: Blocker
>  Labels: hudi-umbrellas
> Fix For: 1.1.0
>
>
> We need to build a perf test environment which monitors metrics from a long 
> running test suite and displays them via dashboards. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2519) [UMBRELLA] Seamless meta sync

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2519:
-
Fix Version/s: 1.0.0
   (was: 1.1.0)

> [UMBRELLA] Seamless meta sync
> -
>
> Key: HUDI-2519
> URL: https://issues.apache.org/jira/browse/HUDI-2519
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: hive
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hive, hive3, hudi-umbrellas
> Fix For: 1.0.0
>
>
> Hudi to Hive sync is a common use case which enables querying Hudi tables 
> through other query engines that support hive connector such as Presto and 
> Trino. Currently, Hudi supports syncing to Hive asynchronously using 
> run_sync_tool or synchronously through deltastreamer.
> The goal of this umbrella JIRA is to improve the current sync mechanism and 
> support Hive3. Additionally, we need to improve the documentation around 
> different configs and sync modes. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2519) [UMBRELLA] Seamless meta sync

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2519:
-
Fix Version/s: 1.1.0
   (was: 1.0.0)

> [UMBRELLA] Seamless meta sync
> -
>
> Key: HUDI-2519
> URL: https://issues.apache.org/jira/browse/HUDI-2519
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: hive
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hive, hive3, hudi-umbrellas
> Fix For: 1.1.0
>
>
> Hudi to Hive sync is a common use case which enables querying Hudi tables 
> through other query engines that support hive connector such as Presto and 
> Trino. Currently, Hudi supports syncing to Hive asynchronously using 
> run_sync_tool or synchronously through deltastreamer.
> The goal of this umbrella JIRA is to improve the current sync mechanism and 
> support Hive3. Additionally, we need to improve the documentation around 
> different configs and sync modes. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9444: [HUDI-6692] Do not allow switching from Primary keyed table to primary key less table

2023-08-14 Thread via GitHub


hudi-bot commented on PR #9444:
URL: https://github.com/apache/hudi/pull/9444#issuecomment-167822

   
   ## CI report:
   
   * c7e99fd19a00469c0e181b6c64b63aa9cfb7ed4e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19292)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9425: When invalidate the table in the spark sql query cache, verify if the…

2023-08-14 Thread via GitHub


danny0405 commented on code in PR #9425:
URL: https://github.com/apache/hudi/pull/9425#discussion_r1294098528


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -965,8 +965,9 @@ object HoodieSparkSqlWriter {
     // we must invalidate this table in the cache so writes are reflected in later queries
     if (metaSyncEnabled) {
       getHiveTableNames(hoodieConfig).foreach(name => {
-        val qualifiedTableName = String.join(".", hoodieConfig.getStringOrDefault(HIVE_DATABASE), name)
-        if (spark.catalog.tableExists(qualifiedTableName)) {
+        val syncDb = hoodieConfig.getStringOrDefault(HIVE_DATABASE)
+        val qualifiedTableName = String.join(".", syncDb, name)

Review Comment:
   Reasonable, should we also take the default database name into consideration?
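
   For illustration, folding in the default database could look like the sketch below. The helper is hypothetical, not the actual patch; it only assumes an active `SparkSession`.

   ```scala
// Hypothetical sketch: prefer the configured sync database, otherwise fall
// back to the Spark session's current database before touching the catalog.
import org.apache.spark.sql.SparkSession

def qualifiedTableName(spark: SparkSession, syncDb: String, table: String): String = {
  val db = Option(syncDb).filter(_.nonEmpty).getOrElse(spark.catalog.currentDatabase)
  String.join(".", db, table)
}
   ```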



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-2638) Rewrite tests around Hudi index

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2638:
-
Fix Version/s: (was: 1.0.0)

> Rewrite tests around Hudi index
> ---
>
> Key: HUDI-2638
> URL: https://issues.apache.org/jira/browse/HUDI-2638
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 1.1.0
>
>
> There is duplicate code between `TestFlinkHoodieBloomIndex` and 
> `TestHoodieBloomIndex`, among other test classes.  We should do one pass to 
> clean the test code once the refactoring is done.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3121) Spark datasource with bucket index unit test reuse

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3121:
-
Fix Version/s: 1.1.0
   (was: 1.0.0)

> Spark datasource with bucket index unit test reuse
> --
>
> Key: HUDI-3121
> URL: https://issues.apache.org/jira/browse/HUDI-3121
> Project: Apache Hudi
>  Issue Type: Test
>  Components: index, tests-ci
>Reporter: XiaoyuGeng
>Priority: Major
> Fix For: 1.1.0
>
>
> let `TestMORDataSourceWithBucket` reuse existing unit tests by parameterizing them



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2638) Rewrite tests around Hudi index

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2638:
-
Fix Version/s: 1.1.0

> Rewrite tests around Hudi index
> ---
>
> Key: HUDI-2638
> URL: https://issues.apache.org/jira/browse/HUDI-2638
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 1.0.0, 1.1.0
>
>
> There is duplicate code between `TestFlinkHoodieBloomIndex` and 
> `TestHoodieBloomIndex`, among other test classes.  We should do one pass to 
> clean the test code once the refactoring is done.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1916) Create a matrix of datatypes across spark, hive, presto, Avro, parquet.

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1916:
-
Fix Version/s: 1.1.0
   (was: 1.0.0)

> Create a matrix of datatypes across spark, hive, presto, Avro, parquet. 
> 
>
> Key: HUDI-1916
> URL: https://issues.apache.org/jira/browse/HUDI-1916
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: sivabalan narayanan
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 1.1.0
>
>
> Create a matrix of datatypes across spark, hive, presto, Avro, parquet.
> Follow up with Flink. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2375) Create common SchemaProvider and RecordPayloads for spark, flink etc.

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2375:
-
Fix Version/s: 1.1.0
   (was: 1.0.0)

> Create common SchemaProvider and RecordPayloads for spark, flink etc.
> -
>
> Key: HUDI-2375
> URL: https://issues.apache.org/jira/browse/HUDI-2375
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: kafka-connect, writer-core
>Reporter: Rajesh Mahindra
>Priority: Blocker
> Fix For: 1.1.0
>
>
> Create common SchemaProvider and RecordPayloads for spark, flink etc.
> - Currently the class org.apache.hudi.utilities.schema.SchemaProvider takes 
> in input JavaSparkContext, and is specific to Spark Engine. So we have 
> created a separate SchemaProvider for flink. Now for Kafka connect, we can 
> use neither, since its neither spark nor flink. Implement a common class that 
> uses HoodieEngineContext ..



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-309:
---

Assignee: Danny Chen  (was: Balaji Varadarajan)

> General Redesign of Archived Timeline for efficient scan and management
> ---
>
> Key: HUDI-309
> URL: https://issues.apache.org/jira/browse/HUDI-309
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: Archive TImeline Notes by Vinoth 1.jpg, Archived 
> Timeline Notes by Vinoth 2.jpg
>
>
> As designed by Vinoth:
> Goals
>  # Archived Metadata should be scannable in the same way as data
>  # Provides more safety by always serving committed data independent of 
> timeframe when the corresponding commit action was tried. Currently, we 
> implicitly assume a data file to be valid if its commit time is older than 
> the earliest time in the active timeline. While this works ok, any inherent 
> bugs in rollback could inadvertently expose a possibly duplicate file when 
> its commit timestamp becomes older than that of any commits in the timeline.
>  # We had to deal with lot of corner cases because of the way we treat a 
> "commit" as special after it gets archived. Examples also include Savepoint 
> handling logic by cleaner.
>  # Small Files : For Cloud stores, archiving simply moves files from one 
> directory to another causing the archive folder to grow. We need a way to 
> efficiently compact these files and at the same time be friendly to scans
> Design:
>  The basic file-group abstraction for managing file versions for data files 
> can be extended to managing archived commit metadata. The idea is to use an 
> optimal format (like HFile) for storing compacted versions of <CommitTime, Metadata> pairs. Every archiving run will read <CommitTime, Metadata> pairs 
> from active timeline and append to indexable log files. We will run periodic 
> minor compactions to merge multiple log files to a compacted HFile storing 
> metadata for a time-range. It should be also noted that we will partition by 
> the action types (commit/clean).  This design would allow for the archived 
> timeline to be queryable for determining whether a timeline is valid or not.
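
Read as a toy illustration, the key layout this design implies could look like the following sketch. The names are invented for this digest and are not taken from the design notes.

{code:scala}
// Toy sketch: <CommitTime, Metadata> entries partitioned by action type and
// ordered by commit time, so a per-action time-range scan stays contiguous.
case class ArchivedEntry(actionType: String, commitTime: String, metadata: String)

val archived = Seq(
  ArchivedEntry("commit", "20230814101500", "..."),
  ArchivedEntry("clean",  "20230814102000", "..."),
  ArchivedEntry("commit", "20230814103000", "...")
).sortBy(e => (e.actionType, e.commitTime))

// Range scan: all commits in a window, without replaying the whole archive.
val window = archived.filter(e =>
  e.actionType == "commit" &&
    e.commitTime >= "20230814100000" && e.commitTime <= "20230814110000")
{code}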



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] danny0405 commented on pull request #9199: [HUDI-6534]Support consistent hashing row writer

2023-08-14 Thread via GitHub


danny0405 commented on PR #9199:
URL: https://github.com/apache/hudi/pull/9199#issuecomment-1678284808

   @prashantwason You can cherry pick https://github.com/apache/hudi/pull/9401


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6676) Add command for CreateHoodieTableLike

2023-08-14 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-6676:
-
Fix Version/s: 1.0.0

> Add command for CreateHoodieTableLike
> -
>
> Key: HUDI-6676
> URL: https://issues.apache.org/jira/browse/HUDI-6676
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Hui An
>Assignee: Hui An
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> 1. Create table from non-hudi table
> 2. Create table from hudi table (the properties related to Hudi in the source 
> Hudi table will be carried over)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6676) Add command for CreateHoodieTableLike

2023-08-14 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6676.

Resolution: Fixed

Fixed via master branch: 8220d23be19af4783a9a776dfffa48167975a6a2

> Add command for CreateHoodieTableLike
> -
>
> Key: HUDI-6676
> URL: https://issues.apache.org/jira/browse/HUDI-6676
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Hui An
>Assignee: Hui An
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> 1. Create table from non-hudi table
> 2. Create table from hudi table (the properties related to Hudi in the source 
> Hudi table will be carried over)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3529) Improve dependency management and bundling

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3529:
-
Fix Version/s: 1.1.0
   (was: 1.0.0)

> Improve dependency management and bundling
> --
>
> Key: HUDI-3529
> URL: https://issues.apache.org/jira/browse/HUDI-3529
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: dependencies
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] danny0405 merged pull request #9412: [HUDI-6676] Add command for CreateHoodieTableLike

2023-08-14 Thread via GitHub


danny0405 merged PR #9412:
URL: https://github.com/apache/hudi/pull/9412


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [HUDI-6676] Add command for CreateHoodieTableLike (#9412)

2023-08-14 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 8220d23be19 [HUDI-6676] Add command for CreateHoodieTableLike (#9412)
8220d23be19 is described below

commit 8220d23be19af4783a9a776dfffa48167975a6a2
Author: Rex(Hui) An 
AuthorDate: Tue Aug 15 09:02:04 2023 +0800

[HUDI-6676] Add command for CreateHoodieTableLike (#9412)

* add command for CreateHoodieTableLike
* don't support spark2
---
 .../spark/sql/HoodieCatalystPlansUtils.scala   |   7 ++
 .../org/apache/spark/sql/hudi/SparkAdapter.scala   |   8 +-
 .../apache/spark/sql/hudi/HoodieOptionConfig.scala |   8 ++
 .../command/CreateHoodieTableLikeCommand.scala | 110 
 .../spark/sql/hudi/analysis/HoodieAnalysis.scala   |  13 +-
 .../apache/spark/sql/hudi/TestCreateTable.scala| 139 +
 .../spark/sql/HoodieSpark2CatalystPlanUtils.scala  |   9 ++
 .../spark/sql/HoodieSpark3CatalystPlanUtils.scala  |  13 +-
 8 files changed, 302 insertions(+), 5 deletions(-)

diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala
index 58789681c54..9cfe23f86cc 100644
--- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala
+++ b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala
@@ -18,6 +18,7 @@
 package org.apache.spark.sql
 
 import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog.CatalogStorageFormat
 import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
 import org.apache.spark.sql.catalyst.plans.JoinType
 import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
@@ -93,6 +94,12 @@ trait HoodieCatalystPlansUtils {
    */
   def unapplyInsertIntoStatement(plan: LogicalPlan): Option[(LogicalPlan, Map[String, Option[String]], LogicalPlan, Boolean, Boolean)]
 
+  /**
+   * Decomposes [[CreateTableLikeCommand]] into its arguments allowing to accommodate for API
+   * changes in Spark 3
+   */
+  def unapplyCreateTableLikeCommand(plan: LogicalPlan): Option[(TableIdentifier, TableIdentifier, CatalogStorageFormat, Option[String], Map[String, String], Boolean)]
+
   /**
    * Rebases instance of {@code InsertIntoStatement} onto provided instance of {@code targetTable} and {@code query}
    */
diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala
index 041beba95df..1c6111afe47 100644
--- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala
+++ b/hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala
@@ -150,11 +150,11 @@ trait SparkAdapter extends Serializable {
   }
 
   def isHoodieTable(map: java.util.Map[String, String]): Boolean = {
-map.getOrDefault("provider", "").equals("hudi")
+isHoodieTable(map.getOrDefault("provider", ""))
   }
 
   def isHoodieTable(table: CatalogTable): Boolean = {
-table.provider.map(_.toLowerCase(Locale.ROOT)).orNull == "hudi"
+isHoodieTable(table.provider.map(_.toLowerCase(Locale.ROOT)).orNull)
   }
 
   def isHoodieTable(tableId: TableIdentifier, spark: SparkSession): Boolean = {
@@ -162,6 +162,10 @@ trait SparkAdapter extends Serializable {
 isHoodieTable(table)
   }
 
+  def isHoodieTable(provider: String): Boolean = {
+"hudi".equalsIgnoreCase(provider)
+  }
+
   /**
* Create instance of [[ParquetFileFormat]]
*/
diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieOptionConfig.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieOptionConfig.scala
index d715a108d62..abe98bb46cf 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieOptionConfig.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieOptionConfig.scala
@@ -182,6 +182,14 @@ object HoodieOptionConfig {
 options.filterNot(_._1.startsWith("hoodie.")).filterNot(kv => sqlOptionKeyToWriteConfigKey.contains(kv._1))
   }
 
+  /**
+   * The opposite of `deleteHoodieOptions`: this method extracts all hoodie-related
+   * options (those starting with `hoodie.`, plus all sql options).
+   */
+  def extractHoodieOptions(options: Map[String, String]): Map[String, String] = {
+    options.filter(_._1.startsWith("hoodie.")) ++ extractSqlOptions(options)
+  }
+
   // extract primaryKey, preCombineField, type options
   def extractSqlOptions(options: Map[String, String]): Map[String, String] = {
 

[jira] [Updated] (HUDI-2871) Decouple metrics dependencies from hudi-client-common

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2871:
-
Fix Version/s: 1.1.0
   (was: 1.0.0)

> Decouple metrics dependencies from hudi-client-common
> -
>
> Key: HUDI-2871
> URL: https://issues.apache.org/jira/browse/HUDI-2871
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, dependencies, metrics, writer-core
>Reporter: Vinoth Chandar
>Assignee: Sagar Sumit
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0, 1.1.0
>
>
> There is some metrics stuff - Cloudwatch, graphite, prometheus etc. - that all 
> gets pulled in. 
> It might be good to break these out into their own modules and include them during 
> packaging. This needs some way of reflection-based instantiation of the 
> Metrics reporter.
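
A minimal sketch of the reflection-based instantiation mentioned above is shown below; the `MetricsReporter` trait here is a placeholder, not Hudi's reporter interface.

{code:scala}
// Sketch: load a metrics reporter by class name so each engine bundle only
// needs the reporter implementations it actually packages.
trait MetricsReporter { def report(): Unit }

def loadReporter(className: String): MetricsReporter =
  Class.forName(className)
    .getDeclaredConstructor()
    .newInstance()
    .asInstanceOf[MetricsReporter]
{code}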



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6483) MERGE INTO should support schema evolution for partial updates.

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6483:
-
Fix Version/s: 1.1.0

> MERGE INTO should support schema evolution for partial updates.
> ---
>
> Key: HUDI-6483
> URL: https://issues.apache.org/jira/browse/HUDI-6483
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Aditya Goenka
>Priority: Major
> Fix For: 1.1.0, 0.15.0
>
>
> The following code is an example of doing MERGE INTO along with schema evolution, 
> which is not yet supported by Hudi. Currently, Hudi tries to use the target table 
> schema during MERGE INTO.
> The following code should be supported - 
> ```
> create table test_insert3 (
>     id int,
> name string,
> updated_at timestamp
> ) using hudi
> options (
>     type = 'cow',
>     primaryKey = 'id',
>     preCombineField = 'updated_at'
> ) location 'file:///tmp/test_insert3';
> merge into test_insert3 as target
> using (
>     select 1 as id, 'c' as name, 1 as new_col, current_timestamp as updated_at
> ) source
> on target.id = source.id
> when matched then update set target.new_col = source.new_col
> when not matched then insert *;
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2687) [UMBRELLA] A new Trino connector for Hudi

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2687:
-
Fix Version/s: 1.0.0
   (was: 1.1.0)

> [UMBRELLA] A new Trino connector for Hudi
> -
>
> Key: HUDI-2687
> URL: https://issues.apache.org/jira/browse/HUDI-2687
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: trino-presto
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Critical
>  Labels: hudi-umbrellas
> Fix For: 0.14.0, 1.0.0, 0.15.0
>
> Attachments: image-2021-11-05-14-16-57-324.png, 
> image-2021-11-05-14-17-03-211.png
>
>
> This JIRA tracks all the tasks related to building a new Hudi connector in 
> Trino.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1574) Trim existing unit tests to finish in much shorter amount of time

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1574:
-
Fix Version/s: 1.1.0
   (was: 1.0.0)

> Trim existing unit tests to finish in much shorter amount of time
> -
>
> Key: HUDI-1574
> URL: https://issues.apache.org/jira/browse/HUDI-1574
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: Testing, tests-ci
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Priority: Critical
> Fix For: 1.1.0, 0.15.0
>
>
> spark-client-tests
> 278.165 s - in org.apache.hudi.table.TestHoodieMergeOnReadTable
> 201.628 s - in org.apache.hudi.metadata.TestHoodieBackedMetadata
> 185.716 s - in org.apache.hudi.client.TestHoodieClientOnCopyOnWriteStorage
> 158.361 s - in org.apache.hudi.index.TestHoodieIndex
> 156.196 s - in org.apache.hudi.table.TestCleaner
> 132.369 s - in 
> org.apache.hudi.table.action.commit.TestCopyOnWriteActionExecutor
> 93.307 s - in org.apache.hudi.table.action.compact.TestAsyncCompaction
> 67.301 s - in org.apache.hudi.table.upgrade.TestUpgradeDowngrade
> 45.794 s - in org.apache.hudi.client.TestHoodieReadClient
> 38.615 s - in org.apache.hudi.index.bloom.TestHoodieBloomIndex
> 31.181 s - in org.apache.hudi.client.TestTableSchemaEvolution
> 20.072 s - in org.apache.hudi.table.action.compact.TestInlineCompaction
> grep " Time elapsed" hudi-client/hudi-spark-client/target/surefire-reports/* 
> | awk -F',' ' { print $5 } ' | awk -F':' ' { print $2 } ' | sort -nr | less
> hudi-utilities
> 209.936 s - in org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer
> 204.653 s - in 
> org.apache.hudi.utilities.functional.TestHoodieMultiTableDeltaStreamer
> 34.116 s - in org.apache.hudi.utilities.sources.TestKafkaSource
> 29.865 s - in org.apache.hudi.utilities.sources.TestParquetDFSSource
> 26.189 s - in 
> org.apache.hudi.utilities.sources.helpers.TestDatePartitionPathSelector
> Other Tests
> 42.595 s - in org.apache.hudi.common.functional.TestHoodieLogFormat
> 38.918 s - in org.apache.hudi.common.bootstrap.TestBootstrapIndex
> 22.046 s - in 
> org.apache.hudi.common.functional.TestHoodieLogFormatAppendFailure



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-1457) Add multi writing to Hudi tables using DFS based locking (only HDFS atomic renames)

2023-08-14 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754349#comment-17754349
 ] 

Vinoth Chandar commented on HUDI-1457:
--

this does not work on cloud storage, since we cannot rely just on atomic puts. 
Just a note for anyone who is picking this up.

> Add multi writing to Hudi tables using DFS based locking (only HDFS atomic 
> renames)
> ---
>
> Key: HUDI-1457
> URL: https://issues.apache.org/jira/browse/HUDI-1457
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: writer-core
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 1.1.0, 0.15.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3057) Instants should be generated strictly under locks

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3057:
-
Fix Version/s: 1.1.0

> Instants should be generated strictly under locks
> -
>
> Key: HUDI-3057
> URL: https://issues.apache.org/jira/browse/HUDI-3057
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: multi-writer, writer-core
>Reporter: Alexey Kudinkin
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: sev:high
> Fix For: 0.14.0, 1.1.0
>
> Attachments: logs.txt
>
>
> While looking into the flakiness of the tests outlined here:
> https://issues.apache.org/jira/browse/HUDI-3043
>  
> I've stumbled upon the following failure where one of the writers tries to 
> complete the commit but couldn't, because the file already exists:
> {code:java}
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: 
> org.apache.hudi.exception.HoodieIOException: Failed to create file 
> /var/folders/kb/cnff55vj041g2nnlzs5ylqk0gn/T/junit5142536255031969586/testtable_MERGE_ON_READ/.hoodie/20211217150157632.commit
>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>     at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>     at 
> org.apache.hudi.utilities.functional.TestHoodieDeltaStreamerWithMultiWriter.runJobsInParallel(TestHoodieDeltaStreamerWithMultiWriter.java:336)
>     at 
> org.apache.hudi.utilities.functional.TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWriters(TestHoodieDeltaStreamerWithMultiWriter.java:150)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
>     at 
> org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
>     at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
>     at 
> org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
>     at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
>     at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:92)
>     at 
> org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
>     at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
>     at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
>     at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
>     at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
>     at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
>     at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
>     at 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
>     at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:212)
>     at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
>     at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:208)
>     at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:137)
>     at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:71)
>     at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$5(NodeTestTask.java:139)
>     at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
>     at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$7(NodeTestTask.java:129)
>     at 
> org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
>     at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:127)
>     at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.
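
The invariant the issue title asks for can be sketched conceptually as follows; this is an illustration for the digest, not the Hudi implementation.

{code:scala}
// Conceptual sketch: mint instant times strictly under a lock so two writers
// can never produce the same timestamp (and thus the same .commit file name).
import java.util.concurrent.locks.ReentrantLock

object InstantGenerator {
  private val lock = new ReentrantLock()
  private var last = ""

  def newInstant(now: () => String): String = {
    lock.lock()
    try {
      var t = now()
      while (t <= last) t = now() // enforce strict monotonicity across writers
      last = t
      t
    } finally lock.unlock()
  }
}
{code}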

[jira] [Updated] (HUDI-1457) Add multi writing to Hudi tables using DFS based locking (only HDFS atomic renames)

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1457:
-
Fix Version/s: (was: 0.15.0)

> Add multi writing to Hudi tables using DFS based locking (only HDFS atomic 
> renames)
> ---
>
> Key: HUDI-1457
> URL: https://issues.apache.org/jira/browse/HUDI-1457
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: writer-core
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4068) Add Cosmos based lock provider for Azure

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-4068:
-
Fix Version/s: 1.1.0
   (was: 1.0.0)

> Add Cosmos based lock provider for Azure
> 
>
> Key: HUDI-4068
> URL: https://issues.apache.org/jira/browse/HUDI-4068
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4067) Add Spanner based lock provider for GCP

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-4067:
-
Fix Version/s: 1.1.0
   (was: 1.0.0)

> Add Spanner based lock provider for GCP
> ---
>
> Key: HUDI-4067
> URL: https://issues.apache.org/jira/browse/HUDI-4067
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: concurrency, multi-writer
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2173) Enhancing DynamoDB based LockProvider

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2173:
-
Fix Version/s: 1.1.0

> Enhancing DynamoDB based LockProvider
> -
>
> Key: HUDI-2173
> URL: https://issues.apache.org/jira/browse/HUDI-2173
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: writer-core
>Reporter: Vinoth Chandar
>Assignee: Dave Hagman
>Priority: Major
> Fix For: 0.14.0, 1.1.0
>
>
> Currently, we have ZK and HMS based Lock providers, which can be limited to 
> co-ordinating across a single EMR or Hadoop cluster. 
> For aws users, DynamoDB is a readily available , fully managed , geo 
> replicated datastore, that can actually be used to hold locks, that can now 
> span across EMR/hadoop clusters. 
> This effort involves supporting a new `DynamoDB` lock provider that 
> implements org.apache.hudi.common.lock.LockProvider. We can place the 
> implementation itself in hudi-client-common, so it can be used across Spark, 
> Flink, Deltastreamer etc. 
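
For shape, a hedged sketch of the contract such a provider supplies is below. It mirrors org.apache.hudi.common.lock.LockProvider only loosely, and the DynamoDB calls are deliberately left as comments rather than real AWS SDK usage.

{code:scala}
// Sketch of the contract a DynamoDB-backed provider would implement; DynamoDB's
// conditional-put semantics would supply the actual mutual exclusion.
trait SimpleLockProvider extends AutoCloseable {
  def tryLock(timeoutMs: Long): Boolean // e.g. a conditional PutItem on the lock table
  def unlock(): Unit                    // e.g. delete the lock item
  override def close(): Unit = unlock()
}
{code}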



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2687) [UMBRELLA] A new Trino connector for Hudi

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2687:
-
Fix Version/s: 1.1.0
   (was: 1.0.0)

> [UMBRELLA] A new Trino connector for Hudi
> -
>
> Key: HUDI-2687
> URL: https://issues.apache.org/jira/browse/HUDI-2687
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: trino-presto
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Critical
>  Labels: hudi-umbrellas
> Fix For: 0.14.0, 1.1.0, 0.15.0
>
> Attachments: image-2021-11-05-14-16-57-324.png, 
> image-2021-11-05-14-17-03-211.png
>
>
> This JIRA tracks all the tasks related to building a new Hudi connector in 
> Trino.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hussein-awala commented on a diff in pull request #9441: [HUDI-6676][DOCS] Add command for CreateHoodieTableLike

2023-08-14 Thread via GitHub


hussein-awala commented on code in PR #9441:
URL: https://github.com/apache/hudi/pull/9441#discussion_r1294085403


##
website/docs/quick-start-guide.md:
##
@@ -384,6 +384,68 @@ create table hudi_ctas_cow_pt_tbl2 using hudi location 
'file:/tmp/hudi/hudi_tbl/
 partitioned by (datestr) as select * from parquet_mngd;
 ```
 
+**CREATE TABLE LIKE**
+
+The "CREATE TABLE LIKE" statement allows you to create a new Hudi table with 
the same schema and properties from an existing Hudi/hive table.

Review Comment:
   Nit
   ```suggestion
   The `CREATE TABLE LIKE` statement allows you to create a new Hudi table with 
the same schema and properties from an existing Hudi/hive table.
   ```
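
   For readers of this thread, a small usage sketch of the statement under discussion; the table names and location are made up, and an active SparkSession `spark` is assumed.

   ```scala
// Hypothetical example: clone the schema and properties of an existing Hudi table.
spark.sql(
  """
    |CREATE TABLE hudi_tbl_clone LIKE hudi_existing_tbl
    |USING hudi
    |LOCATION 'file:/tmp/hudi/hudi_tbl_clone'
    |""".stripMargin)
   ```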



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4141) [RFC-64] Table Format APIs

2023-08-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-4141:
-
Start Date: 4/Sep/23
  Due Date: 4/Oct/23

> [RFC-64] Table Format APIs
> --
>
> Key: HUDI-4141
> URL: https://issues.apache.org/jira/browse/HUDI-4141
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: reader-core, writer-core
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 1.0.0
>
>
> RFC: [https://github.com/apache/hudi/pull/7080]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hussein-awala commented on a diff in pull request #9444: [HUDI-6692] Do not allow switching from Primary keyed table to primary key less table

2023-08-14 Thread via GitHub


hussein-awala commented on code in PR #9444:
URL: https://github.com/apache/hudi/pull/9444#discussion_r1294078912


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala:
##
@@ -179,9 +179,11 @@ object HoodieWriterUtils {
    if (null != tableConfig) {
      val datasourceRecordKey = params.getOrElse(RECORDKEY_FIELD.key(), null)
      val tableConfigRecordKey = tableConfig.getString(HoodieTableConfig.RECORDKEY_FIELDS)
-     if ((null != datasourceRecordKey && null != tableConfigRecordKey
-       && datasourceRecordKey != tableConfigRecordKey) || (null != datasourceRecordKey && datasourceRecordKey.nonEmpty
-       && tableConfigRecordKey == null)) {
+     val dsnull = datasourceRecordKey == null
+     val tcnull = tableConfigRecordKey == null
+     if ((!dsnull && !tcnull && datasourceRecordKey != tableConfigRecordKey)
+       || (!dsnull && datasourceRecordKey.nonEmpty
+       && tcnull) || ((dsnull || datasourceRecordKey.isEmpty) && !tcnull)) {

Review Comment:
   I'm not sure, but I wonder if tableConfigRecordKey could be empty string
   ```suggestion
      && tcnull) || ((dsnull || datasourceRecordKey.isEmpty) && !tcnull && tableConfigRecordKey.nonEmpty)) {
   ```
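
   An equivalent formulation that also covers the empty-string case the reviewer raises, sketched here for comparison rather than proposed as the patch:

   ```scala
// Sketch: normalize null/empty to None first; then any difference, including
// switching between keyed and key-less tables, reads as a single inequality.
def recordKeyChanged(datasourceKey: String, tableConfigKey: String): Boolean = {
  val ds = Option(datasourceKey).filter(_.nonEmpty)
  val tc = Option(tableConfigKey).filter(_.nonEmpty)
  ds != tc
}
   ```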



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9445: [HUDI-6694] Fix log file CLI around command blocks

2023-08-14 Thread via GitHub


hudi-bot commented on PR #9445:
URL: https://github.com/apache/hudi/pull/9445#issuecomment-1678258723

   
   ## CI report:
   
   * 06d72d5563b9cd26e131c3907dcc653e59a2b8be Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19293)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9445: [HUDI-6694] Fix log file CLI around command blocks

2023-08-14 Thread via GitHub


hudi-bot commented on PR #9445:
URL: https://github.com/apache/hudi/pull/9445#issuecomment-1678253584

   
   ## CI report:
   
   * 06d72d5563b9cd26e131c3907dcc653e59a2b8be UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9437: [HUDI-6689] Add record index validation in MDT validator

2023-08-14 Thread via GitHub


hudi-bot commented on PR #9437:
URL: https://github.com/apache/hudi/pull/9437#issuecomment-1678253520

   
   ## CI report:
   
   * b25b5402c1e3e14264c6bbfd38910f4b93b8a871 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19291)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hussein-awala commented on a diff in pull request #9403: [HUDI-6683] Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource

2023-08-14 Thread via GitHub


hussein-awala commented on code in PR #9403:
URL: https://github.com/apache/hudi/pull/9403#discussion_r1294069377


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/AvroConvertor.java:
##
@@ -175,9 +176,11 @@ public GenericRecord 
withKafkaFieldsAppended(ConsumerRecord consumerRecord) {
 for (Schema.Field field :  record.getSchema().getFields()) {
   recordBuilder.set(field, record.get(field.name()));
 }
+
 recordBuilder.set(KAFKA_SOURCE_OFFSET_COLUMN, consumerRecord.offset());
 recordBuilder.set(KAFKA_SOURCE_PARTITION_COLUMN, consumerRecord.partition());
 recordBuilder.set(KAFKA_SOURCE_TIMESTAMP_COLUMN, consumerRecord.timestamp());
+recordBuilder.set(KAFKA_SOURCE_KEY_COLUMN, String.valueOf(consumerRecord.key()));

Review Comment:
   ```suggestion
    recordBuilder.set(KAFKA_SOURCE_KEY_COLUMN, consumerRecord.key().toString());
   ```
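
   One caveat worth noting on this suggestion: Kafka record keys may legitimately be null, in which case `key().toString()` throws an NPE while `String.valueOf` yields the literal string "null". A null-safe sketch, illustrative only:

   ```scala
// Sketch: map a possibly-null Kafka key to an Option instead of "null" or an NPE.
def safeKey(key: Any): Option[String] = Option(key).map(_.toString)
   ```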



##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java:
##
@@ -80,11 +81,13 @@ protected  JavaRDD 
maybeAppendKafkaOffsets(JavaRDD {
   String record = consumerRecord.value().toString();

Review Comment:
   I think renaming this variable to `recordValue` might make the code more 
readable:
   ```suggestion
 String recordValue = consumerRecord.value().toString();
   ```



##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java:
##
@@ -80,11 +81,13 @@ protected  JavaRDD 
maybeAppendKafkaOffsets(JavaRDD {
   String record = consumerRecord.value().toString();
+  String recordKey = (String) consumerRecord.key();

Review Comment:
   ```suggestion
 String recordKey = consumerRecord.key().toString();
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua opened a new pull request, #5269: [HUDI-3636] Create new write clients for async table services in DeltaStreamer and Spark streaming sink

2023-08-14 Thread via GitHub


yihua opened a new pull request, #5269:
URL: https://github.com/apache/hudi/pull/5269

   ## What is the purpose of the pull request
   
   - In Deltastreamer, we re-instantiate the WriteClient whenever the schema changes. The 
same write client is used by all async table services as well. This poses an 
issue: the re-instantiated write client is communicated to the async table 
service, but if that service is in the middle of a compaction it keeps using a 
local copy of the old write client, and hence may not be able to reach the 
timeline server and will run into connection issues. We are fixing this in this 
patch. 
   - We have a singleton instance of embedded timeline service which regular 
writers and all table services will use. And within async table services, we 
will listen to write config changes and re-instantiate write client before any 
new compaction execution. 
   - Even between multiple re-instantiations of write clients for regular 
writer (due to schema change), uses the same singleton embedded timeline 
server. 
   - Previously embedded timeline server was shutdown when write client was 
shutdown. Fixed that in this patch, so that a single instantiation and tear 
down of embedded timeline server will span entire process start and stop. 
   - This also fixes a long standing issue w/ spark structured streaming. 
Apparently, this is what is happening in spark structured streaming flow. We 
start a new write client during first batch and close it at the end. But keep 
re-using the same instance of writeClient for subsequent batches. Only core 
entity that is impacted here was the embedded timeline server since we were 
closing it when write client was closed. So, after batch1, if timeline server 
was enabled, pipeline will fail since timeline server is shutdown. So, in this 
patch we are fixing that as well. Embedded timeline server is externally 
instantiated and so writeClient.close() will not shutdown the timeline server. 
We have a singleton instance of timeline server through entire pipeline. 
Previously we hard coded DIRECT style markers for spark streaming, but after 
this patch, we should be able to relax that. 
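
   A highly simplified sketch of the lifecycle described above; `TimelineService` and `Writer` are stand-ins for this digest, not Hudi classes.

   ```scala
// One timeline service spans the whole process; write clients come and go
// around it, and closing a client must not tear the service down.
class TimelineService { def start(): Unit = {}; def stop(): Unit = {} }
class Writer(ts: TimelineService) extends AutoCloseable {
  override def close(): Unit = {} // must NOT call ts.stop()
}

val service = new TimelineService
service.start()
try {
  val w1 = new Writer(service); w1.close() // e.g. closed on schema change
  val w2 = new Writer(service); w2.close() // new client reuses the same service
} finally service.stop() // single teardown at process exit
   ```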
   
   
   ## Brief change log
   
   - Fixed Deltastreamer and Spark streaming sink to ensure the timeline server 
sustains multiple instantiations of the write client by different writers. 
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   
 - *Manually verified the change by running a job locally.*
 - For structured streaming, existing tests cover all flows. 
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nfarah86 opened a new pull request, #9446: updated image path for /blog

2023-08-14 Thread via GitHub


nfarah86 opened a new pull request, #9446:
URL: https://github.com/apache/hudi/pull/9446

   ### Change Logs
fixed broken images 
   https://github.com/apache/hudi/assets/5392555/055efb07-c4bc-4727-a4e4-bdb81fdbf546
   
   @nsivabalan please review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6694) Fix log file CLI around command blocks

2023-08-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6694:
-
Labels: pull-request-available  (was: )

> Fix log file CLI around command blocks
> --
>
> Key: HUDI-6694
> URL: https://issues.apache.org/jira/browse/HUDI-6694
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
>
> When there are rollback command blocks in the log files, the log file command 
> throws NPE:
> {code:java}
> hudi:hoodie_table->show logfile metadata --logFilePathPattern 
> file:/.1414abd2-346b-4c84-b380-c6ea6ec0863a-0_20230813220941456.log*
> java.lang.NullPointerException
>   at java.util.Objects.requireNonNull(Objects.java:203)
>   at 
> org.apache.hudi.cli.commands.HoodieLogFileCommand.showLogFileCommits(HoodieLogFileCommand.java:102)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.springframework.shell.command.invocation.InvocableShellMethod.doInvoke(InvocableShellMethod.java:306)
>   at 
> org.springframework.shell.command.invocation.InvocableShellMethod.invoke(InvocableShellMethod.java:232)
>   at 
> org.springframework.shell.command.CommandExecution$DefaultCommandExecution.evaluate(CommandExecution.java:158)
>   at org.springframework.shell.Shell.evaluate(Shell.java:208)
>   at org.springframework.shell.Shell.run(Shell.java:140)
>   at 
> org.springframework.shell.jline.InteractiveShellRunner.run(InteractiveShellRunner.java:73)
>   at 
> org.springframework.shell.DefaultShellApplicationRunner.run(DefaultShellApplicationRunner.java:65)
>   at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:762)
>   at 
> org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:752)
>   at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:315)
>   at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1306)
>   at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1295)
>   at org.apache.hudi.cli.Main.main(Main.java:34)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] yihua opened a new pull request, #9445: [HUDI-6694] Fix log file CLI around command blocks

2023-08-14 Thread via GitHub


yihua opened a new pull request, #9445:
URL: https://github.com/apache/hudi/pull/9445

   ### Change Logs
   
   This PR fixes the log file CLI commands when the log file contains command 
blocks like rollback commands.
   
   The tests are adjusted to consider such a scenario.  Without the fix, the 
new tests fail.
   
   Before the fix, when there are rollback command blocks in the log files, the 
log file command throws NPE:
   ```
   hudi:hoodie_table->show logfile metadata --logFilePathPattern 
file:/.1414abd2-346b-4c84-b380-c6ea6ec0863a-0_20230813220941456.log*
   
   java.lang.NullPointerException
at java.util.Objects.requireNonNull(Objects.java:203)
at 
org.apache.hudi.cli.commands.HoodieLogFileCommand.showLogFileCommits(HoodieLogFileCommand.java:102)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.springframework.shell.command.invocation.InvocableShellMethod.doInvoke(InvocableShellMethod.java:306)
at 
org.springframework.shell.command.invocation.InvocableShellMethod.invoke(InvocableShellMethod.java:232)
at 
org.springframework.shell.command.CommandExecution$DefaultCommandExecution.evaluate(CommandExecution.java:158)
at org.springframework.shell.Shell.evaluate(Shell.java:208)
at org.springframework.shell.Shell.run(Shell.java:140)
at 
org.springframework.shell.jline.InteractiveShellRunner.run(InteractiveShellRunner.java:73)
at 
org.springframework.shell.DefaultShellApplicationRunner.run(DefaultShellApplicationRunner.java:65)
at 
org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:762)
at 
org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:752)
at 
org.springframework.boot.SpringApplication.run(SpringApplication.java:315)
at 
org.springframework.boot.SpringApplication.run(SpringApplication.java:1306)
at 
org.springframework.boot.SpringApplication.run(SpringApplication.java:1295)
at org.apache.hudi.cli.Main.main(Main.java:34)
   ```
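   
   For intuition, here is a minimal, self-contained sketch of the defensive 
pattern (illustrative stand-in types, not the actual HoodieLogFileCommand 
code): command blocks such as rollback commands carry no schema or records, so 
they must be skipped rather than handed to Objects.requireNonNull.
   
   ```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class CommandBlockSketch {
  // Illustrative stand-ins for Hudi's log block model.
  enum BlockType { DATA_BLOCK, COMMAND_BLOCK }

  static class LogBlock {
    final BlockType type;
    final String schema; // null for command blocks, which carry no records
    LogBlock(BlockType type, String schema) { this.type = type; this.schema = schema; }
  }

  // Count only data blocks, skipping command blocks whose schema is null --
  // passing that null to requireNonNull is exactly the NPE shown above.
  static int countDataBlocks(List<LogBlock> blocks) {
    int count = 0;
    for (LogBlock block : blocks) {
      if (block.type == BlockType.COMMAND_BLOCK) {
        continue;
      }
      Objects.requireNonNull(block.schema, "data block must carry a schema");
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    List<LogBlock> blocks = Arrays.asList(
        new LogBlock(BlockType.DATA_BLOCK, "avro-schema"),
        new LogBlock(BlockType.COMMAND_BLOCK, null)); // e.g. a rollback command
    System.out.println(countDataBlocks(blocks)); // prints 1
  }
}
   ```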
   
   ### Impact
   
   Bug fix on log file CLI.
   
   ### Risk level
   
   none
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6694) Fix log file CLI around command blocks

2023-08-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6694:

Description: 
When there are rollback command blocks in the log files, the log file command 
throws NPE:

{code:java}
hudi:hoodie_table->show logfile metadata --logFilePathPattern 
file:/.1414abd2-346b-4c84-b380-c6ea6ec0863a-0_20230813220941456.log*

java.lang.NullPointerException
at java.util.Objects.requireNonNull(Objects.java:203)
at 
org.apache.hudi.cli.commands.HoodieLogFileCommand.showLogFileCommits(HoodieLogFileCommand.java:102)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.springframework.shell.command.invocation.InvocableShellMethod.doInvoke(InvocableShellMethod.java:306)
at 
org.springframework.shell.command.invocation.InvocableShellMethod.invoke(InvocableShellMethod.java:232)
at 
org.springframework.shell.command.CommandExecution$DefaultCommandExecution.evaluate(CommandExecution.java:158)
at org.springframework.shell.Shell.evaluate(Shell.java:208)
at org.springframework.shell.Shell.run(Shell.java:140)
at 
org.springframework.shell.jline.InteractiveShellRunner.run(InteractiveShellRunner.java:73)
at 
org.springframework.shell.DefaultShellApplicationRunner.run(DefaultShellApplicationRunner.java:65)
at 
org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:762)
at 
org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:752)
at 
org.springframework.boot.SpringApplication.run(SpringApplication.java:315)
at 
org.springframework.boot.SpringApplication.run(SpringApplication.java:1306)
at 
org.springframework.boot.SpringApplication.run(SpringApplication.java:1295)
at org.apache.hudi.cli.Main.main(Main.java:34)
{code}


  was:
When there are rollback command blocks in the log files, the 



> Fix log file CLI around command blocks
> --
>
> Key: HUDI-6694
> URL: https://issues.apache.org/jira/browse/HUDI-6694
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>
> When there are rollback command blocks in the log files, the log file command 
> throws NPE:
> {code:java}
> hudi:hoodie_table->show logfile metadata --logFilePathPattern 
> file:/.1414abd2-346b-4c84-b380-c6ea6ec0863a-0_20230813220941456.log*
> java.lang.NullPointerException
>   at java.util.Objects.requireNonNull(Objects.java:203)
>   at 
> org.apache.hudi.cli.commands.HoodieLogFileCommand.showLogFileCommits(HoodieLogFileCommand.java:102)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.springframework.shell.command.invocation.InvocableShellMethod.doInvoke(InvocableShellMethod.java:306)
>   at 
> org.springframework.shell.command.invocation.InvocableShellMethod.invoke(InvocableShellMethod.java:232)
>   at 
> org.springframework.shell.command.CommandExecution$DefaultCommandExecution.evaluate(CommandExecution.java:158)
>   at org.springframework.shell.Shell.evaluate(Shell.java:208)
>   at org.springframework.shell.Shell.run(Shell.java:140)
>   at 
> org.springframework.shell.jline.InteractiveShellRunner.run(InteractiveShellRunner.java:73)
>   at 
> org.springframework.shell.DefaultShellApplicationRunner.run(DefaultShellApplicationRunner.java:65)
>   at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:762)
>   at 
> org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:752)
>   at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:315)
>   at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1306)
>   at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1295)
>   at org.apache.hudi.cli.Main.main(Main.java:34)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6694) Fix log file CLI around command blocks

2023-08-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6694:

Description: 
When there are rollback command blocks in the log files, the 


> Fix log file CLI around command blocks
> --
>
> Key: HUDI-6694
> URL: https://issues.apache.org/jira/browse/HUDI-6694
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Major
>
> When there are rollback command blocks in the log files, the 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6694) Fix log file CLI around command blocks

2023-08-14 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-6694:
---

 Summary: Fix log file CLI around command blocks
 Key: HUDI-6694
 URL: https://issues.apache.org/jira/browse/HUDI-6694
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Ethan Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-6694) Fix log file CLI around command blocks

2023-08-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-6694:
---

Assignee: Ethan Guo

> Fix log file CLI around command blocks
> --
>
> Key: HUDI-6694
> URL: https://issues.apache.org/jira/browse/HUDI-6694
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>
> When there are rollback command blocks in the log files, the 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] yihua commented on a diff in pull request #9437: [HUDI-6689] Add record index validation in MDT validator

2023-08-14 Thread via GitHub


yihua commented on code in PR #9437:
URL: https://github.com/apache/hudi/pull/9437#discussion_r1294049901


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java:
##
@@ -741,6 +791,116 @@ private void validateBloomFilters(
 validate(metadataBasedBloomFilters, fsBasedBloomFilters, partitionPath, 
"bloom filters");
   }
 
+  private void validateRecordIndex(HoodieSparkEngineContext sparkEngineContext,
+   HoodieTableMetaClient metaClient,
+   HoodieTableMetadata tableMetadata) {
+if (cfg.validateRecordIndexContent) {
+  validateRecordIndexContent(sparkEngineContext, metaClient, 
tableMetadata);
+} else if (cfg.validateRecordIndexCount) {
+  validateRecordIndexCount(sparkEngineContext, metaClient);
+}
+  }
+
+  private void validateRecordIndexCount(HoodieSparkEngineContext 
sparkEngineContext,
+HoodieTableMetaClient metaClient) {
+String basePath = metaClient.getBasePathV2().toString();
+long countKeyFromTable = 
sparkEngineContext.getSqlContext().read().format("hudi")
+.load(basePath)
+.select(RECORD_KEY_METADATA_FIELD)
+.distinct()
+.count();
+long countKeyFromRecordIndex = 
sparkEngineContext.getSqlContext().read().format("hudi")
+.load(getMetadataTableBasePath(basePath))
+.select("key")
+.filter("type = 5")
+.distinct()
+.count();
+
+if (countKeyFromTable != countKeyFromRecordIndex) {
+  String message = String.format("Validation of record index count failed: 
"
+  + "%s entries from record index metadata, %s keys from the data 
table.",
+  countKeyFromRecordIndex, countKeyFromTable);
+  LOG.error(message);
+  throw new HoodieValidationException(message);
+} else {
+  LOG.info(String.format(
+  "Validation of record index count succeeded: %s entries.", 
countKeyFromRecordIndex));
+}
+  }
+
+  private void validateRecordIndexContent(HoodieSparkEngineContext 
sparkEngineContext,
+  HoodieTableMetaClient metaClient,
+  HoodieTableMetadata tableMetadata) {
+String basePath = metaClient.getBasePathV2().toString();
+JavaPairRDD> keyToLocationOnFsRdd =
+sparkEngineContext.getSqlContext().read().format("hudi").load(basePath)
+.select(RECORD_KEY_METADATA_FIELD, PARTITION_PATH_METADATA_FIELD, 
FILENAME_METADATA_FIELD)
+.toJavaRDD()
+.mapToPair(row -> new 
Tuple2<>(row.getString(row.fieldIndex(RECORD_KEY_METADATA_FIELD)),
+
Pair.of(row.getString(row.fieldIndex(PARTITION_PATH_METADATA_FIELD)),
+
FSUtils.getFileId(row.getString(row.fieldIndex(FILENAME_METADATA_FIELD))
+.cache();
+
+JavaPairRDD> keyToLocationFromRecordIndexRdd =
+sparkEngineContext.getSqlContext().read().format("hudi")
+.load(getMetadataTableBasePath(basePath))
+.filter("type = 5")
+.select(functions.col("key"),
+
functions.col("recordIndexMetadata.partitionName").as("partitionName"),
+
functions.col("recordIndexMetadata.fileIdHighBits").as("fileIdHighBits"),
+
functions.col("recordIndexMetadata.fileIdLowBits").as("fileIdLowBits"),
+functions.col("recordIndexMetadata.fileIndex").as("fileIndex"),
+functions.col("recordIndexMetadata.fileId").as("fileId"),
+
functions.col("recordIndexMetadata.instantTime").as("instantTime"),
+
functions.col("recordIndexMetadata.fileIdEncoding").as("fileIdEncoding"))
+.toJavaRDD()
+.mapToPair(row -> {
+  HoodieRecordGlobalLocation location = 
HoodieTableMetadataUtil.getLocationFromRecordIndexInfo(
+  row.getString(row.fieldIndex("partitionName")),
+  row.getInt(row.fieldIndex("fileIdEncoding")),
+  row.getLong(row.fieldIndex("fileIdHighBits")),
+  row.getLong(row.fieldIndex("fileIdLowBits")),
+  row.getInt(row.fieldIndex("fileIndex")),
+  row.getString(row.fieldIndex("fileId")),
+  row.getLong(row.fieldIndex("instantTime")));
+  return new Tuple2<>(row.getString(row.fieldIndex("key")),
+  Pair.of(location.getPartitionPath(), location.getFileId()));
+});
+
+long diffCount = 
keyToLocationOnFsRdd.fullOuterJoin(keyToLocationFromRecordIndexRdd, 
cfg.recordIndexParallelism)
+.map(e -> {
+  Optional> locationOnFs = e._2._1;
+  Optional> locationFromRecordIndex = e._2._2;
+  if (locationOnFs.isPresent() && locationFromRecordIndex.isPresent()) 
{
+if 
(locationOnFs.get().getLeft().equals(locationFromRecordIndex.get().getLeft())
+   

[GitHub] [hudi] hudi-bot commented on pull request #9444: [HUDI-6692] Do not allow switching from Primary keyed table to primary key less table

2023-08-14 Thread via GitHub


hudi-bot commented on PR #9444:
URL: https://github.com/apache/hudi/pull/9444#issuecomment-1678162678

   
   ## CI report:
   
   * c7e99fd19a00469c0e181b6c64b63aa9cfb7ed4e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19292)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9444: [HUDI-6692] Do not allow switching from Primary keyed table to primary key less table

2023-08-14 Thread via GitHub


hudi-bot commented on PR #9444:
URL: https://github.com/apache/hudi/pull/9444#issuecomment-1678153770

   
   ## CI report:
   
   * c7e99fd19a00469c0e181b6c64b63aa9cfb7ed4e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #9437: [HUDI-6689] Add record index validation in MDT validator

2023-08-14 Thread via GitHub


nsivabalan commented on code in PR #9437:
URL: https://github.com/apache/hudi/pull/9437#discussion_r1294013795


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java:
##
@@ -741,6 +791,116 @@ private void validateBloomFilters(
 validate(metadataBasedBloomFilters, fsBasedBloomFilters, partitionPath, 
"bloom filters");
   }
 
+  private void validateRecordIndex(HoodieSparkEngineContext sparkEngineContext,
+   HoodieTableMetaClient metaClient,
+   HoodieTableMetadata tableMetadata) {
+if (cfg.validateRecordIndexContent) {
+  validateRecordIndexContent(sparkEngineContext, metaClient, 
tableMetadata);
+} else if (cfg.validateRecordIndexCount) {
+  validateRecordIndexCount(sparkEngineContext, metaClient);
+}
+  }
+
+  private void validateRecordIndexCount(HoodieSparkEngineContext 
sparkEngineContext,
+HoodieTableMetaClient metaClient) {
+String basePath = metaClient.getBasePathV2().toString();
+long countKeyFromTable = 
sparkEngineContext.getSqlContext().read().format("hudi")
+.load(basePath)
+.select(RECORD_KEY_METADATA_FIELD)
+.distinct()
+.count();
+long countKeyFromRecordIndex = 
sparkEngineContext.getSqlContext().read().format("hudi")
+.load(getMetadataTableBasePath(basePath))
+.select("key")
+.filter("type = 5")
+.distinct()

Review Comment:
   A snapshot read by itself should return unique values; if there are dups, 
it's a bug. Can we remove distinct() here?
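   
   For illustration, the same chain from the quoted diff with distinct() 
dropped (a sketch of the suggestion only; variables are as defined in the 
surrounding method):
   
   ```java
// Sketch of the suggested simplification: rely on the snapshot read
// returning unique keys and drop the extra distinct() shuffle.
long countKeyFromRecordIndex = sparkEngineContext.getSqlContext().read().format("hudi")
    .load(getMetadataTableBasePath(basePath))
    .select("key")
    .filter("type = 5")
    .count();
   ```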



##
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java:
##
@@ -741,6 +791,116 @@ private void validateBloomFilters(
 validate(metadataBasedBloomFilters, fsBasedBloomFilters, partitionPath, 
"bloom filters");
   }
 
+  private void validateRecordIndex(HoodieSparkEngineContext sparkEngineContext,
+   HoodieTableMetaClient metaClient,
+   HoodieTableMetadata tableMetadata) {
+if (cfg.validateRecordIndexContent) {
+  validateRecordIndexContent(sparkEngineContext, metaClient, 
tableMetadata);
+} else if (cfg.validateRecordIndexCount) {
+  validateRecordIndexCount(sparkEngineContext, metaClient);
+}
+  }
+
+  private void validateRecordIndexCount(HoodieSparkEngineContext 
sparkEngineContext,
+HoodieTableMetaClient metaClient) {
+String basePath = metaClient.getBasePathV2().toString();
+long countKeyFromTable = 
sparkEngineContext.getSqlContext().read().format("hudi")
+.load(basePath)
+.select(RECORD_KEY_METADATA_FIELD)
+.distinct()
+.count();
+long countKeyFromRecordIndex = 
sparkEngineContext.getSqlContext().read().format("hudi")
+.load(getMetadataTableBasePath(basePath))
+.select("key")
+.filter("type = 5")
+.distinct()
+.count();
+
+if (countKeyFromTable != countKeyFromRecordIndex) {
+  String message = String.format("Validation of record index count failed: 
"
+  + "%s entries from record index metadata, %s keys from the data 
table.",
+  countKeyFromRecordIndex, countKeyFromTable);
+  LOG.error(message);
+  throw new HoodieValidationException(message);
+} else {
+  LOG.info(String.format(
+  "Validation of record index count succeeded: %s entries.", 
countKeyFromRecordIndex));
+}
+  }
+
+  private void validateRecordIndexContent(HoodieSparkEngineContext 
sparkEngineContext,
+  HoodieTableMetaClient metaClient,
+  HoodieTableMetadata tableMetadata) {
+String basePath = metaClient.getBasePathV2().toString();
+JavaPairRDD> keyToLocationOnFsRdd =
+sparkEngineContext.getSqlContext().read().format("hudi").load(basePath)
+.select(RECORD_KEY_METADATA_FIELD, PARTITION_PATH_METADATA_FIELD, 
FILENAME_METADATA_FIELD)
+.toJavaRDD()
+.mapToPair(row -> new 
Tuple2<>(row.getString(row.fieldIndex(RECORD_KEY_METADATA_FIELD)),
+
Pair.of(row.getString(row.fieldIndex(PARTITION_PATH_METADATA_FIELD)),
+
FSUtils.getFileId(row.getString(row.fieldIndex(FILENAME_METADATA_FIELD))
+.cache();
+
+JavaPairRDD> keyToLocationFromRecordIndexRdd =
+sparkEngineContext.getSqlContext().read().format("hudi")
+.load(getMetadataTableBasePath(basePath))
+.filter("type = 5")
+.select(functions.col("key"),
+
functions.col("recordIndexMetadata.partitionName").as("partitionName"),
+
functions.col("recordIndexMetadata.fileIdHighBits").as("fileIdHighBits"),
+
functions.col("recordIndexMetadata.fileIdLowBits").as("fileIdLowBits"),
+function

[GitHub] [hudi] prashantwason commented on pull request #9199: [HUDI-6534]Support consistent hashing row writer

2023-08-14 Thread via GitHub


prashantwason commented on PR #9199:
URL: https://github.com/apache/hudi/pull/9199#issuecomment-1678132749

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] prashantwason commented on pull request #9199: [HUDI-6534]Support consistent hashing row writer

2023-08-14 Thread via GitHub


prashantwason commented on PR #9199:
URL: https://github.com/apache/hudi/pull/9199#issuecomment-1678132491

   @stream2000 The build is failing due to a test failure caused by this 
commit. Can you please check?
   
https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/19273/logs/25
   
   This is blocking the 0.14.0 release, so please prioritize if possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6693) Streaming writes fail in quick start w/ 0.14.0

2023-08-14 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-6693:
-

 Summary: Streaming writes fail in quick start w/ 0.14.0 
 Key: HUDI-6693
 URL: https://issues.apache.org/jira/browse/HUDI-6693
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark, writer-core
Reporter: sivabalan narayanan


Quick start fails with streaming ingestion. 

 
{code:java}
scala> df.writeStream.format("hudi").
 |   options(getQuickstartWriteConfigs).
 |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
 |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
 |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
 |   option(TABLE_NAME, streamingTableName).
 |   outputMode("append").
 |   option("path", baseStreamingPath).
 |   option("checkpointLocation", checkpointLocation).
 |   trigger(Trigger.Once()).
 |   start()
warning: one deprecation; for details, enable `:setting -deprecation' or 
`:replay -deprecation'
23/08/10 14:31:09 WARN HoodieStreamingSink: Ignore TableNotFoundException as it 
is first microbatch.
23/08/10 14:31:09 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not 
supported in streaming DataFrames/Datasets and will be disabled.
res12: org.apache.spark.sql.streaming.StreamingQuery = 
org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@75143003

scala> 23/08/10 14:31:10 WARN HiveConf: HiveConf of name 
hive.stats.jdbc.timeout does not exist
23/08/10 14:31:10 WARN HiveConf: HiveConf of name hive.stats.retries.wait does 
not exist
23/08/10 14:31:10 WARN AutoRecordKeyGenerationUtils$: Precombine field ts will 
be ignored with auto record key generation enabled
23/08/10 14:31:10 WARN HoodieWriteConfig: Embedded timeline server is disabled, 
fallback to use direct marker type for spark
23/08/10 14:31:10 ERROR HoodieStreamingSink: Micro batch id=0 threw following 
exception: 
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be 
executed with writeStream.start();
LogicalRDD [_hoodie_commit_time#1063, _hoodie_commit_seqno#1064, 
_hoodie_record_key#1065, _hoodie_partition_path#1066, _hoodie_file_name#1067, 
begin_lat#1068, begin_lon#1069, driver#1070, end_lat#1071, end_lon#1072, 
fare#1073, partitionpath#1074, rider#1075, ts#1076L, uuid#1077], true

at 
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.throwError(UnsupportedOperationChecker.scala:447)
at 
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.$anonfun$checkForBatch$1(UnsupportedOperationChecker.scala:38)
at 
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.$anonfun$checkForBatch$1$adapted(UnsupportedOperationChecker.scala:36)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:262)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:262)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:262)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:262)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:262)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:262)
at 
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:36)
at 
org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:69)
at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$withCachedData$1(QueryExecution.scala:109)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at 
org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:107)
at 
org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:107)
at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$optimize

[jira] [Updated] (HUDI-6692) If table with recordkey doesn't have recordkey in spark ds write, it will bulk insert by default

2023-08-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6692:
-
Labels: pull-request-available  (was: )

> If table with recordkey doesn't have recordkey in spark ds write, it will 
> bulk insert by default
> 
>
> Key: HUDI-6692
> URL: https://issues.apache.org/jira/browse/HUDI-6692
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> If an existing table has a record key and you write with the Spark datasource 
> without including a record key, the write is treated as primary-key-less 
> (pkless) and defaults to bulk insert.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] jonvex opened a new pull request, #9444: [HUDI-6692] If pk table has no recordkey in write, it should fail

2023-08-14 Thread via GitHub


jonvex opened a new pull request, #9444:
URL: https://github.com/apache/hudi/pull/9444

   ### Change Logs
   
   If the write is missing the record key, it was treated as a primary-key-less 
(pkless) write; with this change it fails instead.
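   
   A minimal, self-contained sketch of the intended guard, assuming the 
standard Hudi config keys for record key fields (the actual patch lives in the 
Spark datasource write path and may look different):
   
   ```java
import java.util.HashMap;
import java.util.Map;

public class RecordKeyGuardSketch {
  // Hypothetical guard: if the existing table defines a record key but the
  // incoming write options do not, fail fast instead of silently treating
  // the write as primary-key-less (which defaults to bulk insert).
  static void validateRecordKey(Map<String, String> tableConfig,
                                Map<String, String> writeOptions) {
    String tableKey = tableConfig.get("hoodie.table.recordkey.fields");
    String writeKey = writeOptions.get("hoodie.datasource.write.recordkey.field");
    if (tableKey != null && !tableKey.isEmpty()
        && (writeKey == null || writeKey.isEmpty())) {
      throw new IllegalArgumentException("Table defines record key '" + tableKey
          + "' but the write does not specify one; refusing to fall back to a "
          + "key-less write.");
    }
  }

  public static void main(String[] args) {
    Map<String, String> tableConfig = new HashMap<>();
    tableConfig.put("hoodie.table.recordkey.fields", "uuid");
    validateRecordKey(tableConfig, new HashMap<>()); // throws: key missing from write
  }
}
   ```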
   
   ### Impact
   
   prevent unexpected behavior
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6692) If table with recordkey doesn't have recordkey in spark ds write, it will bulk insert by default

2023-08-14 Thread Jonathan Vexler (Jira)
Jonathan Vexler created HUDI-6692:
-

 Summary: If table with recordkey doesn't have recordkey in spark 
ds write, it will bulk insert by default
 Key: HUDI-6692
 URL: https://issues.apache.org/jira/browse/HUDI-6692
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark-sql
Reporter: Jonathan Vexler
Assignee: Jonathan Vexler
 Fix For: 0.14.0


If an existing table has a record key and you write with the Spark datasource 
without including a record key, the write is treated as primary-key-less 
(pkless) and defaults to bulk insert.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] Riddle4045 commented on issue #9435: [SUPPORT] Trino can't read tables created by Flink Hudi conector

2023-08-14 Thread via GitHub


Riddle4045 commented on issue #9435:
URL: https://github.com/apache/hudi/issues/9435#issuecomment-1678089145

@danny0405 I checked the table props in the metastore for a table synced using 
the Hudi HMS sync tool vs. the Flink table I mentioned below, and I see very 
different properties here. 
   
   Table props for the table created using the Hudi HMS sync tool
   
   ```
   TBL_ID   PARAM_KEY   PARAM_VALUE
   250  EXTERNALTRUE
   250  last_commit_time_sync   20230601210025262
   250  numFiles0
   250  spark.sql.sources.provider  hudi
   250  spark.sql.sources.schema.numParts   1
   250  spark.sql.sources.schema.part.0 
{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},{"name":"rideId","type":"long","nullable":true,"metadata":{}},{"name":"driverId","type":"long","nullable":false,"metadata":{}},{"name":"taxiId","type":"long","nullable":true,"metadata":{}},{"name":"startTime","type":"long","nullable":true,"metadata":{}},{"name":"tip","type":"float","nullable":true,"metadata":{}},{"name":"tolls","type":"float","nullable":true,"metadata":{}},{"name":"totalFare","type":"float","nullable":true,"metadata":{}}]}
   250  totalSize   0
   250  transient_lastDdlTime   1685653353
   ```
   
   HMS props for the Hudi table created using Flink SQL 
   
   ```
   TBL_ID   PARAM_KEY   PARAM_VALUE
   335  flink.comment   
   335  flink.connector hudi
   335  flink.hive_sync.enable  true
   335  flink.hive_sync.metastore.uris  thrift://hive-metastore:9083
   335  flink.hive_sync.modehms
   335  flink.partition.keys.0.name partition
   335  flink.path  abfs://fl...@test.dfs.core.windows.net/hudi/t1hms4
   335  flink.schema.0.data-typeVARCHAR(20)
   335  flink.schema.0.name uuid
   335  flink.schema.1.data-typeVARCHAR(10)
   335  flink.schema.1.name name
   335  flink.schema.2.data-typeINT
   335  flink.schema.2.name age
   335  flink.schema.3.data-typeTIMESTAMP(3)
   335  flink.schema.3.name ts
   335  flink.schema.4.data-typeVARCHAR(20)
   335  flink.schema.4.name partition
   335  flink.table.typeCOPY_ON_WRITE
   335  transient_lastDdlTime   1691804292
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9437: [HUDI-6689] Add record index validation in MDT validator

2023-08-14 Thread via GitHub


hudi-bot commented on PR #9437:
URL: https://github.com/apache/hudi/pull/9437#issuecomment-1678003863

   
   ## CI report:
   
   * 699793358327fe0caf4df52a0ee199a9c54ab58d Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19290)
 
   * b25b5402c1e3e14264c6bbfd38910f4b93b8a871 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19291)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


