date:20240411

Re: [PR] [HUDI-7378] Fix Spark SQL DML with custom key generator [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10615:
URL: https://github.com/apache/hudi/pull/10615#issuecomment-2051092856

   
   ## CI report:
   
   * dfab8e1285bf0241eea2e71f9d85607c647446d7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23212)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] [HUDI-7577] Avoid MDT compaction instant time conflicts [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10992:
URL: https://github.com/apache/hudi/pull/10992#issuecomment-2051093641

   
   ## CI report:
   
   * 1f421909625781304a531ccadcbf6a37ca5185a4 UNKNOWN
   * c8423769cd6ef01b7afcaafd63f51b9f450ec7ea Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23208)
 
   * d8dda49ff97feca5172346047aacb007746568ae Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23214)
 
   
   

Re: [PR] [HUDI-7577] Avoid MDT compaction instant time conflicts [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10992:
URL: https://github.com/apache/hudi/pull/10992#issuecomment-2051085060

   
   ## CI report:
   
   * 1f421909625781304a531ccadcbf6a37ca5185a4 UNKNOWN
   * c8423769cd6ef01b7afcaafd63f51b9f450ec7ea Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23208)
 
   * d8dda49ff97feca5172346047aacb007746568ae UNKNOWN
   
   

(hudi) branch master updated: [HUDI-7601] Add heartbeat mechanism to refresh lock (#10994)

2024-04-11 Thread leesf

This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 4fbf5c52f19 [HUDI-7601] Add heartbeat mechanism to refresh lock 
(#10994)
4fbf5c52f19 is described below

commit 4fbf5c52f19b1e3192b09f9362e35dd22c3a0da6
Author: Yann Byron 
AuthorDate: Fri Apr 12 14:12:04 2024 +0800

[HUDI-7601] Add heartbeat mechanism to refresh lock (#10994)

* [HUDI-7601] Add heartbeat mechanism to refresh lock
---
 .../org/apache/hudi/config/HoodieLockConfig.java   | 13 +++
 .../hudi/common/config/LockConfiguration.java  |  3 ++
 .../hudi/hive/transaction/lock/Heartbeat.java  | 42 ++
 .../lock/HiveMetastoreBasedLockProvider.java   | 23 ++--
 4 files changed, 79 insertions(+), 2 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java
index fa38da8f8ab..1c51b6db8b3 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java
@@ -37,6 +37,7 @@ import java.util.Properties;
 import static 
org.apache.hudi.common.config.LockConfiguration.DEFAULT_LOCK_ACQUIRE_NUM_RETRIES;
 import static 
org.apache.hudi.common.config.LockConfiguration.DEFAULT_LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS;
 import static 
org.apache.hudi.common.config.LockConfiguration.DEFAULT_LOCK_ACQUIRE_WAIT_TIMEOUT_MS;
+import static 
org.apache.hudi.common.config.LockConfiguration.DEFAULT_LOCK_HEARTBEAT_INTERVAL_MS;
 import static 
org.apache.hudi.common.config.LockConfiguration.DEFAULT_ZK_CONNECTION_TIMEOUT_MS;
 import static 
org.apache.hudi.common.config.LockConfiguration.DEFAULT_ZK_SESSION_TIMEOUT_MS;
 import static 
org.apache.hudi.common.config.LockConfiguration.FILESYSTEM_LOCK_EXPIRE_PROP_KEY;
@@ -50,6 +51,7 @@ import static 
org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_NUM_R
 import static 
org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_RETRY_MAX_WAIT_TIME_IN_MILLIS_PROP_KEY;
 import static 
org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS_PROP_KEY;
 import static 
org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_WAIT_TIMEOUT_MS_PROP_KEY;
+import static 
org.apache.hudi.common.config.LockConfiguration.LOCK_HEARTBEAT_INTERVAL_MS_KEY;
 import static org.apache.hudi.common.config.LockConfiguration.LOCK_PREFIX;
 import static 
org.apache.hudi.common.config.LockConfiguration.ZK_BASE_PATH_PROP_KEY;
 import static 
org.apache.hudi.common.config.LockConfiguration.ZK_CONNECTION_TIMEOUT_MS_PROP_KEY;
@@ -112,6 +114,12 @@ public class HoodieLockConfig extends HoodieConfig {
   .sinceVersion("0.8.0")
   .withDocumentation("Timeout in ms, to wait on an individual lock 
acquire() call, at the lock provider.");
 
+  public static final ConfigProperty LOCK_HEARTBEAT_INTERVAL_MS = 
ConfigProperty
+  .key(LOCK_HEARTBEAT_INTERVAL_MS_KEY)
+  .defaultValue(DEFAULT_LOCK_HEARTBEAT_INTERVAL_MS)
+  .sinceVersion("1.0.0")
+  .withDocumentation("Heartbeat interval in ms, to send a heartbeat to 
indicate that hive client holding locks.");
+
   public static final ConfigProperty FILESYSTEM_LOCK_PATH = 
ConfigProperty
   .key(FILESYSTEM_LOCK_PATH_PROP_KEY)
   .noDefaultValue()
@@ -343,6 +351,11 @@ public class HoodieLockConfig extends HoodieConfig {
   return this;
 }
 
+public HoodieLockConfig.Builder withHeartbeatIntervalInMillis(Long 
intervalInMillis) {
+  lockConfig.setValue(LOCK_HEARTBEAT_INTERVAL_MS, 
String.valueOf(intervalInMillis));
+  return this;
+}
+
 public HoodieLockConfig.Builder 
withConflictResolutionStrategy(ConflictResolutionStrategy 
conflictResolutionStrategy) {
   lockConfig.setValue(WRITE_CONFLICT_RESOLUTION_STRATEGY_CLASS_NAME, 
conflictResolutionStrategy.getClass().getName());
   return this;
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/config/LockConfiguration.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/config/LockConfiguration.java
index 1171dcf3fce..9d79be37810 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/config/LockConfiguration.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/config/LockConfiguration.java
@@ -45,6 +45,9 @@ public class LockConfiguration implements Serializable {
   public static final String LOCK_ACQUIRE_WAIT_TIMEOUT_MS_PROP_KEY = 
LOCK_PREFIX + "wait_time_ms";
   public static final int DEFAULT_LOCK_ACQUIRE_WAIT_TIMEOUT_MS = 60 * 1000;
 
+  public static final String LOCK_HEARTBEAT_INTERVAL_MS_KEY = LOCK_PREFIX + 
"heartbeat_interval_ms";
+  public static final int DEFAULT_LOCK_HEARTBEAT_INTERVAL_MS = 60 * 1
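
The diff above is cut off, and the body of `Heartbeat.java` is not shown in this excerpt. As a rough illustration of what a heartbeat on a configurable interval generally looks like, here is a minimal Python sketch. The class name, the `refresh` callback, and the interval used in the usage lines are illustrative assumptions; this is not the code added by the commit.

```python
import threading
import time


class Heartbeat:
    """Calls refresh() every interval_s seconds on a daemon thread until stopped.

    Hypothetical sketch; not the Heartbeat class from the commit.
    """

    def __init__(self, interval_s, refresh):
        self._interval_s = interval_s
        self._refresh = refresh
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # Event.wait returns False on timeout (send a beat)
        # and True once stop() has been called (exit the loop).
        while not self._stop.wait(self._interval_s):
            self._refresh()

    def stop(self):
        self._stop.set()
        self._thread.join()


# Usage: count beats for a short while, then stop.
beats = []
hb = Heartbeat(0.01, lambda: beats.append(1))
hb.start()
time.sleep(0.1)
hb.stop()
beats_at_stop = len(beats)
```

Waiting on an event rather than sleeping lets `stop()` take effect immediately instead of after a full interval, which matters when the interval is on the order of a minute.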

Re: [PR] [HUDI-7601] Add heartbeat mechanism to refresh lock [hudi]

2024-04-11 Thread via GitHub



leesf merged PR #10994:
URL: https://github.com/apache/hudi/pull/10994



Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]

2024-04-11 Thread via GitHub



danny0405 commented on PR #11001:
URL: https://github.com/apache/hudi/pull/11001#issuecomment-2051024781

   @the-other-tim-brown Can you fix the Azure CI failure?



Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10954:
URL: https://github.com/apache/hudi/pull/10954#issuecomment-2051021308

   
   ## CI report:
   
   * dbdefad652d5c51b19175ca70374b7737a004952 UNKNOWN
   * 8f1ba6d46d8777f39c522d8bcac545ba3d4fd544 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23211)
 
   
   

Re: [PR] [MINOR] Streamer test setup performance [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10806:
URL: https://github.com/apache/hudi/pull/10806#issuecomment-2051021047

   
   ## CI report:
   
   * e0414708ebbd734156c0383cb4e5dbfe5ff4151a UNKNOWN
   * 11c19fa8fd39ed058a4e3487c99c793610b61564 UNKNOWN
   * 72970e41f06fb68466eba338ddfbd6553d2e96b1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23210)
 
   * d9f583043f1a5ffd532d613b2ce95aa7a8fddc47 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23213)
 
   
   

Re: [PR] [MINOR] Streamer test setup performance [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10806:
URL: https://github.com/apache/hudi/pull/10806#issuecomment-2051013783

   
   ## CI report:
   
   * e0414708ebbd734156c0383cb4e5dbfe5ff4151a UNKNOWN
   * 11c19fa8fd39ed058a4e3487c99c793610b61564 UNKNOWN
   * 72970e41f06fb68466eba338ddfbd6553d2e96b1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23210)
 
   * d9f583043f1a5ffd532d613b2ce95aa7a8fddc47 UNKNOWN
   
   

[I] Different system parse different time zone of timestamp type from the parquet file created by hudi [hudi]

2024-04-11 Thread via GitHub



AshinGau opened a new issue, #11003:
URL: https://github.com/apache/hudi/issues/11003

   **Describe the problem you faced**
   I am a committer of [Doris](https://github.com/apache/doris). When I use 
Doris to read a parquet file created by Hudi, the timestamp values come out 
8 hours earlier than expected. I then checked the result with other tools 
(Arrow, Trino, Spark), and it seems that different systems interpret the time 
zone of the timestamp type differently:
   
![img_v3_029s_b170ce06-8440-4e8d-b37b-1962ea92b7bg](https://github.com/apache/hudi/assets/19337507/27e39867-8117-4b2c-a725-cb9a3eb20544)
   Arrow, Doris, and Trino return the same results, while the results from 
Hudi and spark-shell are 8 hours later.
   
   **To Reproduce**
   Spark 3.3 + Hudi 0.14.1
   1. create hudi table
   ```
   create table hudi_evolution_mor(
     id int,
     name string,
     create_time timestamp,
     price double,
     ts bigint,
     fs_col string) using hudi
   options(
     type = 'mor',
     primaryKey = 'id'
   )
   ```
   2. insert data
   ```
   insert into hudi_evolution_mor values
   (1, 'name1', timestamp'2023-09-17 13:14:35.142', 1.01, 1001, '2023-09-17'),
   (2, 'name2', timestamp'2024-03-10 15:17:21.4172', 2.02, 1002, '2024-03-10');
   ```
   
   **Expected behavior**
   It may be correct to output `2023-09-17 13:14:35.142`, because the literal 
`timestamp'2023-09-17 13:14:35.142'` was inserted. However, the parquet file 
created by Hudi indicates that the timestamp type is adjusted to UTC:
   
![img_v3_029s_8f01c4d1-f8d3-47cf-93ea-8205fbdc8f2g](https://github.com/apache/hudi/assets/19337507/3434ee04-8f40-42e0-a4b8-8ac6023c581d)
   
![img_v3_029s_3e622d05-90a9-4912-a973-8c37bdcf331g](https://github.com/apache/hudi/assets/19337507/cfa7892b-5260-42a8-8599-465f790cee72)
   The metadata shows that the timestamp is relative to UTC, not the local 
time zone, according to the [definition of 
isAdjustedToUTC=true](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc)
   
![image](https://github.com/apache/hudi/assets/19337507/1f1139a6-f58f-4026-8d9c-d828dc81779b)
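
To make the 8-hour gap concrete, here is a minimal Python sketch of instant semantics. It assumes the writer's session time zone was UTC+8; the issue does not state this, and the offset is only inferred from the 8-hour difference. The helper names `to_utc_micros` and `render` are hypothetical and do not correspond to any Hudi, Spark, or Doris API.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the writer's session time zone is UTC+8 (inferred from the 8-hour gap).
SESSION_TZ = timezone(timedelta(hours=8))
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_utc_micros(wall_clock: str, tz: timezone) -> int:
    """Interpret a wall-clock string in tz and return microseconds since the Unix epoch."""
    local = datetime.strptime(wall_clock, "%Y-%m-%d %H:%M:%S.%f").replace(tzinfo=tz)
    return (local - EPOCH) // timedelta(microseconds=1)


def render(micros: int, tz: timezone) -> str:
    """Render a stored UTC instant as a millisecond-precision wall-clock string in tz."""
    instant = EPOCH + timedelta(microseconds=micros)
    return instant.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]


stored = to_utc_micros("2023-09-17 13:14:35.142", SESSION_TZ)
as_session = render(stored, SESSION_TZ)  # reader that applies the session time zone
as_utc = render(stored, timezone.utc)    # reader that displays the raw UTC instant

print(as_session)  # 2023-09-17 13:14:35.142
print(as_utc)      # 2023-09-17 05:14:35.142
```

Under that assumption, both outputs describe the same stored instant; the readers differ only in which time zone they use for display.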
   
   **Environment Description**
   
   * Hudi version : 0.14.1
   
   * Spark version : 3.3.2
   
   * Hive version : 3.1.1
   
   * Hadoop version : 3.2.2
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (no) :
   



Re: [PR] [HUDI-7378] Fix Spark SQL DML with custom key generator [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10615:
URL: https://github.com/apache/hudi/pull/10615#issuecomment-2050971324

   
   ## CI report:
   
   * 989ffd5220e4f5ae666a05afdd0e7de3c6543972 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23202)
 
   * dfab8e1285bf0241eea2e71f9d85607c647446d7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23212)
 
   
   

Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #11001:
URL: https://github.com/apache/hudi/pull/11001#issuecomment-2050965613

   
   ## CI report:
   
   * fe5ed81020fb8d974c306f61a222f9583e2dab29 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23209)
 
   
   

Re: [PR] [HUDI-7378] Fix Spark SQL DML with custom key generator [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10615:
URL: https://github.com/apache/hudi/pull/10615#issuecomment-2050965097

   
   ## CI report:
   
   * 989ffd5220e4f5ae666a05afdd0e7de3c6543972 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23202)
 
   * dfab8e1285bf0241eea2e71f9d85607c647446d7 UNKNOWN
   
   

Re: [PR] [HUDI-7290] Don't assume ReplaceCommits are always Clustering [hudi]

2024-04-11 Thread via GitHub



bvaradar merged PR #10479:
URL: https://github.com/apache/hudi/pull/10479



(hudi) branch master updated (a41d7aeafed -> c9256e5e784)

2024-04-11 Thread vbalaji

This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from a41d7aeafed [HUDI-7605] Allow merger strategy to be set in spark sql 
writer (#10999)
 add c9256e5e784 [HUDI-7290]  Don't assume ReplaceCommits are always 
Clustering (#10479)

No new revisions were added by this update.

Summary of changes:
 .../hudi/client/BaseHoodieTableServiceClient.java  | 10 ---
 .../org/apache/hudi/table/marker/WriteMarkers.java |  2 ++
 .../table/timeline/HoodieDefaultTimeline.java  | 31 --
 .../hudi/common/table/timeline/HoodieTimeline.java | 11 
 .../table/view/AbstractTableFileSystemView.java|  5 +---
 .../table/view/TestHoodieTableFileSystemView.java  | 30 +++--
 .../clustering/ClusteringPlanSourceFunction.java   |  2 +-
 .../java/org/apache/hudi/util/ClusteringUtil.java  |  2 +-
 .../apache/hudi/utilities/HoodieClusteringJob.java | 12 -
 9 files changed, 86 insertions(+), 19 deletions(-)

Re: [PR] [HUDI-7577] Avoid MDT compaction instant time conflicts [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10992:
URL: https://github.com/apache/hudi/pull/10992#issuecomment-2050913689

   
   ## CI report:
   
   * 1f421909625781304a531ccadcbf6a37ca5185a4 UNKNOWN
   * c8423769cd6ef01b7afcaafd63f51b9f450ec7ea Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23208)
 
   
   

Re: [PR] [MINOR] Streamer test setup performance [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10806:
URL: https://github.com/apache/hudi/pull/10806#issuecomment-2050907684

   
   ## CI report:
   
   * e0414708ebbd734156c0383cb4e5dbfe5ff4151a UNKNOWN
   * 11c19fa8fd39ed058a4e3487c99c793610b61564 UNKNOWN
   * c68fd47d3080f055eb4b688f7e75b261ff6803d6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23207)
 
   * 72970e41f06fb68466eba338ddfbd6553d2e96b1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23210)
 
   
   

Re: [PR] [HUDI-7606] Unpersist RDDs after table services, mainly compaction [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #11000:
URL: https://github.com/apache/hudi/pull/11000#issuecomment-2050873544

   
   ## CI report:
   
   * 12cf06d732847bf9ca925bf2bb4e2e0eb39b8855 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23205)
 
   
   

Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10954:
URL: https://github.com/apache/hudi/pull/10954#issuecomment-2050873413

   
   ## CI report:
   
   * dbdefad652d5c51b19175ca70374b7737a004952 UNKNOWN
   * f6c5bebf97872d05f27137febbc727d5ad9f8e78 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23200)
 
   * 8f1ba6d46d8777f39c522d8bcac545ba3d4fd544 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23211)
 
   
   

[I] [SUPPORT] can't retrieve original partition column value when exacting date with CustomKeyGenerator [hudi]

2024-04-11 Thread via GitHub



liangchen-datanerd opened a new issue, #11002:
URL: https://github.com/apache/hudi/issues/11002

   
   **problem**
   
   The requirement was to extract the date from the event_time column and use 
it as the partition value. According to the official Hudi docs, the ingestion 
config would look like this:
   ```
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \
   --hoodie-conf hoodie.keygen.timebased.timestamp.type="DATE_STRING" \
   --hoodie-conf hoodie.keygen.timebased.input.dateformat="-MM-dd HH:mm:ss" \
   --hoodie-conf hoodie.keygen.timebased.output.dateformat="-MM-dd" \
   ```
   The partition value is correct, but when I query the table, the partition 
column contains the partition value instead of the original value. For 
example, if event_time is '2023-01-01 12:00:00', the partition value is 
2023-01-01. When I query the Hudi table, event_time is 2023-01-01, not the 
original value. But when I read the parquet file directly, event_time has the 
original value.
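
To spell out the expectation, here is a minimal Python sketch of the transformation. It only illustrates deriving a date partition value from a timestamp string and is not how Hudi's key generator is implemented. The helper `partition_value` and the format strings (Python `strptime` syntax, not Hudi config syntax) are assumptions for illustration.

```python
from datetime import datetime

# Formats assumed for illustration (Python strptime syntax, not Hudi config syntax).
INPUT_FORMAT = "%Y-%m-%d %H:%M:%S"
OUTPUT_FORMAT = "%Y-%m-%d"


def partition_value(event_time: str) -> str:
    """Derive a date partition value from a timestamp string (hypothetical helper)."""
    return datetime.strptime(event_time, INPUT_FORMAT).strftime(OUTPUT_FORMAT)


record = {"employee_name": "James", "event_time": "2023-01-01 12:00:00"}
partition = partition_value(record["event_time"])

# Expectation: the partition path holds the derived value,
# while the record itself keeps the original event_time.
print(partition)             # 2023-01-01
print(record["event_time"])  # 2023-01-01 12:00:00
```

The point of the sketch is that the derived value and the source column are separate: computing the partition does not need to overwrite the column it was computed from.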
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:   
   using pyspark shell. 
   ```
   pyspark \
   --master spark://node1:7077 \
   --packages 'org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk:1.11.469' \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
   --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
   --conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar
   ```
   
   ```
   # Create a DataFrame
   data = [("James", "Sales", "2023-01-02 12:12:23"),
           ("Michael", "Sales", "2023-01-01 12:12:23"),
           ("Robert", "Sales", "2023-01-02 01:12:23"),
           ("Maria", "Finance", "2023-01-01 01:15:23")]
   df = spark.createDataFrame(data, ["employee_name", "department", "time"])
   
   # Define Hudi options
   hudi_options = {
       "hoodie.table.name": "employee_hudi",
       "hoodie.datasource.write.operation": "insert_overwrite_table",
       "hoodie.datasource.write.recordkey.field": "employee_name",
       "hoodie.datasource.write.partitionpath.field": "time:TIMESTAMP",
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.CustomKeyGenerator",
       "hoodie.keygen.timebased.timestamp.type": "DATE_STRING",
       "hoodie.keygen.timebased.input.dateformat": "-MM-dd HH:mm:ss",
       "hoodie.keygen.timebased.output.dateformat": "-MM-dd"
   }
   
   # Write DataFrame to Hudi
   df.write.format("hudi") \
       .options(**hudi_options) \
       .mode("overwrite") \
       .save("s3a://hudi-warehouse/test/")
   
   # Query the Hudi table
   spark.read.format("hudi") \
       .option("hoodie.schema.on.read.enable", "true") \
       .load("s3a://hudi-warehouse/test/") \
       .show(truncate=False)
   
   # Read the parquet file directly
   spark.read.format("parquet") \
       .load("s3a://hudi-warehouse/test/2023-01-01/ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet") \
       .show(truncate=False)
   
   when I query hudi table the result:
   ```
   
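   +-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+
   |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                         |employee_name|department|time      |
   +-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+
   |20240411142532923  |20240411142532923_1_0|James             |2023-01-02            |ea678686-d3d3-4555-b894-30ecb1da2a47-0_1-134-190_20240411142532923.parquet|James        |Sales     |2023-01-02|
   |20240411142532923  |20240411142532923_1_1|Robert            |2023-01-02            |ea678686-d3d3-4555-b894-30ecb1da2a47-0_1-134-190_20240411142532923.parquet|Robert       |Sales     |2023-01-02|
   |20240411142532923  |20240411142532923_0_0|Michael           |2023-01-01            |ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet|Michael      |Sales     |2023-01-01|
   |20240411142532923  |20240411142532923_0_1|Maria             |2023-01-01            |ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet|Maria        |Finance   |2023-01-01|
   +-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+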
   ```
   when I read the parquet file the result:  
   ```
   
+---+-+--+--+---

Re: [PR] [MINOR] Streamer test setup performance [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10806:
URL: https://github.com/apache/hudi/pull/10806#issuecomment-2050867333

   
   ## CI report:
   
   * e0414708ebbd734156c0383cb4e5dbfe5ff4151a UNKNOWN
   * 10f0484ea6b5b820c257711dc8cd4da9cfa366cd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22764)
 
   * 11c19fa8fd39ed058a4e3487c99c793610b61564 UNKNOWN
   * c68fd47d3080f055eb4b688f7e75b261ff6803d6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23207)
 
   * 72970e41f06fb68466eba338ddfbd6553d2e96b1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23210)
 
   
   

Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10954:
URL: https://github.com/apache/hudi/pull/10954#issuecomment-2050867560

   
   ## CI report:
   
   * dbdefad652d5c51b19175ca70374b7737a004952 UNKNOWN
   * f6c5bebf97872d05f27137febbc727d5ad9f8e78 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23200)
 
   * 8f1ba6d46d8777f39c522d8bcac545ba3d4fd544 UNKNOWN
   
   

Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #11001:
URL: https://github.com/apache/hudi/pull/11001#issuecomment-2050862646

   
   ## CI report:
   
   * 09e4971db9ad7d5677a5757ed0b718e24ca4fb0b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23206)
 
   * fe5ed81020fb8d974c306f61a222f9583e2dab29 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23209)
 
   
   

Re: [PR] [MINOR] Hudi CLI 'version' command output empty string [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10973:
URL: https://github.com/apache/hudi/pull/10973#issuecomment-2050862535

   
   ## CI report:
   
   * fac97b368a646aeddcc7e6728d7228f75f30bd82 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23204)
 
   
   

[jira] [Updated] (HUDI-7595) Investigate and fix flaky tests in ITTestHoodieDataSource

2024-04-11 Thread Vova Kolmakov (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vova Kolmakov updated HUDI-7595:

Labels: test-stability  (was: )

> Investigate and fix flaky tests in ITTestHoodieDataSource
> -
>
> Key: HUDI-7595
> URL: https://issues.apache.org/jira/browse/HUDI-7595
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: Vova Kolmakov
>Priority: Major
>  Labels: test-stability
>
>  
> {code:java}
> Failures: 
> 2024-04-10T00:47:48.7776428Z [ERROR]   
> ITTestHoodieDataSource.testWriteAndReadWithDataSkipping:1613 
> 2024-04-10T00:47:48.7779109Z Expected: is "[+I[id1, Danny, 23, 
> 1970-01-01T00:00:01, par1], +I[id2, Stephen, 33, 1970-01-01T00:00:02, par1], 
> +I[id3, Julian, 53, 1970-01-01T00:00:03, par2], +I[id4, Fabian, 31, 
> 1970-01-01T00:00:04, par2], +I[id5, Sophia, 18, 1970-01-01T00:00:05, par3], 
> +I[id6, Emma, 20, 1970-01-01T00:00:06, par3], +I[id7, Bob, 44, 
> 1970-01-01T00:00:07, par4], +I[id8, Han, 56, 1970-01-01T00:00:08, par4]]"
> 2024-04-10T00:47:48.7780657Z      but: was "[]"
> 2024-04-10T00:47:48.7781108Z [ERROR]   
> ITTestHoodieDataSource.testWriteAndReadWithDataSkipping:1613 
> 2024-04-10T00:47:48.7783330Z Expected: is "[+I[id1, Danny, 23, 
> 1970-01-01T00:00:01, par1], +I[id2, Stephen, 33, 1970-01-01T00:00:02, par1], 
> +I[id3, Julian, 53, 1970-01-01T00:00:03, par2], +I[id4, Fabian, 31, 
> 1970-01-01T00:00:04, par2], +I[id5, Sophia, 18, 1970-01-01T00:00:05, par3], 
> +I[id6, Emma, 20, 1970-01-01T00:00:06, par3], +I[id7, Bob, 44, 
> 1970-01-01T00:00:07, par4], +I[id8, Han, 56, 1970-01-01T00:00:08, par4]]"
> 2024-04-10T00:47:48.7784796Z      but: was "[]"
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HUDI-7595) Investigate and fix flaky tests in ITTestHoodieDataSource

2024-04-11 Thread Vova Kolmakov (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vova Kolmakov updated HUDI-7595:

Component/s: tests-ci

> Investigate and fix flaky tests in ITTestHoodieDataSource
> -
>
> Key: HUDI-7595
> URL: https://issues.apache.org/jira/browse/HUDI-7595
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: Vova Kolmakov
>Priority: Major
>
>  
> {code:java}
> Failures: 
> 2024-04-10T00:47:48.7776428Z [ERROR]   
> ITTestHoodieDataSource.testWriteAndReadWithDataSkipping:1613 
> 2024-04-10T00:47:48.7779109Z Expected: is "[+I[id1, Danny, 23, 
> 1970-01-01T00:00:01, par1], +I[id2, Stephen, 33, 1970-01-01T00:00:02, par1], 
> +I[id3, Julian, 53, 1970-01-01T00:00:03, par2], +I[id4, Fabian, 31, 
> 1970-01-01T00:00:04, par2], +I[id5, Sophia, 18, 1970-01-01T00:00:05, par3], 
> +I[id6, Emma, 20, 1970-01-01T00:00:06, par3], +I[id7, Bob, 44, 
> 1970-01-01T00:00:07, par4], +I[id8, Han, 56, 1970-01-01T00:00:08, par4]]"
> 2024-04-10T00:47:48.7780657Z      but: was "[]"
> 2024-04-10T00:47:48.7781108Z [ERROR]   
> ITTestHoodieDataSource.testWriteAndReadWithDataSkipping:1613 
> 2024-04-10T00:47:48.7783330Z Expected: is "[+I[id1, Danny, 23, 
> 1970-01-01T00:00:01, par1], +I[id2, Stephen, 33, 1970-01-01T00:00:02, par1], 
> +I[id3, Julian, 53, 1970-01-01T00:00:03, par2], +I[id4, Fabian, 31, 
> 1970-01-01T00:00:04, par2], +I[id5, Sophia, 18, 1970-01-01T00:00:05, par3], 
> +I[id6, Emma, 20, 1970-01-01T00:00:06, par3], +I[id7, Bob, 44, 
> 1970-01-01T00:00:07, par4], +I[id8, Han, 56, 1970-01-01T00:00:08, par4]]"
> 2024-04-10T00:47:48.7784796Z      but: was "[]"
> {code}
>  




[jira] [Updated] (HUDI-7578) Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance

2024-04-11 Thread Ethan Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7578:

Fix Version/s: 0.15.0
   1.0.0

> Avoid unnecessary rewriting when copy old data from old base to new base file 
> to improve compaction performance 
> 
>
> Key: HUDI-7578
> URL: https://issues.apache.org/jira/browse/HUDI-7578
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Jing Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> After upgrading a Hudi table from version 0.10 to version 0.14, the compaction 
> job became much slower.
> The Hudi table is a MOR table without a partition field, and it does not 
> use any schema evolution.
> The compaction job finishes in 52 minutes on version 0.14, but in 25 minutes 
> on version 0.10.
> In version 0.14, the task jstack is also much more complex. It includes 
> the following content:
> !https://private-user-images.githubusercontent.com/1525333/320377766-9394a3b4-3074-4ba5-bd07-7c73f195085f.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTI1NjQ4ODAsIm5iZiI6MTcxMjU2NDU4MCwicGF0aCI6Ii8xNTI1MzMzLzMyMDM3Nzc2Ni05Mzk0YTNiNC0zMDc0LTRiYTUtYmQwNy03YzczZjE5NTA4NWYucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDQwOCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA0MDhUMDgyMzAwWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9ZTk3M2E1NjVkZDYwNTZiNTllZmUwOWEzOTNlMzEwMDA5NDBjYzk1NDE1ZDk4NjQ5ODM0ZjM3N2MwMmFmNzQ3ZSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.XCgE-sg9BovCyB7USURbPJfcaTB0NKLopRDZJXB-8os!
> After comparing versions 0.14 and 0.10, we found a difference in how old 
> records are copied from the old base file to the new base file.
> In version 0.14, this copy is much more expensive.
> !https://private-user-images.githubusercontent.com/1525333/320378794-879b0f8e-dbc8-458b-9b45-afdced25580c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTI1NjQ4ODAsIm5iZiI6MTcxMjU2NDU4MCwicGF0aCI6Ii8xNTI1MzMzLzMyMDM3ODc5NC04NzliMGY4ZS1kYmM4LTQ1OGItOWI0NS1hZmRjZWQyNTU4MGMucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDQwOCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA0MDhUMDgyMzAwWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NTc0ZjIzNTQ5NGFhMjY4NDBjNGU0MzFmM2MyY2JhZjVjNjM5YmU1Mjk1Njk5MmM1MjA0NDI1M2FiMjAxZjkzYiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.Vwjh9nyBAXiVEgSTquFVke-8brqi87QkmgI7uYB5ooI!
> !https://private-user-images.githubusercontent.com/1525333/320379033-d22835b2-7d6c-44ae-aaf1-967d1622c9ae.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTI1NjQ4ODAsIm5iZiI6MTcxMjU2NDU4MCwicGF0aCI6Ii8xNTI1MzMzLzMyMDM3OTAzMy1kMjI4MzViMi03ZDZjLTQ0YWUtYWFmMS05NjdkMTYyMmM5YWUucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDQwOCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA0MDhUMDgyMzAwWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MDBjYzk3OTEyM2I3Njc2OTJiODZlYjI1ZjcxZjA0ZjE0NDdlNzI1YTljYmZkOTA5ODNmNmE5YjVkNDVkMTkwZCZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.5KQiAFWIbAV3HXGRb4K0c5J2R8C_BRvlsjE5dGvkDDA!
>  
> !https://private-user-images.githubusercontent.com/1525333/320379285-438984f7-5d3f-4635-ae64-d3221d73cc34.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTI1NjQ4ODAsIm5iZiI6MTcxMjU2NDU4MCwicGF0aCI6Ii8xNTI1MzMzLzMyMDM3OTI4NS00Mzg5ODRmNy01ZDNmLTQ2MzUtYWU2NC1kMzIyMWQ3M2NjMzQucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDQwOCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA0MDhUMDgyMzAwWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9ZDg0M2M1OTBkOTYxNWUwYTU2NzI3MmE3NzhlYTQzY2M3YmFmZTdlZWQ1YWNhMGQzY2FhMjk1ZTQ1ODI1MmQxMCZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.OhoLRMr_PSLPgs9CY3vRcc_kwhXKped41HnsXR35poE!
> !https://private-user-images.githubusercontent.com/1525333/320379415-e1d5ddb4-1544-4f17-b9f9-6193765c8bed.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtl

Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]

2024-04-11 Thread via GitHub



jonvex commented on code in PR #10954:
URL: https://github.com/apache/hudi/pull/10954#discussion_r1561923113


##
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestSparkHoodieParquetReader.java:
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.functional;
+
+import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.util.JavaConversions;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.junit.jupiter.api.Tag;
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+@Tag("functional")
+public class TestSparkHoodieParquetReader extends TestBootstrapReadBase {

Review Comment:
   there are some util methods in there that I like to use but we don't have to.




Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]

2024-04-11 Thread via GitHub



jonvex commented on code in PR #10954:
URL: https://github.com/apache/hudi/pull/10954#discussion_r1561922013


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/SparkHoodieParquetReaderBase.scala:
##
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.StructType
+
+abstract class SparkHoodieParquetReaderBase(enableVectorizedReader: Boolean,

Review Comment:
   Spark adapter is in the spark client so we need it here so that we can 
define a return type




Re: [PR] [HUDI-7577] Avoid MDT compaction instant time conflicts [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10992:
URL: https://github.com/apache/hudi/pull/10992#issuecomment-2050828266

   
   ## CI report:
   
   * 1f421909625781304a531ccadcbf6a37ca5185a4 UNKNOWN
   * d13639a9823e827c45e3e619eebf9c93c8c2085c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23186)
 
   * c8423769cd6ef01b7afcaafd63f51b9f450ec7ea Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23208)
 
   
   

[jira] [Updated] (HUDI-7607) Test with timestamp based key generator

2024-04-11 Thread Jonathan Vexler (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-7607:
--
Fix Version/s: 1.0.0

> Test with timestamp based key generator
> ---
>
> Key: HUDI-7607
> URL: https://issues.apache.org/jira/browse/HUDI-7607
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: hive
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
> Fix For: 1.0.0
>
>
> make sure timestamp based keygen works correctly for partition




[jira] [Created] (HUDI-7607) Test with timestamp based key generator

2024-04-11 Thread Jonathan Vexler (Jira)

Jonathan Vexler created HUDI-7607:
-

 Summary: Test with timestamp based key generator
 Key: HUDI-7607
 URL: https://issues.apache.org/jira/browse/HUDI-7607
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: hive
Reporter: Jonathan Vexler
Assignee: Jonathan Vexler


make sure timestamp based keygen works correctly for partition




[jira] [Updated] (HUDI-7378) Fix Spark SQL DML with custom key generator

2024-04-11 Thread Ethan Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7378:

Reviewers: Jonathan Vexler, Sagar Sumit

> Fix Spark SQL DML with custom key generator
> ---
>
> Key: HUDI-7378
> URL: https://issues.apache.org/jira/browse/HUDI-7378
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>





Re: [PR] [HUDI-7577] Avoid MDT compaction instant time conflicts [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10992:
URL: https://github.com/apache/hudi/pull/10992#issuecomment-2050822230

   
   ## CI report:
   
   * 1f421909625781304a531ccadcbf6a37ca5185a4 UNKNOWN
   * d13639a9823e827c45e3e619eebf9c93c8c2085c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23186)
 
   * c8423769cd6ef01b7afcaafd63f51b9f450ec7ea UNKNOWN
   
   

Re: [PR] [MINOR] Streamer test setup performance [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10806:
URL: https://github.com/apache/hudi/pull/10806#issuecomment-2050821831

   
   ## CI report:
   
   * e0414708ebbd734156c0383cb4e5dbfe5ff4151a UNKNOWN
   * 10f0484ea6b5b820c257711dc8cd4da9cfa366cd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22764)
 
   * 11c19fa8fd39ed058a4e3487c99c793610b61564 UNKNOWN
   * c68fd47d3080f055eb4b688f7e75b261ff6803d6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23207)
 
   * 72970e41f06fb68466eba338ddfbd6553d2e96b1 UNKNOWN
   
   

Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #11001:
URL: https://github.com/apache/hudi/pull/11001#issuecomment-2050822300

   
   ## CI report:
   
   * 09e4971db9ad7d5677a5757ed0b718e24ca4fb0b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23206)
 
   * fe5ed81020fb8d974c306f61a222f9583e2dab29 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23209)
 
   
   

[jira] [Updated] (HUDI-7577) Avoid MDT compaction instant time conflicts

2024-04-11 Thread Danny Chen (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7577:
-
Status: Patch Available  (was: In Progress)

> Avoid MDT compaction instant time conflicts
> ---
>
> Key: HUDI-7577
> URL: https://issues.apache.org/jira/browse/HUDI-7577
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>





[jira] [Updated] (HUDI-7580) Inserting rows into partitioned table leads to data sanity issues

2024-04-11 Thread Sagar Sumit (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7580:
--
Sprint: Sprint 2024-03-25

> Inserting rows into partitioned table leads to data sanity issues
> -
>
> Key: HUDI-7580
> URL: https://issues.apache.org/jira/browse/HUDI-7580
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 1.0.0-beta1, 0.14.1
>Reporter: Vinaykumar Bhat
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 0.15.0, 1.0.0
>
>   Original Estimate: 4m
>  Remaining Estimate: 4m
>
> Came across this behaviour of partitioned tables when trying to debug some 
> other issue with functional-index. It seems that the column ordering gets 
> messed up while inserting records into a hudi table. Hence, a subsequent 
> query returns wrong results. An example follows:
>  
> The following is a scala test:
> {code:java}
> test("Test Create Functional Index") {
>   if (HoodieSparkUtils.gteqSpark3_2) {
>     withTempDir { tmp =>
>       val tableType = "cow"
>       val tableName = "rides"
>       val basePath = s"${tmp.getCanonicalPath}/$tableName"
>       spark.sql("set hoodie.metadata.enable=true")
>       spark.sql(
>         s"""
>            |create table $tableName (
>            |  id int,
>            |  name string,
>            |  price int,
>            |  ts long
>            |) using hudi
>            | options (
>            |  primaryKey ='id',
>            |  type = '$tableType',
>            |  preCombineField = 'ts',
>            |  hoodie.metadata.record.index.enable = 'true',
>            |  hoodie.datasource.write.recordkey.field = 'id'
>            | )
>            | partitioned by(price)
>            | location '$basePath'
>          """.stripMargin)
>       spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
>       spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 20)")
>       spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 20)")
>       spark.sql(s"select id, name, price, ts from $tableName").show(false)
>     }
>   }
> } {code}
>  
> The query returns the following result (note how *price* and *ts* columns are 
> mixed up). 
> {code:java}
> +---+----+-----+----+
> |id |name|price|ts  |
> +---+----+-----+----+
> |3  |a3  |20   |1000|
> |2  |a2  |20   |100 |
> |1  |a1  |1000 |10  |
> +---+----+-----+----+
>  {code}
>  
> Having the partition column as the last column in the schema does not cause 
> this problem. If the mixed-up columns are of incompatible datatypes, then the 
> insert fails with an error.
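
The swapped price and ts values in the report are consistent with values being bound by position against a stored column order in which the partition column comes last. The following is a minimal Java sketch of that symptom only. It is not Hudi or Spark code: the class ColumnOrderSketch, both helper methods, and the assumed stored order (id, name, ts, price) are hypothetical and chosen purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ColumnOrderSketch {
    // Hypothetical helper: binds values to the stored columns purely by position,
    // ignoring the column list written in the INSERT statement.
    static Map<String, Object> bindByPosition(List<String> storedColumns, List<Object> values) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (int i = 0; i < storedColumns.size(); i++) {
            row.put(storedColumns.get(i), values.get(i));
        }
        return row;
    }

    // Hypothetical helper: binds each value to the column named in the INSERT column list.
    static Map<String, Object> bindByName(List<String> insertColumns, List<Object> values) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (int i = 0; i < insertColumns.size(); i++) {
            row.put(insertColumns.get(i), values.get(i));
        }
        return row;
    }

    public static void main(String[] args) {
        // Column list and values of the first INSERT in the report.
        List<String> insertColumns = List.of("id", "name", "price", "ts");
        List<Object> values = List.of(1, "a1", 10, 1000L);
        // Assumption: the partition column (price) is stored after the data columns.
        List<String> storedColumns = List.of("id", "name", "ts", "price");

        System.out.println("by position: " + bindByPosition(storedColumns, values));
        System.out.println("by name:     " + bindByName(insertColumns, values));
    }
}
```

For the first row of the report, positional binding gives ts=10 and price=1000, which matches the query output shown above, while binding by name keeps price=10 and ts=1000. Whether this is the actual cause in Hudi is not established here.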




[jira] [Updated] (HUDI-7604) DataSourceWriteOptions.TABLE_NAME() does not work

2024-04-11 Thread Ethan Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7604:

Epic Link: HUDI-7537

> DataSourceWriteOptions.TABLE_NAME() does not work
> -
>
> Key: HUDI-7604
> URL: https://issues.apache.org/jira/browse/HUDI-7604
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, spark-sql
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Setting DataSourceWriteOptions.TABLE_NAME() does not work: if you set it, the 
> write fails with an error saying the table name is missing.




[jira] [Updated] (HUDI-7580) Inserting rows into partitioned table leads to data sanity issues

2024-04-11 Thread Sagar Sumit (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7580:
--
Status: In Progress  (was: Open)

> Inserting rows into partitioned table leads to data sanity issues
> -
>
> Key: HUDI-7580
> URL: https://issues.apache.org/jira/browse/HUDI-7580
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 1.0.0-beta1, 0.14.1
>Reporter: Vinaykumar Bhat
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 0.15.0, 1.0.0
>
>   Original Estimate: 4m
>  Remaining Estimate: 4m
>
> Came across this behaviour of partitioned tables when trying to debug some 
> other issue with functional-index. It seems that the column ordering gets 
> messed up while inserting records into a hudi table. Hence, a subsequent 
> query returns wrong results. An example follows:
>  
> The following is a scala test:
> {code:java}
> test("Test Create Functional Index") {
>   if (HoodieSparkUtils.gteqSpark3_2) {
>     withTempDir { tmp =>
>       val tableType = "cow"
>       val tableName = "rides"
>       val basePath = s"${tmp.getCanonicalPath}/$tableName"
>       spark.sql("set hoodie.metadata.enable=true")
>       spark.sql(
>         s"""
>            |create table $tableName (
>            |  id int,
>            |  name string,
>            |  price int,
>            |  ts long
>            |) using hudi
>            | options (
>            |  primaryKey ='id',
>            |  type = '$tableType',
>            |  preCombineField = 'ts',
>            |  hoodie.metadata.record.index.enable = 'true',
>            |  hoodie.datasource.write.recordkey.field = 'id'
>            | )
>            | partitioned by(price)
>            | location '$basePath'
>          """.stripMargin)
>       spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
>       spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 20)")
>       spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 20)")
>       spark.sql(s"select id, name, price, ts from $tableName").show(false)
>     }
>   }
> } {code}
>  
> The query returns the following result (note how *price* and *ts* columns are 
> mixed up). 
> {code:java}
> +---+----+-----+----+
> |id |name|price|ts  |
> +---+----+-----+----+
> |3  |a3  |20   |1000|
> |2  |a2  |20   |100 |
> |1  |a1  |1000 |10  |
> +---+----+-----+----+
>  {code}
>  
> Having the partition column as the last column in the schema does not cause 
> this problem. If the mixed-up columns are of incompatible datatypes, then the 
> insert fails with an error.




[jira] [Updated] (HUDI-7604) DataSourceWriteOptions.TABLE_NAME() does not work

2024-04-11 Thread Ethan Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7604:

Fix Version/s: 1.0.0

> DataSourceWriteOptions.TABLE_NAME() does not work
> -
>
> Key: HUDI-7604
> URL: https://issues.apache.org/jira/browse/HUDI-7604
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, spark-sql
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Setting DataSourceWriteOptions.TABLE_NAME() does not work: if you set it, the 
> write fails with an error saying the table name is missing.




[jira] [Updated] (HUDI-7605) Unable to set merger strategy with DataSourceWriteOptions.RECORD_MERGER_STRATEGY

2024-04-11 Thread Ethan Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7605:

Epic Link: HUDI-7322

> Unable to set merger strategy with 
> DataSourceWriteOptions.RECORD_MERGER_STRATEGY
> 
>
> Key: HUDI-7605
> URL: https://issues.apache.org/jira/browse/HUDI-7605
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, spark-sql
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Setting DataSourceWriteOptions.RECORD_MERGER_STRATEGY.key() should change the 
> merger strategy stored in the table configs.




[jira] [Updated] (HUDI-7605) Unable to set merger strategy with DataSourceWriteOptions.RECORD_MERGER_STRATEGY

2024-04-11 Thread Ethan Guo (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7605:

Fix Version/s: 1.0.0

> Unable to set merger strategy with 
> DataSourceWriteOptions.RECORD_MERGER_STRATEGY
> 
>
> Key: HUDI-7605
> URL: https://issues.apache.org/jira/browse/HUDI-7605
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, spark-sql
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Setting DataSourceWriteOptions.RECORD_MERGER_STRATEGY.key() should change the 
> merger strategy stored in the table configs.




Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #11001:
URL: https://github.com/apache/hudi/pull/11001#issuecomment-2050815033

   
   ## CI report:
   
   * 09e4971db9ad7d5677a5757ed0b718e24ca4fb0b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23206)
 
   * fe5ed81020fb8d974c306f61a222f9583e2dab29 UNKNOWN
   
   

Re: [PR] [MINOR] Hudi CLI 'version' command output empty string [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10973:
URL: https://github.com/apache/hudi/pull/10973#issuecomment-2050814856

   
   ## CI report:
   
   * e093cc8dec1a4aab10e29aad164569dbfd3a1667 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23203)
 
   * fac97b368a646aeddcc7e6728d7228f75f30bd82 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23204)
 
   
   

Re: [I] [SUPPORT] Issue with Repartition on Kafka Input DataFrame and Same Precombine Value Rows In One Batch [hudi]

2024-04-11 Thread via GitHub



brightwon commented on issue #10995:
URL: https://github.com/apache/hudi/issues/10995#issuecomment-2050814895

   @ad1happy2go Thank you for your reply.
   
   What I want is to speed up the tagging stage. Could you suggest a solution? 
I can achieve this by using repartition with a completely unique precombine 
field (such as the Kafka offset), but Hudi does not allow changing the 
precombine field of an existing table.
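
   To illustrate why a completely unique ordering field helps here, the following is a minimal Java sketch under stated assumptions. It is not Hudi's merge or tagging logic: the Event class, its fields, and both dedup helpers are hypothetical. It only shows that when two rows for the same key carry the same ordering value, a "keep the larger value" rule picks a winner based on arrival order (which a repartition can change), whereas ordering by a unique value such as the Kafka offset does not.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PrecombineTieSketch {
    // Hypothetical record shape: record key, ordering value, Kafka offset, payload.
    static final class Event {
        final String key;
        final long ts;
        final long offset;
        final String payload;

        Event(String key, long ts, long offset, String payload) {
            this.key = key;
            this.ts = ts;
            this.offset = offset;
            this.payload = payload;
        }
    }

    // Keeps one event per key, preferring the strictly larger ts.
    // On a tie the first-seen event wins, so the result depends on input order.
    static Map<String, Event> dedupByTs(List<Event> batch) {
        Map<String, Event> kept = new LinkedHashMap<>();
        for (Event e : batch) {
            kept.merge(e.key, e, (old, cur) -> cur.ts > old.ts ? cur : old);
        }
        return kept;
    }

    // Same rule, but ordered by the offset, which is assumed unique per event.
    static Map<String, Event> dedupByOffset(List<Event> batch) {
        Map<String, Event> kept = new LinkedHashMap<>();
        for (Event e : batch) {
            kept.merge(e.key, e, (old, cur) -> cur.offset > old.offset ? cur : old);
        }
        return kept;
    }

    public static void main(String[] args) {
        Event first = new Event("k1", 100L, 7L, "first");
        Event second = new Event("k1", 100L, 8L, "second");

        // Same two rows, two different arrival orders.
        System.out.println(dedupByTs(List.of(first, second)).get("k1").payload);
        System.out.println(dedupByTs(List.of(second, first)).get("k1").payload);
        System.out.println(dedupByOffset(List.of(first, second)).get("k1").payload);
        System.out.println(dedupByOffset(List.of(second, first)).get("k1").payload);
    }
}
```

   In this sketch the ts-based result changes with the input order, while the offset-based result is the same for both orders.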



Re: [PR] [MINOR] Streamer test setup performance [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10806:
URL: https://github.com/apache/hudi/pull/10806#issuecomment-2050814560

   
   ## CI report:
   
   * e0414708ebbd734156c0383cb4e5dbfe5ff4151a UNKNOWN
   * 10f0484ea6b5b820c257711dc8cd4da9cfa366cd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22764)
 
   * 11c19fa8fd39ed058a4e3487c99c793610b61564 UNKNOWN
   * c68fd47d3080f055eb4b688f7e75b261ff6803d6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23207)
 
   
   

[jira] [Updated] (HUDI-7580) Inserting rows into partitioned table leads to data sanity issues

2024-04-11 Thread Sagar Sumit (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7580:
--
Labels: hudi-1.0.0-beta2  (was: )

> Inserting rows into partitioned table leads to data sanity issues
> -
>
> Key: HUDI-7580
> URL: https://issues.apache.org/jira/browse/HUDI-7580
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 1.0.0-beta1, 0.14.1
>Reporter: Vinaykumar Bhat
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 0.15.0, 1.0.0
>
>   Original Estimate: 4m
>  Remaining Estimate: 4m
>
> Came across this behaviour of partitioned tables when trying to debug some 
> other issue with functional-index. It seems that the column ordering gets 
> messed up while inserting records into a hudi table. Hence, a subsequent 
> query returns wrong results. An example follows:
>  
> The following is a scala test:
> {code:java}
>   test("Test Create Functional Index") {
>     if (HoodieSparkUtils.gteqSpark3_2) {
>       withTempDir { tmp =>
>         val tableType = "cow"
>         val tableName = "rides"
>         val basePath = s"${tmp.getCanonicalPath}/$tableName"
>         spark.sql("set hoodie.metadata.enable=true")
>         spark.sql(
>           s"""
>              |create table $tableName (
>              |  id int,
>              |  name string,
>              |  price int,
>              |  ts long
>              |) using hudi
>              | options (
>              |  primaryKey ='id',
>              |  type = '$tableType',
>              |  preCombineField = 'ts',
>              |  hoodie.metadata.record.index.enable = 'true',
>              |  hoodie.datasource.write.recordkey.field = 'id'
>              | )
>              | partitioned by(price)
>              | location '$basePath'
>            """.stripMargin)
>         spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
>         spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 20)")
>         spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 20)")
>         spark.sql(s"select id, name, price, ts from $tableName").show(false)
>       }
>     }
>   } {code}
>  
> The query returns the following result (note how *price* and *ts* columns are 
> mixed up). 
> {code:java}
> +---+----+------+----+
> |id |name|price |ts  |
> +---+----+------+----+
> |3  |a3  |20    |1000|
> |2  |a2  |20    |100 |
> |1  |a1  |1000  |10  |
> +---+----+------+----+
>  {code}
>  
> Having the partition column as the last column in the schema does not cause 
> this problem. If the mixed-up columns are of incompatible datatypes, then the 
> insert fails with an error.
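
One possible reading of the symptom, offered as an assumption rather than a confirmed root cause: the partition column `price` is moved to the end of the stored schema while the inserted values are still bound by position. The self-contained Java sketch below uses only hypothetical names (`ColumnOrderSketch`, `storageOrder`, `bindByPosition`) and no Hudi or Spark API; it simply reproduces the same swap for the first inserted row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ColumnOrderSketch {
    // Hypothetical helper: returns the declared columns with the partition
    // column moved to the end (assumed layout for a partitioned table).
    static List<String> storageOrder(List<String> declared, String partitionCol) {
        List<String> order = new ArrayList<>(declared);
        order.remove(partitionCol);
        order.add(partitionCol);
        return order;
    }

    // Binds values to column names purely by position, ignoring names.
    static Map<String, Object> bindByPosition(List<String> columns, List<Object> values) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (int i = 0; i < columns.size(); i++) {
            row.put(columns.get(i), values.get(i));
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> declared = List.of("id", "name", "price", "ts");
        List<Object> values = List.of(1, "a1", 10, 1000);

        // Values bound against the declared order: price=10, ts=1000.
        System.out.println(bindByPosition(declared, values));

        // Values bound against the reordered schema: ts=10, price=1000,
        // which matches the swapped row for id 1 in the reported output.
        System.out.println(bindByPosition(storageOrder(declared, "price"), values));
    }
}
```

If this reading is right, it would also be consistent with the observation that a table whose partition column is already last does not show the problem, since reordering is then a no-op.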



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

(hudi) branch master updated (c870da2f375 -> a41d7aeafed)

2024-04-11 Thread yihua

This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from c870da2f375 [HUDI-6441] Passing custom Headers with Hudi Callback URL 
(#10970)
 add a41d7aeafed [HUDI-7605] Allow merger strategy to be set in spark sql 
writer (#10999)

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/hudi/HoodieSparkSqlWriter.scala |  1 +
 .../apache/hudi/functional/TestMORDataSource.scala   | 20 
 2 files changed, 21 insertions(+)

Re: [PR] [HUDI-7605] allow merger strategy to be set in spark sql writer [hudi]

2024-04-11 Thread via GitHub



yihua merged PR #10999:
URL: https://github.com/apache/hudi/pull/10999



Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #11001:
URL: https://github.com/apache/hudi/pull/11001#issuecomment-2050777376

   
   ## CI report:
   
   * 09e4971db9ad7d5677a5757ed0b718e24ca4fb0b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23206)
 
   
   

Re: [PR] [MINOR] Streamer test setup performance [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10806:
URL: https://github.com/apache/hudi/pull/10806#issuecomment-2050776939

   
   ## CI report:
   
   * e0414708ebbd734156c0383cb4e5dbfe5ff4151a UNKNOWN
   * 10f0484ea6b5b820c257711dc8cd4da9cfa366cd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22764)
 
   * 11c19fa8fd39ed058a4e3487c99c793610b61564 UNKNOWN
   * c68fd47d3080f055eb4b688f7e75b261ff6803d6 UNKNOWN
   
   

Re: [PR] [HUDI-7606] Unpersist RDDs after table services, mainly compaction [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #11000:
URL: https://github.com/apache/hudi/pull/11000#issuecomment-2050777340

   
   ## CI report:
   
   * 12cf06d732847bf9ca925bf2bb4e2e0eb39b8855 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23205)
 
   
   

Re: [PR] [HUDI-7378] Fix Spark SQL DML with custom key generator [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10615:
URL: https://github.com/apache/hudi/pull/10615#issuecomment-2050776752

   
   ## CI report:
   
   * 989ffd5220e4f5ae666a05afdd0e7de3c6543972 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23202)
 
   
   

[jira] [Updated] (HUDI-7606) Ensure that rdds persisted by table services are released in SparkRDDWriteClient

2024-04-11 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7606:
-
Labels: pull-request-available  (was: )

> Ensure that rdds persisted by table services are released in 
> SparkRDDWriteClient
> 
>
> Key: HUDI-7606
> URL: https://issues.apache.org/jira/browse/HUDI-7606
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Rajesh Mahindra
>Assignee: Rajesh Mahindra
>Priority: Major
>  Labels: pull-request-available
>
> Ensure that rdds persisted by table services are released in 
> SparkRDDWriteClient since the RDDs are currently released prior to the table 
> services.




Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #11001:
URL: https://github.com/apache/hudi/pull/11001#issuecomment-2050771916

   
   ## CI report:
   
   * 09e4971db9ad7d5677a5757ed0b718e24ca4fb0b UNKNOWN
   
   

Re: [PR] [HUDI-7606] Unpersist RDDs after table services, mainly compaction [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #11000:
URL: https://github.com/apache/hudi/pull/11000#issuecomment-2050771888

   
   ## CI report:
   
   * 12cf06d732847bf9ca925bf2bb4e2e0eb39b8855 UNKNOWN
   
   

Re: [PR] [MINOR] Streamer test setup performance [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10806:
URL: https://github.com/apache/hudi/pull/10806#issuecomment-2050771563

   
   ## CI report:
   
   * e0414708ebbd734156c0383cb4e5dbfe5ff4151a UNKNOWN
   * 10f0484ea6b5b820c257711dc8cd4da9cfa366cd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22764)
 
   * 11c19fa8fd39ed058a4e3487c99c793610b61564 UNKNOWN
   
   

Re: [I] RLI Spark Hudi Error occurs when executing map [hudi]

2024-04-11 Thread via GitHub



jayakasadev commented on issue #10609:
URL: https://github.com/apache/hudi/issues/10609#issuecomment-2050767899

   I hit the same error when I try to use record indexing:
   ```
   hoodie.metadata.record.index.enable=true
   hoodie.index.type=RECORD_INDEX
   ```
   Are there additional configs/jars that are needed?



Re: [PR] [HUDI-7605] allow merger strategy to be set in spark sql writer [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10999:
URL: https://github.com/apache/hudi/pull/10999#issuecomment-2050765903

   
   ## CI report:
   
   * 15e59507262bb635269fc03c820b518558eb267a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23201)
 
   
   

Re: [PR] [MINOR] Hudi CLI 'version' command output empty string [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10973:
URL: https://github.com/apache/hudi/pull/10973#issuecomment-2050765806

   
   ## CI report:
   
   *  Unknown: [CANCELED](TBD) 
   * e093cc8dec1a4aab10e29aad164569dbfd3a1667 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23203)
 
   * fac97b368a646aeddcc7e6728d7228f75f30bd82 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23204)
 
   
   

Re: [PR] [HUDI-7576] add partitionPath as an instance variable to HoodieBaseFile and HoodieLogFile [hudi]

2024-04-11 Thread via GitHub



the-other-tim-brown closed pull request #10975: [HUDI-7576] add partitionPath 
as an instance variable to HoodieBaseFile and HoodieLogFile
URL: https://github.com/apache/hudi/pull/10975



Re: [PR] [HUDI-7576] add partitionPath as an instance variable to HoodieBaseFile and HoodieLogFile [hudi]

2024-04-11 Thread via GitHub



the-other-tim-brown commented on PR #10975:
URL: https://github.com/apache/hudi/pull/10975#issuecomment-2050759641

   @danny0405 I've made this PR instead to get the same performance win without 
taking on the big refactor: https://github.com/apache/hudi/pull/11001



[PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]

2024-04-11 Thread via GitHub



the-other-tim-brown opened a new pull request, #11001:
URL: https://github.com/apache/hudi/pull/11001

   ### Change Logs
   
   - Improve the efficiency of `getRelativePartitionPath` by reducing the 
number of operations on the path object that are required to get the final 
result
   - Reduce the number of times a partition path is computed by supplying a 
partition path argument where possible in `AbstractTableFileSystemView`
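
The two points above can be illustrated with a small, self-contained Java sketch. It is not the code from this PR and does not use Hudi's `getRelativePartitionPath`; the class and method names (`RelativePartitionPathSketch`, `relativePartitionPath`, `groupFilesByPartition`) are hypothetical, and paths are assumed to be absolute, `/`-separated strings. It shows one way to derive a relative partition path with plain string operations and to compute it once per directory rather than once per file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class RelativePartitionPathSketch {
    // Hypothetical helper: strips the base path prefix with string operations
    // instead of repeatedly walking parent path objects.
    static String relativePartitionPath(String basePath, String partitionDir) {
        String base = basePath.endsWith("/")
            ? basePath.substring(0, basePath.length() - 1)
            : basePath;
        if (partitionDir.equals(base)) {
            return ""; // non-partitioned table: files sit directly under the base path
        }
        if (!partitionDir.startsWith(base + "/")) {
            throw new IllegalArgumentException(partitionDir + " is not under " + base);
        }
        return partitionDir.substring(base.length() + 1);
    }

    // Computes the partition path once per directory and reuses the result
    // for every file found in that directory.
    static Map<String, List<String>> groupFilesByPartition(String basePath, List<String> filePaths) {
        Map<String, String> partitionByDir = new HashMap<>();
        Map<String, List<String>> filesByPartition = new TreeMap<>();
        for (String file : filePaths) {
            int slash = file.lastIndexOf('/');
            String dir = file.substring(0, slash);
            String partition =
                partitionByDir.computeIfAbsent(dir, d -> relativePartitionPath(basePath, d));
            filesByPartition
                .computeIfAbsent(partition, p -> new ArrayList<>())
                .add(file.substring(slash + 1));
        }
        return filesByPartition;
    }

    public static void main(String[] args) {
        List<String> files = List.of(
            "/tmp/rides/price=10/a.parquet",
            "/tmp/rides/price=10/b.parquet",
            "/tmp/rides/price=100/c.parquet");
        System.out.println(groupFilesByPartition("/tmp/rides", files));
    }
}
```

With many files per partition, the cache lookup replaces most of the per-file path computation, which is the kind of saving the change log describes for building file system views.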
   
   ### Impact
   
   Reduces overhead of building FSViews with large numbers of files
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   



[jira] [Updated] (HUDI-7576) Avoid recomputing partition path in AbstractFileSystemView

2024-04-11 Thread Timothy Brown (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Brown updated HUDI-7576:

Summary: Avoid recomputing partition path in AbstractFileSystemView  (was: 
Add partitionPath to the HoodieBaseFile and HoodieLogFile objects)

> Avoid recomputing partition path in AbstractFileSystemView
> --
>
> Key: HUDI-7576
> URL: https://issues.apache.org/jira/browse/HUDI-7576
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
>
> Adding this field to the classes will allow us to avoid repeatedly computing 
> the partition path per file in other parts of the code. This can cut down on 
> the CPU overhead associated with creating the FS View.




[jira] [Updated] (HUDI-7576) Avoid recomputing partition path in AbstractFileSystemView

2024-04-11 Thread Timothy Brown (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Brown updated HUDI-7576:

Description: We have observed a non-negligible amount of CPU spent simply 
computing the partition paths of base and log files when building a file system 
view. We should aim to improve the efficiency of these calls and reduce the 
number of them.  (was: Adding this field to the classes will allow us to avoid 
repeatedly computing the partition path per file in other parts of the code. 
This can cut down on the CPU overhead associated with creating the FS View.)

> Avoid recomputing partition path in AbstractFileSystemView
> --
>
> Key: HUDI-7576
> URL: https://issues.apache.org/jira/browse/HUDI-7576
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
>
> We have observed a non-negligible amount of CPU spent simply computing the 
> partition paths of base and log files when building a file system view. We 
> should aim to improve the efficiency of these calls and reduce the number of 
> them.




[jira] [Created] (HUDI-7606) Ensure that rdds persisted by table services are released in SparkRDDWriteClient

2024-04-11 Thread Rajesh Mahindra (Jira)

Rajesh Mahindra created HUDI-7606:
-

 Summary: Ensure that rdds persisted by table services are released 
in SparkRDDWriteClient
 Key: HUDI-7606
 URL: https://issues.apache.org/jira/browse/HUDI-7606
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Rajesh Mahindra


Ensure that rdds persisted by table services are released in 
SparkRDDWriteClient since the RDDs are currently released prior to the table 
services.




[jira] [Assigned] (HUDI-7606) Ensure that rdds persisted by table services are released in SparkRDDWriteClient

2024-04-11 Thread Rajesh Mahindra (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra reassigned HUDI-7606:
-

Assignee: Rajesh Mahindra

> Ensure that rdds persisted by table services are released in 
> SparkRDDWriteClient
> 
>
> Key: HUDI-7606
> URL: https://issues.apache.org/jira/browse/HUDI-7606
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Rajesh Mahindra
>Assignee: Rajesh Mahindra
>Priority: Major
>
> Ensure that rdds persisted by table services are released in 
> SparkRDDWriteClient since the RDDs are currently released prior to the table 
> services.




[PR] [] Unpersist RDDs after table services, mainly compaction [hudi]

2024-04-11 Thread via GitHub



rmahindra123 opened a new pull request, #11000:
URL: https://github.com/apache/hudi/pull/11000

   ### Change Logs
   
   Unpersist RDDs after table services. Currently, `releaseResources` is 
called before inline table services run. As a result, tests show that the RDDs 
persisted by compaction may not be explicitly unpersisted.
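
The ordering issue described above can be sketched in plain Java. This is an illustration only, not the `SparkRDDWriteClient` code: `UnpersistOrderSketch`, `CacheTracker`, and the RDD labels are hypothetical stand-ins for whatever tracks persisted RDDs, and the flag simply switches between releasing before and after the inline table service runs.

```java
import java.util.HashSet;
import java.util.Set;

class UnpersistOrderSketch {
    // Hypothetical stand-in for the bookkeeping of persisted RDDs.
    static final class CacheTracker {
        private final Set<String> persisted = new HashSet<>();

        void persist(String id) {
            persisted.add(id);
        }

        void releaseAll() {
            persisted.clear();
        }

        Set<String> stillPersisted() {
            return new HashSet<>(persisted);
        }
    }

    // Simulates one write followed by an inline table service (compaction)
    // and returns whatever is still persisted at the end.
    static Set<String> run(boolean releaseAfterTableServices) {
        CacheTracker tracker = new CacheTracker();
        tracker.persist("write-rdd"); // persisted by the write itself

        if (!releaseAfterTableServices) {
            tracker.releaseAll(); // release happens before table services
        }

        tracker.persist("compaction-rdd"); // persisted by inline compaction

        if (releaseAfterTableServices) {
            tracker.releaseAll(); // release happens after table services
        }
        return tracker.stillPersisted();
    }

    public static void main(String[] args) {
        System.out.println("release before table services: " + run(false));
        System.out.println("release after table services:  " + run(true));
    }
}
```

In the first ordering the compaction entry is left behind because nothing releases it afterwards; in the second, a single release at the end covers both the write and the table service.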
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   



Re: [PR] [MINOR] Hudi CLI 'version' command output empty string [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10973:
URL: https://github.com/apache/hudi/pull/10973#issuecomment-2050729563

   
   ## CI report:
   
   *  Unknown: [CANCELED](TBD) 
   * e093cc8dec1a4aab10e29aad164569dbfd3a1667 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23203)
 
   * fac97b368a646aeddcc7e6728d7228f75f30bd82 UNKNOWN
   
   

Re: [PR] [MINOR] Hudi CLI 'version' command output empty string [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10973:
URL: https://github.com/apache/hudi/pull/10973#issuecomment-2050723861

   
   ## CI report:
   
   *  Unknown: [CANCELED](TBD) 
   * e093cc8dec1a4aab10e29aad164569dbfd3a1667 UNKNOWN
   
   

Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10954:
URL: https://github.com/apache/hudi/pull/10954#issuecomment-2050723773

   
   ## CI report:
   
   * dbdefad652d5c51b19175ca70374b7737a004952 UNKNOWN
   * f6c5bebf97872d05f27137febbc727d5ad9f8e78 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23200)
 
   
   

Re: [PR] [HUDI-7378] Fix Spark SQL DML with custom key generator [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10615:
URL: https://github.com/apache/hudi/pull/10615#issuecomment-2050723314

   
   ## CI report:
   
   * 50b27846bf118909f3fd69f20cf5d7654d8a87c7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23189)
 
   * 989ffd5220e4f5ae666a05afdd0e7de3c6543972 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23202)
 
   
   

Re: [PR] [MINOR] Hudi CLI 'version' command output empty string [hudi]

2024-04-11 Thread via GitHub



pt657407064 commented on code in PR #10973:
URL: https://github.com/apache/hudi/pull/10973#discussion_r1561855940


##
hudi-cli/src/main/resources/application.yml:
##
@@ -20,4 +20,7 @@ spring:
   shell:
     history:
       enabled: true
-      name: hoodie-cmd.log
\ No newline at end of file
+      name: hoodie-cmd.log
+    command:
+      version:
+        template: "classpath:version.txt"

Review Comment:
   just fixed




Re: [PR] [MINOR] Hudi CLI 'version' command output empty string [hudi]

2024-04-11 Thread via GitHub



pt657407064 commented on PR #10973:
URL: https://github.com/apache/hudi/pull/10973#issuecomment-2050718867

   @hudi-bot run azure



Re: [PR] [HUDI-7378] Fix Spark SQL DML with custom key generator [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10615:
URL: https://github.com/apache/hudi/pull/10615#issuecomment-2050717174

   
   ## CI report:
   
   * 50b27846bf118909f3fd69f20cf5d7654d8a87c7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23189)
 
   * 989ffd5220e4f5ae666a05afdd0e7de3c6543972 UNKNOWN
   
   

Re: [PR] [HUDI-7605] allow merger strategy to be set in spark sql writer [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10999:
URL: https://github.com/apache/hudi/pull/10999#issuecomment-2050680031

   
   ## CI report:
   
   * d392ef9a33b9019a8fadb9c4117cdca48116b48f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23198)
 
   * 15e59507262bb635269fc03c820b518558eb267a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23201)
 
   
   

Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10954:
URL: https://github.com/apache/hudi/pull/10954#issuecomment-2050679866

   
   ## CI report:
   
   * 120226ac7bc6eeb735307745dfa47782a311470b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23199)
 
   * dbdefad652d5c51b19175ca70374b7737a004952 UNKNOWN
   * f6c5bebf97872d05f27137febbc727d5ad9f8e78 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23200)
 
   
   

Re: [PR] [HUDI-7605] allow merger strategy to be set in spark sql writer [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10999:
URL: https://github.com/apache/hudi/pull/10999#issuecomment-2050673612

   
   ## CI report:
   
   * d392ef9a33b9019a8fadb9c4117cdca48116b48f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23198)
 
   * 15e59507262bb635269fc03c820b518558eb267a UNKNOWN
   
   

Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10954:
URL: https://github.com/apache/hudi/pull/10954#issuecomment-2050673466

   
   ## CI report:
   
   * 120226ac7bc6eeb735307745dfa47782a311470b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23199)
 
   * dbdefad652d5c51b19175ca70374b7737a004952 UNKNOWN
   * f6c5bebf97872d05f27137febbc727d5ad9f8e78 UNKNOWN
   
   

Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10954:
URL: https://github.com/apache/hudi/pull/10954#issuecomment-205013

   
   ## CI report:
   
   * 120226ac7bc6eeb735307745dfa47782a311470b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23199)
 
   * dbdefad652d5c51b19175ca70374b7737a004952 UNKNOWN
   
   

Re: [PR] [HUDI-7604] Make table name config work properly [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10998:
URL: https://github.com/apache/hudi/pull/10998#issuecomment-2050666813

   
   ## CI report:
   
   * e7e51394cc39b914503b7e1e3608cdb3ff690a30 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23197)
 
   
   

Re: [PR] [HUDI-7604] Make table name config work properly [hudi]

2024-04-11 Thread via GitHub



jonvex commented on code in PR #10998:
URL: https://github.com/apache/hudi/pull/10998#discussion_r1561821789


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##
@@ -964,6 +964,11 @@ object DataSourceOptionsHelper {
 
   def translateConfigurations(optParams: Map[String, String]): Map[String, 
String] = {
 val translatedOpt = scala.collection.mutable.Map[String, String]() ++= 
optParams
+if (!translatedOpt.contains(HoodieTableConfig.NAME.key()) &&

Review Comment:
   I tried adding that, but it replaces the existing configs with the new ones, 
so other things were failing when I used it instead.




Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10954:
URL: https://github.com/apache/hudi/pull/10954#issuecomment-2050619920

   
   ## CI report:
   
   * 815b6fd6af5676590079cf6f9e23b7a2fdb4ccd8 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23192)
 
   * 120226ac7bc6eeb735307745dfa47782a311470b UNKNOWN
   
   

Re: [PR] [HUDI-7604] Make table name config work properly [hudi]

2024-04-11 Thread via GitHub



yihua commented on code in PR #10998:
URL: https://github.com/apache/hudi/pull/10998#discussion_r1561742623


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##
@@ -964,6 +964,11 @@ object DataSourceOptionsHelper {
 
   def translateConfigurations(optParams: Map[String, String]): Map[String, 
String] = {
 val translatedOpt = scala.collection.mutable.Map[String, String]() ++= 
optParams
+if (!translatedOpt.contains(HoodieTableConfig.NAME.key()) &&

Review Comment:
   `HoodieWriterUtils` has `sparkDatasourceConfigsToTableConfigsMap` which 
contains `hoodie.datasource.write.table.name` to `hoodie.table.name` mapping.  
Is it not working?
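
   For illustration only, the guarded fill-in discussed in this thread can be 
sketched with plain Scala maps. This is a minimal sketch, not Hudi's 
`translateConfigurations`: the object name and `fillTableName` are 
hypothetical, and it assumes the intent is to copy the write-side table name 
into `hoodie.table.name` only when the latter is absent. The two key strings 
are the ones quoted in the comment above.

```scala
object TableNameTranslationSketch {
  // Key names as quoted in the review comment; everything else is hypothetical.
  val WriteTableNameKey = "hoodie.datasource.write.table.name"
  val TableNameKey = "hoodie.table.name"

  // Copy the write-side table name into the table config key when the table
  // config key is missing. The original key is kept, so nothing is removed.
  def fillTableName(opts: Map[String, String]): Map[String, String] =
    opts.get(WriteTableNameKey) match {
      case Some(name) if !opts.contains(TableNameKey) => opts + (TableNameKey -> name)
      case _ => opts
    }
}
```

   Because the helper only adds a key and never drops one, options that 
already set `hoodie.table.name` pass through unchanged.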




Re: [PR] [HUDI-7604] Make table name config work properly [hudi]

2024-04-11 Thread via GitHub



yihua commented on code in PR #10998:
URL: https://github.com/apache/hudi/pull/10998#discussion_r1561716214


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##
@@ -964,6 +964,11 @@ object DataSourceOptionsHelper {
 
   def translateConfigurations(optParams: Map[String, String]): Map[String, 
String] = {
 val translatedOpt = scala.collection.mutable.Map[String, String]() ++= 
optParams
+if (!translatedOpt.contains(HoodieTableConfig.NAME.key()) &&

Review Comment:
   Should this logic be incorporated into L972?




Re: [PR] [HUDI-7605] allow merger strategy to be set in spark sql writer [hudi]

2024-04-11 Thread via GitHub



yihua commented on code in PR #10999:
URL: https://github.com/apache/hudi/pull/10999#discussion_r1561708412


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala:
##
@@ -1405,4 +1405,24 @@ class TestMORDataSource extends 
HoodieSparkClientTestBase with SparkDatasetMixin
   basePath
 }
   }
+
+  @Test
+  def testMergerStrategySet(): Unit = {
+val (writeOpts, _) = getWriterReaderOpts()
+val input = recordsToStrings(dataGen.generateInserts("000", 1)).asScala
+val inputDf= spark.read.json(spark.sparkContext.parallelize(input, 1))
+val mergerStrategyName = "asfdasf"

Review Comment:
   make a more readable name here?



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -309,6 +310,7 @@ class HoodieSparkSqlWriterInternal {
   .setPartitionMetafileUseBaseFormat(useBaseFormatMetaFile)
   
.setShouldDropPartitionColumns(hoodieConfig.getBooleanOrDefault(HoodieTableConfig.DROP_PARTITION_COLUMNS))
   .setCommitTimezone(timelineTimeZone)
+  .setRecordMergerStrategy(recordMergerStrategy)

Review Comment:
   inline the 
`hoodieConfig.getStringOrDefault(DataSourceWriteOptions.RECORD_MERGER_STRATEGY)`?



##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala:
##
@@ -1405,4 +1405,24 @@ class TestMORDataSource extends 
HoodieSparkClientTestBase with SparkDatasetMixin
   basePath
 }
   }
+
+  @Test
+  def testMergerStrategySet(): Unit = {
+val (writeOpts, _) = getWriterReaderOpts()
+val input = recordsToStrings(dataGen.generateInserts("000", 1)).asScala
+val inputDf= spark.read.json(spark.sparkContext.parallelize(input, 1))
+val mergerStrategyName = "asfdasf"
+inputDf.write.format("org.apache.hudi")

Review Comment:
   ```suggestion
   inputDf.write.format("hudi")
   ```




Re: [PR] [HUDI-7605] allow merger strategy to be set in spark sql writer [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10999:
URL: https://github.com/apache/hudi/pull/10999#issuecomment-2050521034

   
   ## CI report:
   
   * d392ef9a33b9019a8fadb9c4117cdca48116b48f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23198)
 
   
   

Re: [PR] [HUDI-7604] Make table name config work properly [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10998:
URL: https://github.com/apache/hudi/pull/10998#issuecomment-2050520984

   
   ## CI report:
   
   * ea501ae87f61ff965f558360bb703bfad595c2a0 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23196)
 
   * e7e51394cc39b914503b7e1e3608cdb3ff690a30 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23197)
 
   
   

Re: [PR] [HUDI-7605] allow merger strategy to be set in spark sql writer [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10999:
URL: https://github.com/apache/hudi/pull/10999#issuecomment-2050508350

   
   ## CI report:
   
   * d392ef9a33b9019a8fadb9c4117cdca48116b48f UNKNOWN
   
   

Re: [PR] [HUDI-7604] Make table name config work properly [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10998:
URL: https://github.com/apache/hudi/pull/10998#issuecomment-2050508252

   
   ## CI report:
   
   * ea501ae87f61ff965f558360bb703bfad595c2a0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23196)
 
   * e7e51394cc39b914503b7e1e3608cdb3ff690a30 UNKNOWN
   
   

Re: [PR] [HUDI-7604] Make table name config work properly [hudi]

2024-04-11 Thread via GitHub



nsivabalan commented on code in PR #10998:
URL: https://github.com/apache/hudi/pull/10998#discussion_r1561644552


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##
@@ -964,6 +964,11 @@ object DataSourceOptionsHelper {
 
   def translateConfigurations(optParams: Map[String, String]): Map[String, 
String] = {
 val translatedOpt = scala.collection.mutable.Map[String, String]() ++= 
optParams
+if (!translatedOpt.contains(HoodieTableConfig.HOODIE_TABLE_NAME_KEY) &&
+  translatedOpt.contains(HoodieTableConfig.HOODIE_WRITE_TABLE_NAME_KEY)) {

Review Comment:
   Can we use `DataSourceWriteOptions.TABLE_NAME.key()` instead of 
`HoodieTableConfig.HOODIE_WRITE_TABLE_NAME_KEY`?




Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]

2024-04-11 Thread via GitHub



yihua commented on code in PR #10954:
URL: https://github.com/apache/hudi/pull/10954#discussion_r1561624097


##
hudi-spark-datasource/hudi-spark3.2.x/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark32HoodieParquetReader.scala:
##
@@ -0,0 +1,267 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.mapred.FileSplit
+import org.apache.hadoop.mapreduce._
+import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+import org.apache.parquet.filter2.compat.FilterCompat
+import org.apache.parquet.filter2.predicate.FilterApi
+import 
org.apache.parquet.format.converter.ParquetMetadataConverter.SKIP_ROW_GROUPS
+import org.apache.parquet.hadoop._
+import org.apache.spark.TaskContext
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.JoinedRow
+import 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.execution.datasources._
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources._
+import org.apache.spark.sql.types._
+
+import java.net.URI
+
+class Spark32HoodieParquetReader(enableVectorizedReader: Boolean,

Review Comment:
   ```suggestion
   class Spark32ParquetReader(enableVectorizedReader: Boolean,
   ```



##
hudi-spark-datasource/hudi-spark3.4.x/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark34HoodieParquetReader.scala:
##
@@ -0,0 +1,277 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.mapred.FileSplit
+import org.apache.hadoop.mapreduce._
+import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+import org.apache.parquet.filter2.compat.FilterCompat
+import org.apache.parquet.filter2.predicate.FilterApi
+import 
org.apache.parquet.format.converter.ParquetMetadataConverter.SKIP_ROW_GROUPS
+import org.apache.parquet.hadoop._
+import org.apache.spark.TaskContext
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.JoinedRow
+import 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.execution.datasources._
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources._
+import org.apache.spark.sql.types._
+
+class Spark34HoodieParquetReader(enableVectorizedReader: Boolean,

Review Comment:
   ```suggestion
   class Spark34ParquetReader(enableVectorizedReader: Boolean,
   ```



##
hudi-spark-datasource/hudi-spark3.3.x/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark33HoodieParquetReader.scala:
##
@@ -0,0 +1,268 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  Y

[jira] [Updated] (HUDI-7605) Unable to set merger strategy with DataSourceWriteOptions.RECORD_MERGER_STRATEGY

2024-04-11 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7605:
-
Labels: pull-request-available  (was: )

> Unable to set merger strategy with 
> DataSourceWriteOptions.RECORD_MERGER_STRATEGY
> 
>
> Key: HUDI-7605
> URL: https://issues.apache.org/jira/browse/HUDI-7605
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, spark-sql
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> DataSourceWriteOptions.RECORD_MERGER_STRATEGY.key() should change the 
> strategy set in the tableconfigs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[PR] [HUDI-7605] allow merger strategy to be set in spark sql writer [hudi]

2024-04-11 Thread via GitHub



jonvex opened a new pull request, #10999:
URL: https://github.com/apache/hudi/pull/10999

   ### Change Logs
   
   DataSourceWriteOptions.RECORD_MERGER_STRATEGY.key() should change the 
strategy set in the table configs but currently does not
   
   ### Impact
   
   Makes the `DataSourceWriteOptions.RECORD_MERGER_STRATEGY` config take effect.
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   



[jira] [Created] (HUDI-7605) Unable to set merger strategy with DataSourceWriteOptions.RECORD_MERGER_STRATEGY

2024-04-11 Thread Jonathan Vexler (Jira)

Jonathan Vexler created HUDI-7605:
-

 Summary: Unable to set merger strategy with 
DataSourceWriteOptions.RECORD_MERGER_STRATEGY
 Key: HUDI-7605
 URL: https://issues.apache.org/jira/browse/HUDI-7605
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark, spark-sql
Reporter: Jonathan Vexler
Assignee: Jonathan Vexler


DataSourceWriteOptions.RECORD_MERGER_STRATEGY.key() should change the strategy 
set in the tableconfigs




Re: [PR] [HUDI-7604] Make table name config work properly [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10998:
URL: https://github.com/apache/hudi/pull/10998#issuecomment-2050413788

   
   ## CI report:
   
   * ea501ae87f61ff965f558360bb703bfad595c2a0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23196)
 
   
   

Re: [PR] [HUDI-7604] Make table name config work properly [hudi]

2024-04-11 Thread via GitHub



hudi-bot commented on PR #10998:
URL: https://github.com/apache/hudi/pull/10998#issuecomment-2050403305

   
   ## CI report:
   
   * ea501ae87f61ff965f558360bb703bfad595c2a0 UNKNOWN
   
   

Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]

2024-04-11 Thread via GitHub



yihua commented on code in PR #10954:
URL: https://github.com/apache/hudi/pull/10954#discussion_r1561538422


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/SparkHoodieParquetReader.scala:
##
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.StructType
+
+trait SparkHoodieParquetReader extends Serializable {
+
+  /**
+   * Read an individual parquet file
+   *
+   * @param file            parquet file to read
+   * @param requiredSchema  desired output schema of the data
+   * @param partitionSchema schema of the partition columns. Partition values 
will be appended to the end of every row
+   * @param filters filters for data skipping. Not guaranteed to be 
used; the spark plan will also apply the filters.
+   * @param sharedConf  the hadoop conf
+   * @return iterator of rows read from the file output type says 
[[InternalRow]] but could be [[ColumnarBatch]]
+   */
+  def read(file: PartitionedFile,

Review Comment:
   I was thinking that `SparkHoodieParquetReader.read` can be unit-tested by 
passing in parameters and validating the output iterator of `InternalRow`s. 
For now, the functional test serves a similar purpose.



##
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/SparkHoodieParquetReader.scala:
##
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.datasources.PartitionedFile
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.StructType
+
+trait SparkHoodieParquetReader extends Serializable {
+
+  /**
+   * Read an individual parquet file
+   *
+   * @param file            parquet file to read
+   * @param requiredSchema  desired output schema of the data
+   * @param partitionSchema schema of the partition columns. Partition values 
will be appended to the end of every row
+   * @param filters filters for data skipping. Not guaranteed to be 
used; the spark plan will also apply the filters.
+   * @param sharedConf  the hadoop conf
+   * @return iterator of rows read from the file output type says 
[[InternalRow]] but could be [[ColumnarBatch]]
+   */
+  def read(file: PartitionedFile,
+   requiredSchema: StructType,
+   partitionSchema: StructType,
+   filters: Seq[Filter],
+   sharedConf: Configuration): Iterator[InternalRow]
+

Review Comment:
   nit: remove empty line



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/SparkHoodieParquetReaderBase.scala:
##
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache L
