[GitHub] [hudi] hudi-bot commented on pull request #8978: [HUDI-6315] Optimize DELETE codepath to use meta fields instead of key generation and index lookup

2023-07-03 Thread via GitHub


hudi-bot commented on PR #8978:
URL: https://github.com/apache/hudi/pull/8978#issuecomment-1619538682

   
   ## CI report:
   
   * ded032d61349861050c47d7fe71a8f15db5bcdbe Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18283)
   * fe5480c9290e997188e41d668f6d9d2b16d2a99d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18292)

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9112: [HUDI-6465] Fix data skipping support BIGINT

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9112:
URL: https://github.com/apache/hudi/pull/9112#issuecomment-1619530186

   
   ## CI report:
   
   * 45bcbc09dacb95a4f7e2c66fba71dd29e13c620d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18261) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18290)
   * a797f4fd7a5b5e38f85beede5adf896767c6264a UNKNOWN

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9110: [MINOR] Test table cleanup

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9110:
URL: https://github.com/apache/hudi/pull/9110#issuecomment-1619530106

   
   ## CI report:
   
   * 8014062978c512d904bc4d51298907f1ecdabd8d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18255)
   * 7e0c71cd2b1ba4f24427b87c907bfd28e9596e93 UNKNOWN

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8978: [HUDI-6315] Optimize DELETE codepath to use meta fields instead of key generation and index lookup

2023-07-03 Thread via GitHub


hudi-bot commented on PR #8978:
URL: https://github.com/apache/hudi/pull/8978#issuecomment-1619529695

   
   ## CI report:
   
   * ded032d61349861050c47d7fe71a8f15db5bcdbe Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18283)
   * fe5480c9290e997188e41d668f6d9d2b16d2a99d UNKNOWN

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [HUDI-6462] Add Hudi client init callback interface (#9108)

2023-07-03 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 4bba6af0fa1 [HUDI-6462] Add Hudi client init callback interface (#9108)
4bba6af0fa1 is described below

commit 4bba6af0fa104ad8eef0ecd62e0aedf67bbe33a4
Author: Y Ethan Guo 
AuthorDate: Mon Jul 3 22:18:37 2023 -0700

[HUDI-6462] Add Hudi client init callback interface (#9108)

This PR adds the interface for Hudi client init callback to run custom 
logic at the time of initialization of a Hudi client:

@PublicAPIClass(maturity = ApiMaturityLevel.EVOLVING)
public interface HoodieClientInitCallback {
  /**
   * A callback method in which the user can implement custom logic.
   * This method is called when a {@link BaseHoodieClient} is initialized.
   *
   * @param hoodieClient {@link BaseHoodieClient} instance.
   */
  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
  void call(BaseHoodieClient hoodieClient);
}
At the time of instantiation of the write or table service client, a user
may want to do additional processing, such as sending metrics, logs, or
notifications, or adding more properties to the write config. The implementation
of the client init callback interface allows such logic to be plugged into Hudi.

A new config, hoodie.client.init.callback.classes, is added for plugging in 
the callback implementation. The class list is comma-separated.
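
As an illustration, here is a minimal sketch of a callback implementation (the
class and package below are hypothetical; the interface and config key are the
ones added by this commit):

package com.example;  // hypothetical package

import org.apache.hudi.callback.HoodieClientInitCallback;
import org.apache.hudi.client.BaseHoodieClient;

// Hypothetical example: log every Hudi client initialization.
public class LoggingInitCallback implements HoodieClientInitCallback {
  @Override
  public void call(BaseHoodieClient hoodieClient) {
    // Custom logic goes here; the initialized client is passed in.
    System.out.println("Hudi client initialized: " + hoodieClient.getClass().getSimpleName());
  }
}

It would then be registered by setting
hoodie.client.init.callback.classes=com.example.LoggingInitCallback
(multiple implementations separated by commas).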
---
 .../hudi/callback/HoodieClientInitCallback.java|  40 
 .../org/apache/hudi/client/BaseHoodieClient.java   |  21 ++
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  31 ++-
 .../callback/TestHoodieClientInitCallback.java | 234 +
 4 files changed, 320 insertions(+), 6 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/callback/HoodieClientInitCallback.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/callback/HoodieClientInitCallback.java
new file mode 100644
index 000..a86eded75e5
--- /dev/null
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/callback/HoodieClientInitCallback.java
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.callback;
+
+import org.apache.hudi.ApiMaturityLevel;
+import org.apache.hudi.PublicAPIClass;
+import org.apache.hudi.PublicAPIMethod;
+import org.apache.hudi.client.BaseHoodieClient;
+
+/**
+ * A callback interface to run custom logic at the time of initialization of the Hudi client.
+ */
+@PublicAPIClass(maturity = ApiMaturityLevel.EVOLVING)
+public interface HoodieClientInitCallback {
+  /**
+   * A callback method in which the user can implement custom logic.
+   * This method is called when a {@link BaseHoodieClient} is initialized.
+   *
+   * @param hoodieClient {@link BaseHoodieClient} instance.
+   */
+  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
+  void call(BaseHoodieClient hoodieClient);
+}
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieClient.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieClient.java
index e01ffb20719..26b10c1c1bf 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieClient.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieClient.java
@@ -18,6 +18,7 @@
 
 package org.apache.hudi.client;
 
+import org.apache.hudi.callback.HoodieClientInitCallback;
 import org.apache.hudi.client.embedded.EmbeddedTimelineServerHelper;
 import org.apache.hudi.client.embedded.EmbeddedTimelineService;
 import org.apache.hudi.client.heartbeat.HoodieHeartbeatClient;
@@ -30,8 +31,11 @@ import org.apache.hudi.common.model.HoodieWriteStat;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.common.table.timeline.versioning.TimelineLayoutVersion;
 import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import 

[GitHub] [hudi] nsivabalan merged pull request #9108: [HUDI-6462] Add Hudi client init callback interface

2023-07-03 Thread via GitHub


nsivabalan merged PR #9108:
URL: https://github.com/apache/hudi/pull/9108


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] amrishlal commented on a diff in pull request #8978: [HUDI-6315] Optimize DELETE codepath to use meta fields instead of key generation and index lookup

2023-07-03 Thread via GitHub


amrishlal commented on code in PR #8978:
URL: https://github.com/apache/hudi/pull/8978#discussion_r1251488891


##
hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java:
##
@@ -227,9 +231,21 @@ public static HoodieWriteResult doWriteOperation(SparkRDDWriteClient client, Jav
     }
   }
 
-  public static HoodieWriteResult doDeleteOperation(SparkRDDWriteClient client, JavaRDD<HoodieKey> hoodieKeys,
-      String instantTime) {
-    return new HoodieWriteResult(client.delete(hoodieKeys, instantTime));
+  public static HoodieWriteResult doDeleteOperation(SparkRDDWriteClient client, JavaRDD<Tuple2<HoodieKey, Option<HoodieRecordLocation>>> hoodieKeysAndLocations,
+      String instantTime, boolean isPrepped) {
+
+    if (isPrepped) {
+      JavaRDD<HoodieRecord> records = hoodieKeysAndLocations.map(tuple -> {
+        HoodieRecord record = client.getConfig().getRecordMerger().getRecordType() == HoodieRecord.HoodieRecordType.AVRO

Review Comment:
   Fixed. Moved client outside of the `map` function.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Alowator commented on pull request #9112: [HUDI-6465] Fix data skipping support BIGINT

2023-07-03 Thread via GitHub


Alowator commented on PR #9112:
URL: https://github.com/apache/hudi/pull/9112#issuecomment-1619488311

   Rebased on latest master after #9114, which includes the pipeline fixes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #8978: [HUDI-6315] Optimize DELETE codepath to use meta fields instead of key generation and index lookup

2023-07-03 Thread via GitHub


nsivabalan commented on code in PR #8978:
URL: https://github.com/apache/hudi/pull/8978#discussion_r1251481229


##
hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java:
##
@@ -227,9 +231,21 @@ public static HoodieWriteResult doWriteOperation(SparkRDDWriteClient client, Jav
     }
   }
 
-  public static HoodieWriteResult doDeleteOperation(SparkRDDWriteClient client, JavaRDD<HoodieKey> hoodieKeys,
-      String instantTime) {
-    return new HoodieWriteResult(client.delete(hoodieKeys, instantTime));
+  public static HoodieWriteResult doDeleteOperation(SparkRDDWriteClient client, JavaRDD<Tuple2<HoodieKey, Option<HoodieRecordLocation>>> hoodieKeysAndLocations,
+      String instantTime, boolean isPrepped) {
+
+    if (isPrepped) {
+      JavaRDD<HoodieRecord> records = hoodieKeysAndLocations.map(tuple -> {
+        HoodieRecord record = client.getConfig().getRecordMerger().getRecordType() == HoodieRecord.HoodieRecordType.AVRO

Review Comment:
   Can you declare the record type as a variable outside and then just use that variable?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #8978: [HUDI-6315] Optimize DELETE codepath to use meta fields instead of key generation and index lookup

2023-07-03 Thread via GitHub


nsivabalan commented on code in PR #8978:
URL: https://github.com/apache/hudi/pull/8978#discussion_r1251480969


##
hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java:
##
@@ -227,9 +231,21 @@ public static HoodieWriteResult doWriteOperation(SparkRDDWriteClient client, Jav
     }
   }
 
-  public static HoodieWriteResult doDeleteOperation(SparkRDDWriteClient client, JavaRDD<HoodieKey> hoodieKeys,
-      String instantTime) {
-    return new HoodieWriteResult(client.delete(hoodieKeys, instantTime));
+  public static HoodieWriteResult doDeleteOperation(SparkRDDWriteClient client, JavaRDD<Tuple2<HoodieKey, Option<HoodieRecordLocation>>> hoodieKeysAndLocations,
+      String instantTime, boolean isPrepped) {
+
+    if (isPrepped) {
+      JavaRDD<HoodieRecord> records = hoodieKeysAndLocations.map(tuple -> {
+        HoodieRecord record = client.getConfig().getRecordMerger().getRecordType() == HoodieRecord.HoodieRecordType.AVRO

Review Comment:
   "client" object is in driver. since here we are accessing it in the 
executor, we might get NotSerializableException.
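
   For reference, a minimal sketch of the suggested fix (variable and helper names are illustrative, not from the PR): resolve the record type on the driver once, so the lambda captures only a serializable enum value instead of the whole client.

   // Record type is read on the driver, outside the map(), so the lambda
   // captures a serializable enum value rather than the non-serializable client.
   HoodieRecord.HoodieRecordType recordType = client.getConfig().getRecordMerger().getRecordType();
   JavaRDD<HoodieRecord> records = hoodieKeysAndLocations.map(tuple ->
       createRecordFromTuple(tuple, recordType == HoodieRecord.HoodieRecordType.AVRO));  // hypothetical helper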



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9113: [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9113:
URL: https://github.com/apache/hudi/pull/9113#issuecomment-1619481354

   
   ## CI report:
   
   * c6127a02ea8c3f4e8819559dbb4efa9f64bc040f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18262) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18288)
   * ea59dc594ede90284a8cfd1fba331b76b2a72e7d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18289)

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9112: [HUDI-6465] Fix data skipping support BIGINT

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9112:
URL: https://github.com/apache/hudi/pull/9112#issuecomment-1619481308

   
   ## CI report:
   
   * 45bcbc09dacb95a4f7e2c66fba71dd29e13c620d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18261) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18290)

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Alowator commented on pull request #9112: [HUDI-6465] Fix data skipping support BIGINT

2023-07-03 Thread via GitHub


Alowator commented on PR #9112:
URL: https://github.com/apache/hudi/pull/9112#issuecomment-1619477901

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9113: [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9113:
URL: https://github.com/apache/hudi/pull/9113#issuecomment-1619472786

   
   ## CI report:
   
   * c6127a02ea8c3f4e8819559dbb4efa9f64bc040f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18262) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18288)
   * ea59dc594ede90284a8cfd1fba331b76b2a72e7d UNKNOWN

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] ad1happy2go commented on issue #8964: test automatic compression failed, no parquet file generated

2023-07-03 Thread via GitHub


ad1happy2go commented on issue #8964:
URL: https://github.com/apache/hudi/issues/8964#issuecomment-1619466884

   @ZhangxuezhenUCAS While packaging the jar manually, you can also shade the parquet jars, the same way the other jars are shaded. Let us know what information or help you need to get it working.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9113: [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9113:
URL: https://github.com/apache/hudi/pull/9113#issuecomment-1619466691

   
   ## CI report:
   
   * c6127a02ea8c3f4e8819559dbb4efa9f64bc040f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18262) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18288)

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9087: [HUDI-6329] Adjust the partitioner automatically for flink consistent hashing index

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9087:
URL: https://github.com/apache/hudi/pull/9087#issuecomment-1619466599

   
   ## CI report:
   
   * 47e33acb50156f19b8c35eaca775f5f2ba8c5ead UNKNOWN
   * cc9a8531d606735d9b31b9f9f6a56f606d45b6f1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18258)
   * ab798a565d25843c811df37ac2460dd9111700cd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18287)

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8978: [HUDI-6315] Optimize DELETE codepath to use meta fields instead of key generation and index lookup

2023-07-03 Thread via GitHub


hudi-bot commented on PR #8978:
URL: https://github.com/apache/hudi/pull/8978#issuecomment-1619466400

   
   ## CI report:
   
   * ded032d61349861050c47d7fe71a8f15db5bcdbe Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18283)

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #9117: [HUDI-6437] Fixing/optimizing record updates to RLI

2023-07-03 Thread via GitHub


nsivabalan commented on code in PR #9117:
URL: https://github.com/apache/hudi/pull/9117#discussion_r1251468123


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1161,25 +1165,36 @@ protected boolean validateTimelineBeforeSchedulingCompaction(Option inFl
    * @param writeStatuses {@code WriteStatus} from the write operation
    */
   private HoodieData<HoodieRecord> getRecordIndexUpdates(HoodieData<WriteStatus> writeStatuses) {
-    // 1. List
-    // 2. Reduce by key: accept keys only when new location is not
-    return writeStatuses.map(writeStatus -> writeStatus.getWrittenRecordDelegates().stream()
-        .map(recordDelegate -> Pair.of(recordDelegate.getRecordKey(), recordDelegate)))
-        .flatMapToPair(Stream::iterator)
-        .reduceByKey((recordDelegate1, recordDelegate2) -> {
-          if (recordDelegate1.getRecordKey().equals(recordDelegate2.getRecordKey())) {
-            if (recordDelegate1.getNewLocation().isPresent() && recordDelegate1.getNewLocation().get().getFileId() != null) {
-              return recordDelegate1;
-            } else if (recordDelegate2.getNewLocation().isPresent() && recordDelegate2.getNewLocation().get().getFileId() != null) {
-              return recordDelegate2;
+    HoodiePairData<String, HoodieRecordDelegate> recordKeyDelegatePairs = null;
+    // if update partition path is true, chances are that we might get two records (1 delete in older partition and 1 insert to new partition)
+    // and hence we might have to do reduce by key before ingesting to the RLI partition.
+    if (dataWriteConfig.getRecordIndexUpdatePartitionPath()) {
+      recordKeyDelegatePairs = writeStatuses.map(writeStatus -> writeStatus.getWrittenRecordDelegates().stream()
+          .map(recordDelegate -> Pair.of(recordDelegate.getRecordKey(), recordDelegate)))
+          .flatMapToPair(Stream::iterator)
+          .reduceByKey((recordDelegate1, recordDelegate2) -> {
+            if (recordDelegate1.getRecordKey().equals(recordDelegate2.getRecordKey())) {
+              if (recordDelegate1.getNewLocation().isPresent() && recordDelegate2.getNewLocation().isPresent()) {
+                throw new HoodieIOException("Both versions of the record have a new location set. Record V1 " + recordDelegate1.toString()
+                    + ", Record V2 " + recordDelegate2.toString());
+              }
+              if (recordDelegate1.getNewLocation().isPresent()) {
+                return recordDelegate1;
+              } else {
+                // if record delegate 1 does not have location set, record delegate 2 should have location set.
+                return recordDelegate2;
+              }
             } else {
-              // should not come here, one of the above must have a new location set
-              return null;
             }
-          } else {
-            return recordDelegate1;
-          }
-        }, 1)
+          }, Math.max(1, writeStatuses.getNumPartitions()));
+    } else {

Review Comment:
   If writeStatuses is empty, getNumPartitions() can be 0; that's why.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] ad1happy2go commented on issue #9022: [SUPPORT] Kind of corrupted MDT

2023-07-03 Thread via GitHub


ad1happy2go commented on issue #9022:
URL: https://github.com/apache/hudi/issues/9022#issuecomment-1619464788

   @parisni I confirmed that we are not facing this error while reading the metadata table with the version you specified. Closing out the issue. Please reopen in case you face this issue again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] flashJd commented on pull request #9113: [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition

2023-07-03 Thread via GitHub


flashJd commented on PR #9113:
URL: https://github.com/apache/hudi/pull/9113#issuecomment-1619439352

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9087: [HUDI-6329] Adjust the partitioner automatically for flink consistent hashing index

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9087:
URL: https://github.com/apache/hudi/pull/9087#issuecomment-1619430671

   
   ## CI report:
   
   * 47e33acb50156f19b8c35eaca775f5f2ba8c5ead UNKNOWN
   * cc9a8531d606735d9b31b9f9f6a56f606d45b6f1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18258)
   * ab798a565d25843c811df37ac2460dd9111700cd UNKNOWN

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] flashJd commented on pull request #9113: [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition

2023-07-03 Thread via GitHub


flashJd commented on PR #9113:
URL: https://github.com/apache/hudi/pull/9113#issuecomment-1619426298

   
https://github.com/apache/hudi/blob/5d196fe61757987af29b38e1b5cf38d7ca001924/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L879
   Modifying this piece of code causes many tests to fail due to the delete-old-base-path logic, so just reverting it here and handling it in another PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9108: [HUDI-6462] Add Hudi client init callback interface

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9108:
URL: https://github.com/apache/hudi/pull/9108#issuecomment-1619425450

   
   ## CI report:
   
   * 9b67bc501f473e9e4caf8dccba04dce8a1601f5b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18282)

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] beyond1920 commented on a diff in pull request #9087: [HUDI-6329] Adjust the partitioner automatically for flink consistent hashing index

2023-07-03 Thread via GitHub


beyond1920 commented on code in PR #9087:
URL: https://github.com/apache/hudi/pull/9087#discussion_r1251440287


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/update/strategy/FlinkConsistentBucketUpdateStrategy.java:
##
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.sink.clustering.update.strategy;
+
+import org.apache.hudi.client.HoodieFlinkWriteClient;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieAvroRecord;
+import org.apache.hudi.common.model.HoodieFileGroupId;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.index.bucket.ConsistentBucketIdentifier;
+import org.apache.hudi.table.HoodieFlinkTable;
+import org.apache.hudi.table.action.cluster.strategy.UpdateStrategy;
+import 
org.apache.hudi.table.action.cluster.util.ConsistentHashingUpdateStrategyUtils;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.LinkedHashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * Update strategy for consistent hashing bucket index. If updates to file groups that are under clustering are identified, then the current batch of records will route to both old and new file groups
+ * (i.e., dual write).
+ */
+public class FlinkConsistentBucketUpdateStrategy<T extends HoodieRecordPayload> extends UpdateStrategy<T, List<Pair<List<HoodieRecord>, String>>> {
+
+  private static final Logger LOG = LoggerFactory.getLogger(FlinkConsistentBucketUpdateStrategy.class);
+
+  private boolean initialized = false;
+  private List<String> indexKeyFields;
+  private Map<String, Pair<String, ConsistentBucketIdentifier>> partitionToIdentifier;
+  private String lastRefreshInstant = HoodieTimeline.INIT_INSTANT_TS;
+

Review Comment:
   Sorry to miss that; it is used to check whether there is a new pending clustering request.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8891: [Hudi-6318]The skip merge config for incremental-read ensures consistency in both stream and batch scenarios

2023-07-03 Thread via GitHub


hudi-bot commented on PR #8891:
URL: https://github.com/apache/hudi/pull/8891#issuecomment-1619387699

   
   ## CI report:
   
   * 36b11d3609fdf04192f3910ea024022e22052169 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17628)
   * 2731f65c239577f11386d38d71ee3ff73fa23648 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18286)

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #9113: [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition

2023-07-03 Thread via GitHub


danny0405 commented on PR #9113:
URL: https://github.com/apache/hudi/pull/9113#issuecomment-1619381392

   Seems a huge behavior change; we may not have time to get the fix into release 0.14.0. cc @boneanxs, can you help with the review here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8891: [Hudi-6318]The skip merge config for incremental-read ensures consistency in both stream and batch scenarios

2023-07-03 Thread via GitHub


hudi-bot commented on PR #8891:
URL: https://github.com/apache/hudi/pull/8891#issuecomment-1619380817

   
   ## CI report:
   
   * 36b11d3609fdf04192f3910ea024022e22052169 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17628)
   * 2731f65c239577f11386d38d71ee3ff73fa23648 UNKNOWN

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9116: [HUDI-6470] Add spark sql conf in AlterTableCommand

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9116:
URL: https://github.com/apache/hudi/pull/9116#issuecomment-1619375851

   
   ## CI report:
   
   * 80b8747dfec3e4cac7719796a46e52e9846081f5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18271)
   * c15d29a43ce1fa912ae3bb44b0a46abff33149d5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18285)

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #9115: [HUDI-6469] Revert HUDI-6311

2023-07-03 Thread via GitHub


danny0405 commented on PR #9115:
URL: https://github.com/apache/hudi/pull/9115#issuecomment-1619365436

   Hi @jonvex, can you elaborate a little more on why you want to revert the changes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zhuanshenbsj1 commented on a diff in pull request #8891: [Hudi-6318]The skip merge config for incremental-read ensures consistency in both stream and batch scenarios

2023-07-03 Thread via GitHub


zhuanshenbsj1 commented on code in PR #8891:
URL: https://github.com/apache/hudi/pull/8891#discussion_r1251402811


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java:
##
@@ -396,6 +396,8 @@ private List<MergeOnReadInputSplit> buildInputSplits() {
         .rowType(this.tableRowType)
         .maxCompactionMemoryInBytes(maxCompactionMemoryInBytes)
         .partitionPruner(partitionPruner)
+        .skipCompaction(conf.getBoolean(FlinkOptions.READ_STREAMING_SKIP_COMPACT))

Review Comment:
   > The `read.streaming.skip_compaction` and `read.streaming.skip_clustering` are options for streaming read, not batch read.
   
   We can add a similar batch configuration.
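
   For context, a sketch of how the existing streaming-read skip options are set on the Flink source today; a batch counterpart, as proposed above, would follow the same pattern under a new config key (the code below is illustrative, not from the PR):

   import org.apache.flink.configuration.Configuration;
   import org.apache.hudi.configuration.FlinkOptions;

   Configuration conf = new Configuration();
   // These two options currently only take effect for streaming read.
   conf.setBoolean(FlinkOptions.READ_STREAMING_SKIP_COMPACT, true);     // skip compaction instants
   conf.setBoolean(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true);  // skip clustering instants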



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9117: [HUDI-6437] Fixing/optimizing record updates to RLI

2023-07-03 Thread via GitHub


danny0405 commented on code in PR #9117:
URL: https://github.com/apache/hudi/pull/9117#discussion_r1251402017


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##
@@ -257,6 +257,10 @@ private Option prepareRecord(HoodieRecord hoodieRecord) {
         recordsWritten++;
       } else {
         finalRecordOpt = Option.empty();
+        // Clear the new location as the record was deleted
+        hoodieRecord.unseal();
+        hoodieRecord.clearNewLocation();
+        hoodieRecord.seal();

Review Comment:
   Yeah, we should also update locations for mor logs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9117: [HUDI-6437] Fixing/optimizing record updates to RLI

2023-07-03 Thread via GitHub


danny0405 commented on code in PR #9117:
URL: https://github.com/apache/hudi/pull/9117#discussion_r1251401570


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1161,25 +1165,36 @@ protected boolean validateTimelineBeforeSchedulingCompaction(Option inFl
    * @param writeStatuses {@code WriteStatus} from the write operation
    */
   private HoodieData<HoodieRecord> getRecordIndexUpdates(HoodieData<WriteStatus> writeStatuses) {
-    // 1. List
-    // 2. Reduce by key: accept keys only when new location is not
-    return writeStatuses.map(writeStatus -> writeStatus.getWrittenRecordDelegates().stream()
-        .map(recordDelegate -> Pair.of(recordDelegate.getRecordKey(), recordDelegate)))
-        .flatMapToPair(Stream::iterator)
-        .reduceByKey((recordDelegate1, recordDelegate2) -> {
-          if (recordDelegate1.getRecordKey().equals(recordDelegate2.getRecordKey())) {
-            if (recordDelegate1.getNewLocation().isPresent() && recordDelegate1.getNewLocation().get().getFileId() != null) {
-              return recordDelegate1;
-            } else if (recordDelegate2.getNewLocation().isPresent() && recordDelegate2.getNewLocation().get().getFileId() != null) {
-              return recordDelegate2;
+    HoodiePairData<String, HoodieRecordDelegate> recordKeyDelegatePairs = null;
+    // if update partition path is true, chances are that we might get two records (1 delete in older partition and 1 insert to new partition)
+    // and hence we might have to do reduce by key before ingesting to the RLI partition.
+    if (dataWriteConfig.getRecordIndexUpdatePartitionPath()) {
+      recordKeyDelegatePairs = writeStatuses.map(writeStatus -> writeStatus.getWrittenRecordDelegates().stream()
+          .map(recordDelegate -> Pair.of(recordDelegate.getRecordKey(), recordDelegate)))
+          .flatMapToPair(Stream::iterator)
+          .reduceByKey((recordDelegate1, recordDelegate2) -> {
+            if (recordDelegate1.getRecordKey().equals(recordDelegate2.getRecordKey())) {
+              if (recordDelegate1.getNewLocation().isPresent() && recordDelegate2.getNewLocation().isPresent()) {
+                throw new HoodieIOException("Both versions of the record have a new location set. Record V1 " + recordDelegate1.toString()
+                    + ", Record V2 " + recordDelegate2.toString());
+              }
+              if (recordDelegate1.getNewLocation().isPresent()) {
+                return recordDelegate1;
+              } else {
+                // if record delegate 1 does not have location set, record delegate 2 should have location set.
+                return recordDelegate2;
+              }
             } else {
-              // should not come here, one of the above must have a new location set
-              return null;
             }
-          } else {
-            return recordDelegate1;
-          }
-        }, 1)
+          }, Math.max(1, writeStatuses.getNumPartitions()));
+    } else {

Review Comment:
   Math.max(1, xxx) does not make sense if xxx is always >= 1; maybe just use `writeStatuses.getNumPartitions()`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #9116: [HUDI-6470] Add spark sql conf in AlterTableCommand

2023-07-03 Thread via GitHub


danny0405 commented on PR #9116:
URL: https://github.com/apache/hudi/pull/9116#issuecomment-1619355101

   It's great if we can add a simple test case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #8610: [HUDI-6156] Prevent leaving tmp file in timeline when multi process t…

2023-07-03 Thread via GitHub


danny0405 commented on PR #8610:
URL: https://github.com/apache/hudi/pull/8610#issuecomment-1619353878

   @hbgstc123 Can you update the PR and resolve the test failures?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on issue #9101: [SUPPORT] Transaction and spark job final state inconsistency in batch processing

2023-07-03 Thread via GitHub


danny0405 commented on issue #9101:
URL: https://github.com/apache/hudi/issues/9101#issuecomment-1619350001

   Inline archiving and cleaning may have this issue; did you try async cleaning instead? Is there any Spark param to control the failover behavior? It seems not very easy to fix from the Hudi side.
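
   For reference, a minimal sketch of enabling async cleaning on a Spark datasource write (the dataset and path below are placeholders; the option keys are standard Hudi write configs):

   // "df" is a placeholder Dataset<Row>; the table name and base path are illustrative.
   df.write().format("hudi")
       .option("hoodie.table.name", "my_table")   // placeholder table name
       .option("hoodie.clean.automatic", "true")  // keep automatic cleaning enabled
       .option("hoodie.clean.async", "true")      // run cleaning asynchronously instead of inline
       .mode("append")
       .save("/tmp/hudi/my_table");               // placeholder base path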


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9116: [HUDI-6470] Add spark sql conf in AlterTableCommand

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9116:
URL: https://github.com/apache/hudi/pull/9116#issuecomment-1619345051

   
   ## CI report:
   
   * 80b8747dfec3e4cac7719796a46e52e9846081f5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18271)
   * c15d29a43ce1fa912ae3bb44b0a46abff33149d5 UNKNOWN

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9114: [HUDI-6467] Fix deletes handling in rli when partition path is updated

2023-07-03 Thread via GitHub


danny0405 commented on code in PR #9114:
URL: https://github.com/apache/hudi/pull/9114#discussion_r1251389097


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1159,10 +1161,27 @@ protected boolean validateTimelineBeforeSchedulingCompaction(Option inFl
    * @param writeStatuses {@code WriteStatus} from the write operation
    */
   private HoodieData<HoodieRecord> getRecordIndexUpdates(HoodieData<WriteStatus> writeStatuses) {
-    return writeStatuses.flatMap(writeStatus -> {
-      List<HoodieRecord> recordList = new LinkedList<>();
-      for (HoodieRecordDelegate recordDelegate : writeStatus.getWrittenRecordDelegates()) {
-        if (!writeStatus.isErrored(recordDelegate.getHoodieKey())) {
+    // 1. List
+    // 2. Reduce by key: accept keys only when new location is not
+    return writeStatuses.map(writeStatus -> writeStatus.getWrittenRecordDelegates().stream()
+        .map(recordDelegate -> Pair.of(recordDelegate.getRecordKey(), recordDelegate)))
+        .flatMapToPair(Stream::iterator)
+        .reduceByKey((recordDelegate1, recordDelegate2) -> {

Review Comment:
   Yeah, wondering whether we have a better algorithm to tackle the deduplication ~



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9117: [HUDI-6437] Fixing/optimizing record updates to RLI

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9117:
URL: https://github.com/apache/hudi/pull/9117#issuecomment-1619333860

   
   ## CI report:
   
   * 06cf7c29c35c4bf718f0d015bdf2cc3382deb068 UNKNOWN
   * 70250eb58fec89adce0e7a0ea1f0ccb03173e79a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18280)

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9083:
URL: https://github.com/apache/hudi/pull/9083#issuecomment-1619333701

   
   ## CI report:
   
   * 3a0bfb88049cf2c0f8afe5c925dbd76fa6f7cd89 UNKNOWN
   * f156c1694aca3a9e2ca4ed26959c6a5a1b773354 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18278)

   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zhuanshenbsj1 commented on pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant

2023-07-03 Thread via GitHub


zhuanshenbsj1 commented on PR #9038:
URL: https://github.com/apache/hudi/pull/9038#issuecomment-1619333115

   > There are many test failures:
   > 
   > 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18249=logs=600e7de6-e133-5e69-e615-50ee129b3c08=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7=16819
   > 
   > 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18249=logs=b1544eb9-7ff1-5db9-0187-3e05abf459bc=e0ae894b-41c9-5f4b-7ed2-bdf5243b02e7=569172
   
   TestCleaner.testKeepLatestCommitsWithPendingCompactions has been fixed; the TestGlobalIndexEnableUpdatePartitions failure does not seem to be caused by my change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9114: [HUDI-6467] Fix deletes handling in rli when partition path is updated

2023-07-03 Thread via GitHub


danny0405 commented on code in PR #9114:
URL: https://github.com/apache/hudi/pull/9114#discussion_r1251380823


##
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestGlobalIndexEnableUpdatePartitions.java:
##
@@ -65,8 +65,8 @@ private static Stream<Arguments> getTableTypeAndIndexType() {
         Arguments.of(COPY_ON_WRITE, GLOBAL_BLOOM),
         Arguments.of(COPY_ON_WRITE, RECORD_INDEX),
         Arguments.of(MERGE_ON_READ, GLOBAL_SIMPLE),
-        Arguments.of(MERGE_ON_READ, GLOBAL_BLOOM),
-        Arguments.of(MERGE_ON_READ, RECORD_INDEX)
+        Arguments.of(MERGE_ON_READ, GLOBAL_BLOOM)
+        // Arguments.of(MERGE_ON_READ, RECORD_INDEX)

Review Comment:
   It passed previously.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-3639) [Incremental] Add Proper Incremental Records Filtering support into Hudi's custom RDD

2023-07-03 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-3639.

Resolution: Fixed

Fixed via master branch: 5d196fe61757987af29b38e1b5cf38d7ca001924

> [Incremental] Add Proper Incremental Records Filtering support into Hudi's custom RDD
> -
>
> Key: HUDI-3639
> URL: https://issues.apache.org/jira/browse/HUDI-3639
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently, Hudi's `MergeOnReadIncrementalRelation` solely relies on 
> `ParquetFileReader` to do record-level filtering of the records that don't 
> belong to a timeline span being queried.
> As a side-effect, Hudi actually have to disable the use of 
> [VectorizedParquetReader|https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-vectorized-parquet-reader.html]
>  (since using one would prevent records from being filtered by the Reader)
>  
> Instead, we should make sure that proper record-level filtering is performed 
> w/in the returned RDD, instead of squarely relying on FileReader to do that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated: [HUDI-3639] Add Proper Incremental Records Filtering support into Hudi's custom RDD (#8668)

2023-07-03 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 5d196fe6175 [HUDI-3639] Add Proper Incremental Records Filtering support into Hudi's custom RDD (#8668)
5d196fe6175 is described below

commit 5d196fe61757987af29b38e1b5cf38d7ca001924
Author: cxzl25 
AuthorDate: Tue Jul 4 09:25:38 2023 +0800

    [HUDI-3639] Add Proper Incremental Records Filtering support into Hudi's custom RDD (#8668)

    * filter operator for incremental RDD
    * remove the hard-coded conf 'spark.sql.parquet.enableVectorizedReader' in relations

    -

    Co-authored-by: Danny Chan 
---
 .../scala/org/apache/hudi/HoodieBaseRelation.scala |  2 -
 .../org/apache/hudi/HoodieMergeOnReadRDD.scala | 39 +-
 .../hudi/MergeOnReadIncrementalRelation.scala  | 28 +++--
 .../apache/hudi/MergeOnReadSnapshotRelation.scala  |  5 ---
 .../functional/TestParquetColumnProjection.scala   | 48 +++---
 5 files changed, 95 insertions(+), 27 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala
index a9ddbfa4503..a67d4463bf5 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala
@@ -467,8 +467,6 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
   def imbueConfigs(sqlContext: SQLContext): Unit = {
     sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.filterPushdown", "true")
     sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.recordLevelFilter.enabled", "true")
-    // TODO(HUDI-3639) vectorized reader has to be disabled to make sure MORIncrementalRelation is working properly
-    sqlContext.sparkSession.sessionState.conf.setConfString("spark.sql.parquet.enableVectorizedReader", "false")
   }
 
   /**
diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala
index d7b60db4929..db538f110c9 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala
@@ -23,6 +23,8 @@ import org.apache.hadoop.mapred.JobConf
 import org.apache.hudi.HoodieBaseRelation.{BaseFileReader, projectReader}
 import org.apache.hudi.HoodieMergeOnReadRDD.CONFIG_INSTANTIATION_LOCK
 import org.apache.hudi.MergeOnReadSnapshotRelation.isProjectionCompatible
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.util.StringUtils
 import org.apache.hudi.exception.HoodieException
 import org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.getMaxCompactionMemoryInBytes
 import org.apache.spark.rdd.RDD
@@ -30,6 +32,7 @@ import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.{Partition, SerializableWritable, SparkContext, TaskContext}
 
 import java.io.Closeable
+import java.util.function.Predicate
 
 case class HoodieMergeOnReadPartition(index: Int, split: HoodieMergeOnReadFileSplit) extends Partition
 
@@ -64,6 +67,9 @@ private[hudi] case class HoodieMergeOnReadBaseFileReaders(fullSchemaReader: Base
  * @param tableState table's state
  * @param mergeType type of merge performed
  * @param fileSplits target file-splits this RDD will be iterating over
+ * @param includeStartTime whether to include the commit with the commitTime
+ * @param startTimestamp start timestamp to filter records
+ * @param endTimestamp end timestamp to filter records
  */
 class HoodieMergeOnReadRDD(@transient sc: SparkContext,
                            @transient config: Configuration,
@@ -72,7 +78,10 @@ class HoodieMergeOnReadRDD(@transient sc: SparkContext,
                            requiredSchema: HoodieTableSchema,
                            tableState: HoodieTableState,
                            mergeType: String,
-                           @transient fileSplits: Seq[HoodieMergeOnReadFileSplit])
+                           @transient fileSplits: Seq[HoodieMergeOnReadFileSplit],
+                           includeStartTime: Boolean = false,
+                           startTimestamp: String = null,
+                           endTimestamp: String = null)
   extends RDD[InternalRow](sc, Nil) with HoodieUnsafeRDD {
 
   protected val maxCompactionMemoryInBytes: Long = getMaxCompactionMemoryInBytes(new JobConf(config))
@@ -116,7 +125,33 @@ class HoodieMergeOnReadRDD(@transient sc: 

[GitHub] [hudi] danny0405 merged pull request #8668: [HUDI-3639] Add Proper Incremental Records Filtering support into Hudi's custom RDD

2023-07-03 Thread via GitHub


danny0405 merged PR #8668:
URL: https://github.com/apache/hudi/pull/8668


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8978: [HUDI-6315] Optimize DELETE codepath to use meta fields instead of key generation and index lookup

2023-07-03 Thread via GitHub


hudi-bot commented on PR #8978:
URL: https://github.com/apache/hudi/pull/8978#issuecomment-1619305334

   
   ## CI report:
   
   * 79fac007f537294c5f8a2e40617296e3c44537dd Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18281)
 
   * ded032d61349861050c47d7fe71a8f15db5bcdbe Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18283)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9108: [HUDI-6462] Add Hudi client init callback interface

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9108:
URL: https://github.com/apache/hudi/pull/9108#issuecomment-1619299464

   
   ## CI report:
   
   * 7f7dae8012b51f42fe9452e1ecc555b524a7dc6b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18276)
 
   * 9b67bc501f473e9e4caf8dccba04dce8a1601f5b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18282)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8978: [HUDI-6315] Optimize DELETE codepath to use meta fields instead of key generation and index lookup

2023-07-03 Thread via GitHub


hudi-bot commented on PR #8978:
URL: https://github.com/apache/hudi/pull/8978#issuecomment-1619299254

   
   ## CI report:
   
   * 7142fb46cef77575be4346fa0a9cf2fb7bee03b1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18272)
 
   * 79fac007f537294c5f8a2e40617296e3c44537dd Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18281)
 
   * ded032d61349861050c47d7fe71a8f15db5bcdbe UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9108: [HUDI-6462] Add Hudi client init callback interface

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9108:
URL: https://github.com/apache/hudi/pull/9108#issuecomment-1619294803

   
   ## CI report:
   
   * 7f7dae8012b51f42fe9452e1ecc555b524a7dc6b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18276)
 
   * 9b67bc501f473e9e4caf8dccba04dce8a1601f5b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-03 Thread via GitHub


jonvex commented on code in PR #9083:
URL: https://github.com/apache/hudi/pull/9083#discussion_r1251342434


##
hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSpark32PlusAnalysis.scala:
##
@@ -128,9 +131,135 @@ case class HoodieSpark32PlusResolveReferences(spark: 
SparkSession) extends Rule[
   catalogTable.location.toString))
 LogicalRelation(relation, catalogTable)
   }
+case mO@MatchMergeIntoTable(targetTableO, sourceTableO, _)
+   // Hudi change: don't want to go to the spark mit resolution so we 
resolve the source and target if they haven't been
+  //
+  if !mO.resolved =>
+  lazy val analyzer = spark.sessionState.analyzer
+  val targetTable = if (targetTableO.resolved) targetTableO else 
analyzer.execute(targetTableO)
+  val sourceTable = if (sourceTableO.resolved) sourceTableO else 
analyzer.execute(sourceTableO)
+  val m = mO.asInstanceOf[MergeIntoTable].copy(targetTable = targetTable, 
sourceTable = sourceTable)
+  //
+  
+  EliminateSubqueryAliases(targetTable) match {
+case r: NamedRelation if r.skipSchemaResolution =>
+  // Do not resolve the expression if the target table accepts any 
schema.
+  // This allows data sources to customize their own resolution logic 
using
+  // custom resolution rules.
+  m
+
+case _ =>
+  val newMatchedActions = m.matchedActions.map {
+case DeleteAction(deleteCondition) =>
+  val resolvedDeleteCondition = deleteCondition.map(
+resolveExpressionByPlanChildren(_, m))
+  DeleteAction(resolvedDeleteCondition)
+case UpdateAction(updateCondition, assignments) =>
+  val resolvedUpdateCondition = updateCondition.map(
+resolveExpressionByPlanChildren(_, m))
+  UpdateAction(
+resolvedUpdateCondition,
+// The update value can access columns from both target and 
source tables.
+resolveAssignments(assignments, m, resolveValuesWithSourceOnly 
= false))
+case UpdateStarAction(updateCondition) =>
+  // Hudi change: filter out meta fields
+  //
+  val assignments = targetTable.output.filter(a => 
!isMetaField(a.name)).map { attr =>
+Assignment(attr, UnresolvedAttribute(Seq(attr.name)))
+  }
+  //
+  
+  UpdateAction(
+updateCondition.map(resolveExpressionByPlanChildren(_, m)),
+// For UPDATE *, the value must from source table.
+resolveAssignments(assignments, m, resolveValuesWithSourceOnly 
= true))
+case o => o
+  }
+  val newNotMatchedActions = m.notMatchedActions.map {
+case InsertAction(insertCondition, assignments) =>
+  // The insert action is used when not matched, so its condition 
and value can only
+  // access columns from the source table.
+  val resolvedInsertCondition = insertCondition.map(
+resolveExpressionByPlanChildren(_, Project(Nil, 
m.sourceTable)))
+  InsertAction(
+resolvedInsertCondition,
+resolveAssignments(assignments, m, resolveValuesWithSourceOnly 
= true))
+case InsertStarAction(insertCondition) =>
+  // The insert action is used when not matched, so its condition 
and value can only
+  // access columns from the source table.
+  val resolvedInsertCondition = insertCondition.map(
+resolveExpressionByPlanChildren(_, Project(Nil, 
m.sourceTable)))
+  // Hudi change: filter out meta fields
+  //
+  val assignments = targetTable.output.filter(a => 
!isMetaField(a.name)).map { attr =>
+Assignment(attr, UnresolvedAttribute(Seq(attr.name)))
+  }
+  //
+  
+  InsertAction(
+resolvedInsertCondition,
+resolveAssignments(assignments, m, resolveValuesWithSourceOnly 
= true))
+case o => o

Review Comment:
   Spark 3.2 and 3.3 are the same. In a followup we may want to use the 3.4 code



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-03 Thread via GitHub


jonvex commented on code in PR #9083:
URL: https://github.com/apache/hudi/pull/9083#discussion_r1251342069


##
hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSpark32PlusAnalysis.scala:
##
@@ -128,9 +131,135 @@ case class HoodieSpark32PlusResolveReferences(spark: 
SparkSession) extends Rule[
   catalogTable.location.toString))
 LogicalRelation(relation, catalogTable)
   }
+case mO@MatchMergeIntoTable(targetTableO, sourceTableO, _)
+   // Hudi change: don't want to go to the spark mit resolution so we 
resolve the source and target if they haven't been
+  //
+  if !mO.resolved =>
+  lazy val analyzer = spark.sessionState.analyzer
+  val targetTable = if (targetTableO.resolved) targetTableO else 
analyzer.execute(targetTableO)
+  val sourceTable = if (sourceTableO.resolved) sourceTableO else 
analyzer.execute(sourceTableO)
+  val m = mO.asInstanceOf[MergeIntoTable].copy(targetTable = targetTable, 
sourceTable = sourceTable)
+  //
+  
+  EliminateSubqueryAliases(targetTable) match {
+case r: NamedRelation if r.skipSchemaResolution =>
+  // Do not resolve the expression if the target table accepts any 
schema.
+  // This allows data sources to customize their own resolution logic 
using
+  // custom resolution rules.
+  m
+
+case _ =>
+  val newMatchedActions = m.matchedActions.map {
+case DeleteAction(deleteCondition) =>
+  val resolvedDeleteCondition = deleteCondition.map(
+resolveExpressionByPlanChildren(_, m))
+  DeleteAction(resolvedDeleteCondition)
+case UpdateAction(updateCondition, assignments) =>
+  val resolvedUpdateCondition = updateCondition.map(
+resolveExpressionByPlanChildren(_, m))
+  UpdateAction(
+resolvedUpdateCondition,
+// The update value can access columns from both target and 
source tables.
+resolveAssignments(assignments, m, resolveValuesWithSourceOnly 
= false))
+case UpdateStarAction(updateCondition) =>
+  // Hudi change: filter out meta fields
+  //
+  val assignments = targetTable.output.filter(a => 
!isMetaField(a.name)).map { attr =>
+Assignment(attr, UnresolvedAttribute(Seq(attr.name)))
+  }
+  //
+  
+  UpdateAction(
+updateCondition.map(resolveExpressionByPlanChildren(_, m)),
+// For UPDATE *, the value must from source table.
+resolveAssignments(assignments, m, resolveValuesWithSourceOnly 
= true))
+case o => o
+  }
+  val newNotMatchedActions = m.notMatchedActions.map {
+case InsertAction(insertCondition, assignments) =>
+  // The insert action is used when not matched, so its condition 
and value can only
+  // access columns from the source table.
+  val resolvedInsertCondition = insertCondition.map(
+resolveExpressionByPlanChildren(_, Project(Nil, 
m.sourceTable)))
+  InsertAction(
+resolvedInsertCondition,
+resolveAssignments(assignments, m, resolveValuesWithSourceOnly 
= true))
+case InsertStarAction(insertCondition) =>
+  // The insert action is used when not matched, so its condition 
and value can only
+  // access columns from the source table.
+  val resolvedInsertCondition = insertCondition.map(
+resolveExpressionByPlanChildren(_, Project(Nil, 
m.sourceTable)))
+  // Hudi change: filter out meta fields
+  //
+  val assignments = targetTable.output.filter(a => 
!isMetaField(a.name)).map { attr =>
+Assignment(attr, UnresolvedAttribute(Seq(attr.name)))
+  }
+  //
+  
+  InsertAction(
+resolvedInsertCondition,
+resolveAssignments(assignments, m, resolveValuesWithSourceOnly 
= true))
+case o => o

Review Comment:
   I marked the custom changes by surrounding them in `//` marker comments:
   
   //
   changes
   //
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8978: [HUDI-6315] Optimize DELETE codepath to use meta fields instead of key generation and index lookup

2023-07-03 Thread via GitHub


hudi-bot commented on PR #8978:
URL: https://github.com/apache/hudi/pull/8978#issuecomment-1619265221

   
   ## CI report:
   
   * 7142fb46cef77575be4346fa0a9cf2fb7bee03b1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18272)
 
   * 79fac007f537294c5f8a2e40617296e3c44537dd Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18281)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-03 Thread via GitHub


jonvex commented on code in PR #9083:
URL: https://github.com/apache/hudi/pull/9083#discussion_r1251341306


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala:
##
@@ -43,23 +43,29 @@ object HoodieAnalysis extends SparkAdapterSupport {
 val rules: ListBuffer[RuleBuilder] = ListBuffer()
 
 // NOTE: This rule adjusts [[LogicalRelation]]s resolving into Hudi tables 
such that
-//   meta-fields are not affecting the resolution of the target 
columns to be updated by Spark.
+//   meta-fields are not affecting the resolution of the target 
columns to be updated by Spark (Except in the
+//   case of MergeInto. We leave the meta columns on the target table, 
and use other means to ensure resolution)

Review Comment:
   It makes aligning the schema more difficult, so it could be done in a follow-up.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-03 Thread via GitHub


jonvex commented on code in PR #9083:
URL: https://github.com/apache/hudi/pull/9083#discussion_r1251340425


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoKeyGenerator.scala:
##
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command
+
+import org.apache.avro.generic.GenericRecord
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord.{RECORD_KEY_META_FIELD_ORD, 
PARTITION_PATH_META_FIELD_ORD}
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.unsafe.types.UTF8String
+
+
+/**
+ * NOTE TO USERS: YOU SHOULD NOT SET THIS AS YOUR KEYGENERATOR

Review Comment:
   We could



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9117: [HUDI-6437] Fixing/optimizing record updates to RLI

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9117:
URL: https://github.com/apache/hudi/pull/9117#issuecomment-1619260606

   
   ## CI report:
   
   * 1e58d179d02fd489ff0b6404b6c270c0589a95d7 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18279)
 
   * 06cf7c29c35c4bf718f0d015bdf2cc3382deb068 UNKNOWN
   * 70250eb58fec89adce0e7a0ea1f0ccb03173e79a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18280)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8978: [HUDI-6315] Optimize DELETE codepath to use meta fields instead of key generation and index lookup

2023-07-03 Thread via GitHub


hudi-bot commented on PR #8978:
URL: https://github.com/apache/hudi/pull/8978#issuecomment-1619260402

   
   ## CI report:
   
   * 7142fb46cef77575be4346fa0a9cf2fb7bee03b1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18272)
 
   * 79fac007f537294c5f8a2e40617296e3c44537dd UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9117: [HUDI-6437] Fixing/optimizing record updates to RLI

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9117:
URL: https://github.com/apache/hudi/pull/9117#issuecomment-1619256505

   
   ## CI report:
   
   * 1e58d179d02fd489ff0b6404b6c270c0589a95d7 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18279)
 
   * 06cf7c29c35c4bf718f0d015bdf2cc3382deb068 UNKNOWN
   * 70250eb58fec89adce0e7a0ea1f0ccb03173e79a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-03 Thread via GitHub


jonvex commented on code in PR #9083:
URL: https://github.com/apache/hudi/pull/9083#discussion_r1251337496


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCreateRecordUtils.scala:
##
@@ -206,27 +222,29 @@ object HoodieCreateRecordUtils {
   }
 
   def getHoodieKeyAndMaybeLocationFromAvroRecord(keyGenerator: 
Option[BaseKeyGenerator], avroRec: GenericRecord,
- isPrepped: Boolean): 
(HoodieKey, Option[HoodieRecordLocation]) = {
+ isPrepped: Boolean, 
mergeIntoWrites: Boolean): (HoodieKey, Option[HoodieRecordLocation]) = {
+//use keygen for mergeIntoWrites recordKey and partitionPath because the 
keygenerator handles
+//fetching from the meta fields if they are populated and otherwise doing 
keygen
 val recordKey = if (isPrepped) {
   avroRec.get(HoodieRecord.RECORD_KEY_METADATA_FIELD).toString
 } else {
   keyGenerator.get.getRecordKey(avroRec)
-};
+}
 
 val partitionPath = if (isPrepped) {
   avroRec.get(HoodieRecord.PARTITION_PATH_METADATA_FIELD).toString
 } else {
   keyGenerator.get.getPartitionPath(avroRec)
-};
+}
 
 val hoodieKey = new HoodieKey(recordKey, partitionPath)
-val instantTime: Option[String] = if (isPrepped) {
+val instantTime: Option[String] = if (isPrepped || mergeIntoWrites) {

Review Comment:
   I think it might be changed at write time?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-03 Thread via GitHub


jonvex commented on code in PR #9083:
URL: https://github.com/apache/hudi/pull/9083#discussion_r1251336316


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieInternalProxyIndex.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.index;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.data.HoodieData;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieIndexException;
+import org.apache.hudi.table.HoodieTable;
+
+public class HoodieInternalProxyIndex extends HoodieIndex {
+
+  /**
+   * Index that does not do tagging. Its purpose is to be used for the Spark 
SQL MERGE INTO command.
+   * MERGE INTO does not need to use an index lookup because we get the 
location from the meta columns
+   * from the join.
+   */
+  public HoodieInternalProxyIndex(HoodieWriteConfig config) {
+super(config);
+  }
+
+  @Override
+  public <R> HoodieData<HoodieRecord<R>> 
tagLocation(HoodieData<HoodieRecord<R>> records, HoodieEngineContext context, 
HoodieTable hoodieTable) throws HoodieIndexException {
+return records;
+  }
+
+  @Override
+  public HoodieData<WriteStatus> updateLocation(HoodieData<WriteStatus> 
writeStatuses, HoodieEngineContext context, HoodieTable hoodieTable) throws 
HoodieIndexException {
+return writeStatuses;
+  }
+
+  @Override
+  public boolean rollbackCommit(String instantTime) {
+return false;
+  }
+
+  @Override
+  public boolean isGlobal() {
+return false;

Review Comment:
   I believe it comes into play when changing the partition path: 
https://issues.apache.org/jira/browse/HUDI-6471
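
   For intuition, a tiny self-contained sketch (plain Scala maps, not Hudi's 
index API) of why global vs. non-global lookup matters once a key moves to a 
new partition path:

```scala
// key -> (partitionPath, fileId) as previously indexed
val existing: Map[String, (String, String)] =
  Map("key1" -> ("2023-07-01", "file-A"))

val incoming = ("key1", "2023-07-03") // same key, new partition path

// A global index looks up by key alone, so it still finds the old location
// and can route a delete to the old partition before inserting into the new one.
val globalHit = existing.get(incoming._1) // Some(("2023-07-01","file-A"))

// A non-global index scopes lookup to (partitionPath, key), so the old
// location is missed and the update lands as a fresh insert in the new partition.
val nonGlobalHit = existing.find { case (k, (p, _)) =>
  k == incoming._1 && p == incoming._2
} // None

println(s"global=$globalHit nonGlobal=$nonGlobalHit")
```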



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-03 Thread via GitHub


yihua commented on code in PR #9083:
URL: https://github.com/apache/hudi/pull/9083#discussion_r1251329643


##
hudi-spark-datasource/hudi-spark3.0.x/src/main/scala/org/apache/spark/sql/catalyst/analysis/HoodieSpark30Analysis.scala:
##
@@ -0,0 +1,223 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.analysis.{EliminateSubqueryAliases, 
ResolveLambdaVariables, UnresolvedAttribute, UnresolvedExtractValue, 
caseInsensitiveResolution, withPosition}
+import org.apache.spark.sql.catalyst.expressions.{Alias, CurrentDate, 
CurrentTimestamp, Expression, ExtractValue, GetStructField, LambdaFunction}
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.util.toPrettySQL
+import org.apache.spark.sql.hudi.HoodieSqlCommonUtils
+
+/**
+ * NOTE: Taken from HoodieSpark2Analysis and modified to resolve source and 
target tables if not already resolved
+ *
+ *   PLEASE REFRAIN FROM MAKING ANY CHANGES TO THIS CODE UNLESS ABSOLUTELY 
NECESSARY
+ */
+object HoodieSpark30Analysis {

Review Comment:
   For Spark 3.0 and 3.1, have you checked if the code here is different from 
Spark's `ResolveReferences`?  Given we introduce the custom rule here, we 
should still match the implementation of `ResolveReferences` in the 
corresponding Spark version except for the custom logic you added.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-03 Thread via GitHub


yihua commented on code in PR #9083:
URL: https://github.com/apache/hudi/pull/9083#discussion_r1251327647


##
hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSpark32PlusAnalysis.scala:
##
@@ -128,9 +131,135 @@ case class HoodieSpark32PlusResolveReferences(spark: 
SparkSession) extends Rule[
   catalogTable.location.toString))
 LogicalRelation(relation, catalogTable)
   }
+case mO@MatchMergeIntoTable(targetTableO, sourceTableO, _)
+   Hudi change: don't want to go to the spark mit resolution so we 
resolve the source and target if they haven't been
+  //
+  if !mO.resolved =>
+  lazy val analyzer = spark.sessionState.analyzer
+  val targetTable = if (targetTableO.resolved) targetTableO else 
analyzer.execute(targetTableO)
+  val sourceTable = if (sourceTableO.resolved) sourceTableO else 
analyzer.execute(sourceTableO)
+  val m = mO.asInstanceOf[MergeIntoTable].copy(targetTable = targetTable, 
sourceTable = sourceTable)
+  //
+  

Review Comment:
   docs to add?  Could you check all places where the comment is empty?



##
hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSpark32PlusAnalysis.scala:
##
@@ -128,9 +131,135 @@ case class HoodieSpark32PlusResolveReferences(spark: 
SparkSession) extends Rule[
   catalogTable.location.toString))
 LogicalRelation(relation, catalogTable)
   }
+case mO@MatchMergeIntoTable(targetTableO, sourceTableO, _)
+   Hudi change: don't want to go to the spark mit resolution so we 
resolve the source and target if they haven't been
+  //
+  if !mO.resolved =>
+  lazy val analyzer = spark.sessionState.analyzer
+  val targetTable = if (targetTableO.resolved) targetTableO else 
analyzer.execute(targetTableO)
+  val sourceTable = if (sourceTableO.resolved) sourceTableO else 
analyzer.execute(sourceTableO)
+  val m = mO.asInstanceOf[MergeIntoTable].copy(targetTable = targetTable, 
sourceTable = sourceTable)
+  //
+  
+  EliminateSubqueryAliases(targetTable) match {
+case r: NamedRelation if r.skipSchemaResolution =>
+  // Do not resolve the expression if the target table accepts any 
schema.
+  // This allows data sources to customize their own resolution logic 
using
+  // custom resolution rules.
+  m
+
+case _ =>
+  val newMatchedActions = m.matchedActions.map {
+case DeleteAction(deleteCondition) =>
+  val resolvedDeleteCondition = deleteCondition.map(
+resolveExpressionByPlanChildren(_, m))
+  DeleteAction(resolvedDeleteCondition)
+case UpdateAction(updateCondition, assignments) =>
+  val resolvedUpdateCondition = updateCondition.map(
+resolveExpressionByPlanChildren(_, m))
+  UpdateAction(
+resolvedUpdateCondition,
+// The update value can access columns from both target and 
source tables.
+resolveAssignments(assignments, m, resolveValuesWithSourceOnly 
= false))
+case UpdateStarAction(updateCondition) =>
+  Hudi change: filter out meta fields
+  //
+  val assignments = targetTable.output.filter(a => 
!isMetaField(a.name)).map { attr =>
+Assignment(attr, UnresolvedAttribute(Seq(attr.name)))
+  }
+  //
+  
+  UpdateAction(
+updateCondition.map(resolveExpressionByPlanChildren(_, m)),
+// For UPDATE *, the value must from source table.
+resolveAssignments(assignments, m, resolveValuesWithSourceOnly 
= true))
+case o => o
+  }
+  val newNotMatchedActions = m.notMatchedActions.map {
+case InsertAction(insertCondition, assignments) =>
+  // The insert action is used when not matched, so its condition 
and value can only
+  // access columns from the source table.
+  val resolvedInsertCondition = insertCondition.map(
+resolveExpressionByPlanChildren(_, Project(Nil, 
m.sourceTable)))
+  InsertAction(
+resolvedInsertCondition,
+resolveAssignments(assignments, m, resolveValuesWithSourceOnly 
= true))
+case InsertStarAction(insertCondition) =>
+  // The insert action is used when not matched, so its condition 
and value can only
+  // access columns from the source table.
+  val resolvedInsertCondition = insertCondition.map(
+resolveExpressionByPlanChildren(_, Project(Nil, 
m.sourceTable)))
+  Hudi change: filter out meta fields
+  //
+  val assignments = targetTable.output.filter(a => 

[GitHub] [hudi] hudi-bot commented on pull request #9117: [HUDI-6437] Fixing/optimizing record updates to RLI

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9117:
URL: https://github.com/apache/hudi/pull/9117#issuecomment-1619229336

   
   ## CI report:
   
   * 1e58d179d02fd489ff0b6404b6c270c0589a95d7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18279)
 
   * 06cf7c29c35c4bf718f0d015bdf2cc3382deb068 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9108: [HUDI-6462] Add Hudi client init callback interface

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9108:
URL: https://github.com/apache/hudi/pull/9108#issuecomment-1619229269

   
   ## CI report:
   
   * 7f7dae8012b51f42fe9452e1ecc555b524a7dc6b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18276)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9083:
URL: https://github.com/apache/hudi/pull/9083#issuecomment-1619229040

   
   ## CI report:
   
   * 3a0bfb88049cf2c0f8afe5c925dbd76fa6f7cd89 UNKNOWN
   * 575c165468bcf8a4650935ab4020975a8d75e73e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18275)
 
   * f156c1694aca3a9e2ca4ed26959c6a5a1b773354 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18278)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-03 Thread via GitHub


yihua commented on code in PR #9083:
URL: https://github.com/apache/hudi/pull/9083#discussion_r1251313925


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala:
##
@@ -298,26 +333,34 @@ case class MergeIntoHoodieTableCommand(mergeInto: 
MergeIntoTable) extends Hoodie
* {@code ts = source.sts}
* 
*/
-  def sourceDataset: DataFrame = {
+  def projectedJoinedDataset: DataFrame = {
 val resolver = sparkSession.sessionState.analyzer.resolver
 
-val sourceTablePlan = mergeInto.sourceTable
-val sourceTableOutput = sourceTablePlan.output
+// We want to join the source and target tables.
+// Then we want to project the output so that we have the meta columns 
from the target table
+// followed by the data columns of the source table
+val tablemetacols = mergeInto.targetTable.output.filter(a => 
isMetaField(a.name))

Review Comment:
   nit: `tableMetaCols`
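
   For readers following along, a hedged DataFrame sketch of the 
join-then-project shape described in the comment above (column and parameter 
names here are illustrative, not the command's actual code):

```scala
import org.apache.spark.sql.DataFrame

// Sketch only: join target and source, then project the target's meta columns
// followed by the source's data columns. A right outer join keeps unmatched
// source rows (inserts), which simply carry null meta columns.
def projectedJoined(target: DataFrame, source: DataFrame, joinKey: String): DataFrame = {
  val metaCols = target.columns.filter(_.startsWith("_hoodie_")).map(target(_))
  val dataCols = source.columns.map(source(_))
  target.join(source, target(joinKey) === source(joinKey), "right_outer")
    .select(metaCols ++ dataCols: _*)
}
```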



##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable3.scala:
##
@@ -0,0 +1,353 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi
+
+import org.apache.hudi.{HoodieSparkUtils, ScalaAssertionSupport}
+
+class TestMergeIntoTable3 extends HoodieSparkSqlTestBase with 
ScalaAssertionSupport {

Review Comment:
   It's OK to add a new test class.  I think Siva's point is that, instead of 
naming it with numbers, we should name it with readability in mind, sth like 
`TestMergeIntoWithNonRecordKeyField`.



##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable2.scala:
##
@@ -440,6 +441,7 @@ class TestMergeIntoTable2 extends HoodieSparkSqlTestBase {
 })
   }
 
+

Review Comment:
   nit: remove empty line



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9117: [HUDI-6437] Fixing/optimizing record updates to RLI

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9117:
URL: https://github.com/apache/hudi/pull/9117#issuecomment-1619223964

   
   ## CI report:
   
   * 1e58d179d02fd489ff0b6404b6c270c0589a95d7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9083:
URL: https://github.com/apache/hudi/pull/9083#issuecomment-1619223779

   
   ## CI report:
   
   * 3a0bfb88049cf2c0f8afe5c925dbd76fa6f7cd89 UNKNOWN
   * 575c165468bcf8a4650935ab4020975a8d75e73e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18275)
 
   * f156c1694aca3a9e2ca4ed26959c6a5a1b773354 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9083:
URL: https://github.com/apache/hudi/pull/9083#issuecomment-1619217778

   
   ## CI report:
   
   * 3a0bfb88049cf2c0f8afe5c925dbd76fa6f7cd89 UNKNOWN
   * 575c165468bcf8a4650935ab4020975a8d75e73e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18275)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan opened a new pull request, #9117: [HUDI-6437] Fixing/optimizing record updates to RLI

2023-07-03 Thread via GitHub


nsivabalan opened a new pull request, #9117:
URL: https://github.com/apache/hudi/pull/9117

   ### Change Logs
   
   Optimizing updates to the RLI partition in MDT when update partition path = 
false, plus a few minor fixes.
   
   ### Impact
   
   Optimizing updates to the RLI partition in MDT when update partition path = 
false, plus a few minor fixes.
   
   ### Risk level (write none, low medium or high below)
   
   low.
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-03 Thread via GitHub


yihua commented on code in PR #9083:
URL: https://github.com/apache/hudi/pull/9083#discussion_r1251290919


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieInternalProxyIndex.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.index;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.data.HoodieData;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieIndexException;
+import org.apache.hudi.table.HoodieTable;
+
+public class HoodieInternalProxyIndex extends HoodieIndex {
+
+  /**
+   * Index that does not do tagging. Its purpose is to be used for the Spark 
SQL MERGE INTO command.
+   * MERGE INTO does not need to use an index lookup because we get the 
location from the meta columns
+   * from the join.
+   */

Review Comment:
   nit: could you also add clarification in the docs on why we still need to 
implement a dummy index instead of using prepped write action?



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoKeyGenerator.scala:
##
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command
+
+import org.apache.avro.generic.GenericRecord
+import org.apache.hudi.common.config.TypedProperties
+import org.apache.hudi.common.model.HoodieRecord.{RECORD_KEY_META_FIELD_ORD, 
PARTITION_PATH_META_FIELD_ORD}
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.unsafe.types.UTF8String
+
+
+/**
+ * NOTE TO USERS: YOU SHOULD NOT SET THIS AS YOUR KEYGENERATOR
+ *
+ * Keygenerator that is meant to be used internally for the Spark SQL MERGE 
INTO command.
+ * It will attempt to get the partition path and record key from the 
meta fields, but will
+ * fall back to the SQL key generator if the meta field is not populated.
+ *
+ */
+class MergeIntoKeyGenerator(props: TypedProperties) extends 
SqlKeyGenerator(props) {
+
+  override def getRecordKey(record: GenericRecord): String = {
+val recordKey = record.get(RECORD_KEY_META_FIELD_ORD)
+if (recordKey != null) {
+  recordKey.toString
+} else {
+  super.getRecordKey(record)
+}
+  }
+
+  override def getRecordKey(row: Row): String = {
+val recordKey = row.get(RECORD_KEY_META_FIELD_ORD)
+if (recordKey != null) {
+  recordKey.toString
+} else {
+  super.getRecordKey(row)
+}
+  }
+
+  override def getRecordKey(internalRow: InternalRow, schema: StructType): 
UTF8String = {
+val recordKey = internalRow.getUTF8String(RECORD_KEY_META_FIELD_ORD)
+if (recordKey != null) {
+  recordKey
+} else {
+  super.getRecordKey(internalRow, schema)
+}
+  }
+
+  override def getPartitionPath(record: GenericRecord): String = {
+val partitionPath = record.get(PARTITION_PATH_META_FIELD_ORD)
+if (partitionPath != null) {
+  partitionPath.toString
+} else {
+  super.getPartitionPath(record)
+}
+  }
+
+  override def getPartitionPath(row: Row): String = {
+val partitionPath = row.get(PARTITION_PATH_META_FIELD_ORD)
+if (partitionPath != null) {
+  partitionPath.toString

Review Comment:
   nit: similar here for 

[hudi] branch master updated: [HUDI-6467] Fix deletes handling in rli when partition path is updated (#9114)

2023-07-03 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new c5b5953b0b8 [HUDI-6467] Fix deletes handling in rli when partition 
path is updated (#9114)
c5b5953b0b8 is described below

commit c5b5953b0b8666aefb3b51cba29ac1727154f62c
Author: Sagar Sumit 
AuthorDate: Tue Jul 4 03:26:30 2023 +0530

[HUDI-6467] Fix deletes handling in rli when partition path is updated 
(#9114)

* [HUDI-6467] Fix deletes handling in rli when partition path is updated

-

Co-authored-by: sivabalan 
---
 .../metadata/HoodieBackedTableMetadataWriter.java  | 46 ++
 .../common/model/HoodieRecordGlobalLocation.java   |  5 ++-
 .../apache/hudi/metadata/BaseTableMetadata.java|  7 +++-
 .../TestGlobalIndexEnableUpdatePartitions.java |  6 +--
 4 files changed, 39 insertions(+), 25 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index 0908ad79708..a7b45ee6524 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -82,9 +82,11 @@ import java.util.LinkedList;
 import java.util.List;
 import java.util.Locale;
 import java.util.Map;
+import java.util.Objects;
 import java.util.Set;
 import java.util.stream.Collectors;
 import java.util.stream.IntStream;
+import java.util.stream.Stream;
 
 import static 
org.apache.hudi.common.config.HoodieMetadataConfig.DEFAULT_METADATA_POPULATE_META_FIELDS;
 import static org.apache.hudi.common.table.HoodieTableConfig.ARCHIVELOG_FOLDER;
@@ -299,7 +301,7 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableMeta
   exists = false;
 }
 
-return  exists;
+return exists;
   }
 
   /**
@@ -489,7 +491,7 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableMeta
* Read the record keys from base files in partitions and return records.
*/
   private HoodieData<HoodieRecord> 
readRecordKeysFromBaseFiles(HoodieEngineContext engineContext,
-  List<Pair<String, String>> partitionBaseFilePairs) {
+   
List<Pair<String, String>> partitionBaseFilePairs) {
 if (partitionBaseFilePairs.isEmpty()) {
   return engineContext.emptyHoodieData();
 }
@@ -1101,7 +1103,7 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableMeta
 .getCommitTimeline().filterCompletedInstants().lastInstant();
 if (lastCompletedCompactionInstant.isPresent()
 && metadataMetaClient.getActiveTimeline().filterCompletedInstants()
-
.findInstantsAfter(lastCompletedCompactionInstant.get().getTimestamp()).countInstants()
 < 3) {
+
.findInstantsAfter(lastCompletedCompactionInstant.get().getTimestamp()).countInstants()
 < 3) {
   // do not clean the log files immediately after compaction to give some 
buffer time for metadata table reader,
   // because there is case that the reader has prepared for the log file 
readers already before the compaction completes
   // while before/during the reading of the log files, the cleaning 
triggers and delete the reading files,
@@ -1159,10 +1161,27 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableMeta
* @param writeStatuses {@code WriteStatus} from the write operation
*/
   private HoodieData<HoodieRecord> 
getRecordIndexUpdates(HoodieData<WriteStatus> writeStatuses) {
-return writeStatuses.flatMap(writeStatus -> {
-  List recordList = new LinkedList<>();
-  for (HoodieRecordDelegate recordDelegate : 
writeStatus.getWrittenRecordDelegates()) {
-if (!writeStatus.isErrored(recordDelegate.getHoodieKey())) {
+// 1. List<Pair<recordKey, recordDelegate>>
+// 2. Reduce by key: accept keys only when new location is not empty
+return writeStatuses.map(writeStatus -> 
writeStatus.getWrittenRecordDelegates().stream()
+.map(recordDelegate -> Pair.of(recordDelegate.getRecordKey(), 
recordDelegate)))
+.flatMapToPair(Stream::iterator)
+.reduceByKey((recordDelegate1, recordDelegate2) -> {
+  if 
(recordDelegate1.getRecordKey().equals(recordDelegate2.getRecordKey())) {
+if (recordDelegate1.getNewLocation().isPresent() && 
recordDelegate1.getNewLocation().get().getFileId() != null) {
+  return recordDelegate1;
+} else if (recordDelegate2.getNewLocation().isPresent() && 
recordDelegate2.getNewLocation().get().getFileId() != null) {
+  return recordDelegate2;
+} else {
+  // should not come here, one of the above must have a new location

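
A simplified, self-contained sketch of the reduce-by-key de-duplication in the 
hunk above; `Delegate` and its fields are stand-ins for Hudi's record delegate 
types:

```scala
// Stand-in for HoodieRecordDelegate: a record key plus an optional new file id.
case class Delegate(recordKey: String, newFileId: Option[String])

// When a partition-path update produces both a delete (no new location) and an
// insert (new location) for the same key, keep the delegate carrying the new
// location so the record index points at the record's latest file.
def preferNewLocation(a: Delegate, b: Delegate): Delegate =
  if (a.newFileId.isDefined) a
  else if (b.newFileId.isDefined) b
  else a // not expected: at least one side should carry a new location

val delegates = Seq(
  Delegate("key1", None),           // delete in the old partition path
  Delegate("key1", Some("file-B")), // insert in the new partition path
  Delegate("key2", Some("file-C"))
)

val deduped = delegates
  .groupBy(_.recordKey)
  .values
  .map(_.reduce(preferNewLocation))

deduped.foreach(println) // key1 resolves to file-B; the bare delete is dropped
```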
[GitHub] [hudi] nsivabalan merged pull request #9114: [HUDI-6467] Fix deletes handling in rli when partition path is updated

2023-07-03 Thread via GitHub


nsivabalan merged PR #9114:
URL: https://github.com/apache/hudi/pull/9114


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9114: [HUDI-6467] Fix deletes handling in rli when partition path is updated

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9114:
URL: https://github.com/apache/hudi/pull/9114#issuecomment-1619183340

   
   ## CI report:
   
   * 07601ef7fa5c3b16846e5f973c9f1845c3d6 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18274)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-03 Thread via GitHub


yihua commented on code in PR #9083:
URL: https://github.com/apache/hudi/pull/9083#discussion_r1251283969


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieInternalProxyIndex.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.index;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.data.HoodieData;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieIndexException;
+import org.apache.hudi.table.HoodieTable;
+
+public class HoodieInternalProxyIndex extends HoodieIndex {
+
+  /**
+   * Index that does not do tagging. It is meant for the Spark SQL MERGE INTO
+   * command, which does not need an index lookup because the record location
+   * is obtained from the meta columns produced by the join.
+   */
+  public HoodieInternalProxyIndex(HoodieWriteConfig config) {
+    super(config);
+  }
+
+  @Override
+  public <R> HoodieData<HoodieRecord<R>> 
tagLocation(HoodieData<HoodieRecord<R>> records, HoodieEngineContext context, 
HoodieTable hoodieTable) throws HoodieIndexException {
+    return records;
+  }
+
+  @Override
+  public HoodieData<WriteStatus> updateLocation(HoodieData<WriteStatus> 
writeStatuses, HoodieEngineContext context, HoodieTable hoodieTable) throws 
HoodieIndexException {
+    return writeStatuses;
+  }
+
+  @Override
+  public boolean rollbackCommit(String instantTime) {
+    return false;
+  }
+
+  @Override
+  public boolean isGlobal() {
+    return false;

Review Comment:
   If the global version is to be implemented, does the user simply set a 
config so that we return `true` here for the global index?  Since the record 
location is already known from the meta columns, how does the global/non-global 
distinction come into play here?
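
   For illustration only, a hedged sketch of one shape the global variant 
could take (the extra constructor flag below is hypothetical, not part of the 
PR):

       private final boolean global;

       // Hypothetical: callers that want global-index semantics (e.g. allowing
       // partition path updates) pass global = true when building the index.
       public HoodieInternalProxyIndex(HoodieWriteConfig config, boolean global) {
         super(config);
         this.global = global;
       }

       @Override
       public boolean isGlobal() {
         return global;
       }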



##
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala:
##
@@ -202,4 +203,12 @@ trait SparkAdapter extends Serializable {
* Converts instance of [[StorageLevel]] to a corresponding string
*/
   def convertStorageLevelToString(level: StorageLevel): String
+
+  /**
+   * Calls fail analysis on the given attribute, reporting the candidate
+   * columns it could not be resolved against.
+   */
+  def failAnalysisForMIT(a: Attribute, cols: String): Unit = {}
+
+  def createMITJoin(left: LogicalPlan, right: LogicalPlan, joinType: JoinType, 
condition: Option[Expression], hint: String): LogicalPlan

Review Comment:
   nit: Put these into `HoodieCatalystPlansUtils`?



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieInternalConfig.java:
##
@@ -46,6 +46,13 @@ public class HoodieInternalConfig extends HoodieConfig {
   .withDocumentation("For SQL operations, if enables bulk_insert 
operation, "
   + "this configure will take effect to decide overwrite whole table 
or partitions specified");
 
+  public static final ConfigProperty<String> SQL_MERGE_INTO_WRITES = 
ConfigProperty
+      .key("hoodie.internal.sql.merge.into.writes")
+      .defaultValue("false")
+      .markAdvanced()
+      .withDocumentation("For merge into from spark-sql, we need some special 
handling. for eg, schema "

Review Comment:
   nit: add `.sinceVersion("0.14.0")`
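
   For reference, a minimal sketch of the property with the suggested 
annotation applied (builder calls as in the diff above; the documentation 
string is abbreviated here):

       public static final ConfigProperty<String> SQL_MERGE_INTO_WRITES = ConfigProperty
           .key("hoodie.internal.sql.merge.into.writes")
           .defaultValue("false")
           .markAdvanced()
           .sinceVersion("0.14.0")  // the suggested addition
           .withDocumentation("Internal marker for writes issued by Spark SQL MERGE INTO");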



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieInternalConfig.java:
##
@@ -46,6 +46,13 @@ public class HoodieInternalConfig extends HoodieConfig {
   .withDocumentation("For SQL operations, if enables bulk_insert 
operation, "
   + "this configure will take effect to decide overwrite whole table 
or partitions specified");
 
+  public static final ConfigProperty<String> SQL_MERGE_INTO_WRITES = 
ConfigProperty
+      .key("hoodie.internal.sql.merge.into.writes")
+      .defaultValue("false")
+      .markAdvanced()
+      .withDocumentation("For merge into from spark-sql, we need some special 
handling. for eg, schema "
+          + "validation should be disabled for writes from merge into. As well 
as reuse of meta cols for keygen and skip tagging");

Review Comment:
   Could you add sth around: `This internal config is used by Merge Into SQL 
logic only to mark such use case and let 

[GitHub] [hudi] hudi-bot commented on pull request #9116: [HUDI-6470] Add spark sql conf in AlterTableCommand

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9116:
URL: https://github.com/apache/hudi/pull/9116#issuecomment-1619119598

   
   ## CI report:
   
   * 80b8747dfec3e4cac7719796a46e52e9846081f5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18271)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8978: [HUDI-6315] Optimize DELETE codepath to use meta fields instead of key generation and index lookup

2023-07-03 Thread via GitHub


hudi-bot commented on PR #8978:
URL: https://github.com/apache/hudi/pull/8978#issuecomment-1619119227

   
   ## CI report:
   
   * 7142fb46cef77575be4346fa0a9cf2fb7bee03b1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18272)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9108: [HUDI-6462] Add Hudi client init callback interface

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9108:
URL: https://github.com/apache/hudi/pull/9108#issuecomment-1619061262

   
   ## CI report:
   
   * 298590e894d6eb2a39c4eb523b0b8e09903f9635 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18273)
 
   * 7f7dae8012b51f42fe9452e1ecc555b524a7dc6b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18276)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9083:
URL: https://github.com/apache/hudi/pull/9083#issuecomment-1619061131

   
   ## CI report:
   
   * 3a0bfb88049cf2c0f8afe5c925dbd76fa6f7cd89 UNKNOWN
   * 633f78cab4bdf225125bd028cf4a2b141844ef09 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18270)
 
   * 575c165468bcf8a4650935ab4020975a8d75e73e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18275)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9083:
URL: https://github.com/apache/hudi/pull/9083#issuecomment-1619053605

   
   ## CI report:
   
   * 3a0bfb88049cf2c0f8afe5c925dbd76fa6f7cd89 UNKNOWN
   * 633f78cab4bdf225125bd028cf4a2b141844ef09 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18270)
 
   * 575c165468bcf8a4650935ab4020975a8d75e73e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9108: [HUDI-6462] Add Hudi client init callback interface

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9108:
URL: https://github.com/apache/hudi/pull/9108#issuecomment-1619053709

   
   ## CI report:
   
   * 3fbcdb8f1f2c7504b8564ead1d065c1d862f83fb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18246)
 
   * 298590e894d6eb2a39c4eb523b0b8e09903f9635 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18273)
 
   * 7f7dae8012b51f42fe9452e1ecc555b524a7dc6b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9083:
URL: https://github.com/apache/hudi/pull/9083#issuecomment-1619046246

   
   ## CI report:
   
   * 3a0bfb88049cf2c0f8afe5c925dbd76fa6f7cd89 UNKNOWN
   * 633f78cab4bdf225125bd028cf4a2b141844ef09 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18270)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #9108: [HUDI-6462] Add Hudi client init callback interface

2023-07-03 Thread via GitHub


yihua commented on code in PR #9108:
URL: https://github.com/apache/hudi/pull/9108#discussion_r1251218379


##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/callback/TestThrowExceptionCallback.java:
##
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.callback;
+
+import org.apache.hudi.client.BaseHoodieClient;
+import org.apache.hudi.exception.HoodieIOException;
+
+/**
+ * A test {@link HoodieClientInitCallback} implementation to throw an 
exception.
+ */
+public class TestThrowExceptionCallback implements HoodieClientInitCallback {

Review Comment:
   OK, the inner classes have to be static to work.  Now all test 
implementations are in inner classes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6473) RLI and update partition path follow up

2023-07-03 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-6473:
-

 Summary: RLI and update partition path follow up
 Key: HUDI-6473
 URL: https://issues.apache.org/jira/browse/HUDI-6473
 Project: Apache Hudi
  Issue Type: Bug
  Components: index, metadata
Reporter: sivabalan narayanan


We found an issue w/ RLI and update partition path cases. 

We put in an initial fix to unblock broken master here 
[https://github.com/apache/hudi/pull/9114] 

but there are some follow-ups:
 * Avoid reduceByKey for the update partition path = false case
 * Fix parallelism for reduceByKey
 * Fix RLI MOR and update partition path test case. 
TestGlobalIndexEnableUpdatePartitions.testUpdatePartitionsThenDelete



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6472) Spark Sql Merge Into does not ignore case

2023-07-03 Thread Jonathan Vexler (Jira)
Jonathan Vexler created HUDI-6472:
-

 Summary: Spark Sql  Merge Into does not ignore case
 Key: HUDI-6472
 URL: https://issues.apache.org/jira/browse/HUDI-6472
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark-sql
Reporter: Jonathan Vexler


With the introduction of the merge into changes in HUDI-6464, case 
insensitivity no longer works. See the commented-out tests in 
TestMergeIntoTable2.scala.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9114: [HUDI-6467] Fix deletes handling in rli when partition path is updated

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9114:
URL: https://github.com/apache/hudi/pull/9114#issuecomment-1618997883

   
   ## CI report:
   
   * 02a297b2e0cdfd364b2dccffdad4ff6df9adf564 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18265)
 
   * 07601ef7fa5c3b16846e5f973c9f1845c3d6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18274)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9108: [HUDI-6462] Add Hudi client init callback interface

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9108:
URL: https://github.com/apache/hudi/pull/9108#issuecomment-1618997748

   
   ## CI report:
   
   * 3fbcdb8f1f2c7504b8564ead1d065c1d862f83fb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18246)
 
   * 298590e894d6eb2a39c4eb523b0b8e09903f9635 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18273)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9114: [HUDI-6467] Fix deletes handling in rli when partition path is updated

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9114:
URL: https://github.com/apache/hudi/pull/9114#issuecomment-1618988954

   
   ## CI report:
   
   * 02a297b2e0cdfd364b2dccffdad4ff6df9adf564 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18265)
 
   * 07601ef7fa5c3b16846e5f973c9f1845c3d6 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9108: [HUDI-6462] Add Hudi client init callback interface

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9108:
URL: https://github.com/apache/hudi/pull/9108#issuecomment-1618988868

   
   ## CI report:
   
   * 3fbcdb8f1f2c7504b8564ead1d065c1d862f83fb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18246)
 
   * 298590e894d6eb2a39c4eb523b0b8e09903f9635 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9115: [HUDI-6469] Revert HUDI-6311

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9115:
URL: https://github.com/apache/hudi/pull/9115#issuecomment-1618981905

   
   ## CI report:
   
   * 2a046240c1e7c0a18f9b57c0845298ea65b72951 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18269)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #9114: [HUDI-6467] Fix deletes handling in rli when partition path is updated

2023-07-03 Thread via GitHub


yihua commented on code in PR #9114:
URL: https://github.com/apache/hudi/pull/9114#discussion_r1251176617


##
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestGlobalIndexEnableUpdatePartitions.java:
##
@@ -65,8 +65,8 @@ private static Stream<Arguments> getTableTypeAndIndexType() {
         Arguments.of(COPY_ON_WRITE, GLOBAL_BLOOM),
         Arguments.of(COPY_ON_WRITE, RECORD_INDEX),
         Arguments.of(MERGE_ON_READ, GLOBAL_SIMPLE),
-        Arguments.of(MERGE_ON_READ, GLOBAL_BLOOM),
-        Arguments.of(MERGE_ON_READ, RECORD_INDEX)
+        Arguments.of(MERGE_ON_READ, GLOBAL_BLOOM)
+        // Arguments.of(MERGE_ON_READ, RECORD_INDEX)

Review Comment:
   Is this still failing?



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1159,10 +1161,27 @@ protected boolean 
validateTimelineBeforeSchedulingCompaction(Option<String> inFl
* @param writeStatuses {@code WriteStatus} from the write operation
*/
   private HoodieData<HoodieRecord> 
getRecordIndexUpdates(HoodieData<WriteStatus> writeStatuses) {
-    return writeStatuses.flatMap(writeStatus -> {
-      List<HoodieRecord> recordList = new LinkedList<>();
-      for (HoodieRecordDelegate recordDelegate : 
writeStatus.getWrittenRecordDelegates()) {
-        if (!writeStatus.isErrored(recordDelegate.getHoodieKey())) {
+    // 1. List<Pair<recordKey, recordDelegate>>
+    // 2. Reduce by key: accept keys only when new location is set
+    return writeStatuses.map(writeStatus -> 
writeStatus.getWrittenRecordDelegates().stream()
+        .map(recordDelegate -> Pair.of(recordDelegate.getRecordKey(), 
recordDelegate)))
+        .flatMapToPair(Stream::iterator)
+        .reduceByKey((recordDelegate1, recordDelegate2) -> {
+          if 
(recordDelegate1.getRecordKey().equals(recordDelegate2.getRecordKey())) {
+            if (recordDelegate1.getNewLocation().isPresent() && 
recordDelegate1.getNewLocation().get().getFileId() != null) {
+              return recordDelegate1;
+            } else if (recordDelegate2.getNewLocation().isPresent() && 
recordDelegate2.getNewLocation().get().getFileId() != null) {
+              return recordDelegate2;
+            } else {
+              // should not come here, one of the above must have a new 
location set
+              return null;
+            }
+          } else {
+            return recordDelegate1;
+          }
+        }, 1)

Review Comment:
   Parallelism should be adjustable, not 1?
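
   A hedged sketch of one way to make it adjustable, assuming 
HoodieData#getNumPartitions() is available on the input:

       // Derive the shuffle width from the incoming write statuses instead of
       // hardcoding 1; Math.max guards against an empty input reporting 0.
       int reduceParallelism = Math.max(1, writeStatuses.getNumPartitions());
       // ... then pass it through in place of the literal:
       //     .reduceByKey((recordDelegate1, recordDelegate2) -> { ... }, reduceParallelism)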



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1159,10 +1161,27 @@ protected boolean 
validateTimelineBeforeSchedulingCompaction(Option<String> inFl
* @param writeStatuses {@code WriteStatus} from the write operation
*/
   private HoodieData<HoodieRecord> 
getRecordIndexUpdates(HoodieData<WriteStatus> writeStatuses) {
-    return writeStatuses.flatMap(writeStatus -> {
-      List<HoodieRecord> recordList = new LinkedList<>();
-      for (HoodieRecordDelegate recordDelegate : 
writeStatus.getWrittenRecordDelegates()) {
-        if (!writeStatus.isErrored(recordDelegate.getHoodieKey())) {
+    // 1. List<Pair<recordKey, recordDelegate>>
+    // 2. Reduce by key: accept keys only when new location is set
+    return writeStatuses.map(writeStatus -> 
writeStatus.getWrittenRecordDelegates().stream()
+        .map(recordDelegate -> Pair.of(recordDelegate.getRecordKey(), 
recordDelegate)))
+        .flatMapToPair(Stream::iterator)
+        .reduceByKey((recordDelegate1, recordDelegate2) -> {
+          if 
(recordDelegate1.getRecordKey().equals(recordDelegate2.getRecordKey())) {
+            if (recordDelegate1.getNewLocation().isPresent() && 
recordDelegate1.getNewLocation().get().getFileId() != null) {
+              return recordDelegate1;
+            } else if (recordDelegate2.getNewLocation().isPresent() && 
recordDelegate2.getNewLocation().get().getFileId() != null) {
+              return recordDelegate2;
+            } else {
+              // should not come here, one of the above must have a new 
location set
+              return null;

Review Comment:
   Should this throw an exception?
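
   If it should, a minimal sketch of the branch (message wording assumed; 
HoodieException is Hudi's generic runtime exception):

       } else {
         // should not come here, one of the above must have a new location set
         throw new HoodieException("No new location found for record key "
             + recordDelegate1.getRecordKey());
       }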



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1159,10 +1161,27 @@ protected boolean 
validateTimelineBeforeSchedulingCompaction(Option<String> inFl
* @param writeStatuses {@code WriteStatus} from the write operation
*/
   private HoodieData<HoodieRecord> 
getRecordIndexUpdates(HoodieData<WriteStatus> writeStatuses) {
-    return writeStatuses.flatMap(writeStatus -> {
-      List<HoodieRecord> recordList = new LinkedList<>();
-      for (HoodieRecordDelegate recordDelegate : 
writeStatus.getWrittenRecordDelegates()) {
-        if (!writeStatus.isErrored(recordDelegate.getHoodieKey())) {
+    // 1. List<Pair<recordKey, recordDelegate>>
+    // 2. Reduce by key: accept keys only when new location is set
+    return writeStatuses.map(writeStatus -> 
+        .map(recordDelegate -> 
+.map(recordDelegate -> 

[GitHub] [hudi] yihua commented on a diff in pull request #9108: [HUDI-6462] Add Hudi client init callback interface

2023-07-03 Thread via GitHub


yihua commented on code in PR #9108:
URL: https://github.com/apache/hudi/pull/9108#discussion_r1251167929


##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/callback/TestThrowExceptionCallback.java:
##
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.callback;
+
+import org.apache.hudi.client.BaseHoodieClient;
+import org.apache.hudi.exception.HoodieIOException;
+
+/**
+ * A test {@link HoodieClientInitCallback} implementation to throw an 
exception.
+ */
+public class TestThrowExceptionCallback implements HoodieClientInitCallback {

Review Comment:
   I tried inner class but it does not work well with the class loader.  So I 
rather put them in independent classes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #9108: [HUDI-6462] Add Hudi client init callback interface

2023-07-03 Thread via GitHub


yihua commented on code in PR #9108:
URL: https://github.com/apache/hudi/pull/9108#discussion_r1251167572


##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/callback/TestChangeConfigInitCallback.java:
##
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.callback;
+
+import org.apache.hudi.client.BaseHoodieClient;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import org.apache.avro.Schema;
+
+import static org.apache.hudi.config.HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE;
+
+/**
+ * A test {@link HoodieClientInitCallback} implementation to add the property
+ * `user.defined.key2=value2` to the write schema.
+ */
+public class TestChangeConfigInitCallback implements HoodieClientInitCallback {

Review Comment:
   I renamed the classes with `TestClass` suffix to differentiated them from 
other classes containing the actual tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant

2023-07-03 Thread via GitHub


hudi-bot commented on PR #9038:
URL: https://github.com/apache/hudi/pull/9038#issuecomment-1618936154

   
   ## CI report:
   
   * a65a29c0cf1c8feb9f39e168ba80c99ebcae1c5d UNKNOWN
   * 33e1774f08f40abfb216f0e8f8894f6000b2ee3e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18268)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8978: [HUDI-6315] Optimize DELETE codepath to use meta fields instead of key generation and index lookup

2023-07-03 Thread via GitHub


hudi-bot commented on PR #8978:
URL: https://github.com/apache/hudi/pull/8978#issuecomment-1618935992

   
   ## CI report:
   
   * 85d6a980287b105a661025ed5aa45da319ad52a1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18213)
 
   * 7142fb46cef77575be4346fa0a9cf2fb7bee03b1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18272)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6471) Global index not fully supported for spark sql merge into

2023-07-03 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-6471:
--
Description: 
With the changes to merge into introduced in HUDI-6464, the config 
hoodie.simple.index.update.partition.path will not work

index.global.enabled also will have issues

  was:
With the changes to merge into introduced in HUDI-6464, the config 
hoodie.simple.index.update.partition.path will not work


> Global index not fully supported for spark sql merge into
> -
>
> Key: HUDI-6471
> URL: https://issues.apache.org/jira/browse/HUDI-6471
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Jonathan Vexler
>Priority: Major
>
> With the changes to merge into introduced in HUDI-6464, the config 
> hoodie.simple.index.update.partition.path will not work
> index.global.enabled also will have issues



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] jonvex commented on a diff in pull request #9083: [HUDI-6464] Spark SQL Merge Into for pkless tables

2023-07-03 Thread via GitHub


jonvex commented on code in PR #9083:
URL: https://github.com/apache/hudi/pull/9083#discussion_r1251149304


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieInternalProxyIndex.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.index;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.data.HoodieData;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieIndexException;
+import org.apache.hudi.table.HoodieTable;
+
+public class HoodieInternalProxyIndex extends HoodieIndex<Object, Object> {
+
+  /**
+   * Index that does not do tagging. It is meant for the Spark SQL MERGE INTO
+   * command, which does not need an index lookup because the record location
+   * is obtained from the meta columns produced by the join.
+   */
+  public HoodieInternalProxyIndex(HoodieWriteConfig config) {
+    super(config);
+  }
+
+  @Override
+  public <R> HoodieData<HoodieRecord<R>> 
tagLocation(HoodieData<HoodieRecord<R>> records, HoodieEngineContext context, 
HoodieTable hoodieTable) throws HoodieIndexException {
+    return records;
+  }
+
+  @Override
+  public HoodieData<WriteStatus> updateLocation(HoodieData<WriteStatus> 
writeStatuses, HoodieEngineContext context, HoodieTable hoodieTable) throws 
HoodieIndexException {
+    return writeStatuses;
+  }
+
+  @Override
+  public boolean rollbackCommit(String instantTime) {
+    return false;
+  }
+
+  @Override
+  public boolean isGlobal() {

Review Comment:
   https://issues.apache.org/jira/browse/HUDI-6471



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6471) Global index not fully supported for spark sql merge into

2023-07-03 Thread Jonathan Vexler (Jira)
Jonathan Vexler created HUDI-6471:
-

 Summary: Global index not fully supported for spark sql merge into
 Key: HUDI-6471
 URL: https://issues.apache.org/jira/browse/HUDI-6471
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark-sql
Reporter: Jonathan Vexler


With the changes to merge into introduced in HUDI-6464, the config 
hoodie.simple.index.update.partition.path will not work



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #8978: [HUDI-6315] Optimize DELETE codepath to use meta fields instead of key generation and index lookup

2023-07-03 Thread via GitHub


hudi-bot commented on PR #8978:
URL: https://github.com/apache/hudi/pull/8978#issuecomment-1618926995

   
   ## CI report:
   
   * 85d6a980287b105a661025ed5aa45da319ad52a1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18213)
 
   * 7142fb46cef77575be4346fa0a9cf2fb7bee03b1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] ad1happy2go commented on issue #7600: Hoodie clean is not deleting old files for MOR table

2023-07-03 Thread via GitHub


ad1happy2go commented on issue #7600:
URL: https://github.com/apache/hudi/issues/7600#issuecomment-1618920425

   @umehrot2 @koochiswathiTR Were you able to get it resolved with those 
configs? Please let us know if you need any other help on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #9114: [HUDI-6467] Fix deletes handling in rli when partition path is updated

2023-07-03 Thread via GitHub


nsivabalan commented on code in PR #9114:
URL: https://github.com/apache/hudi/pull/9114#discussion_r1251129461


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1159,10 +1161,30 @@ protected boolean 
validateTimelineBeforeSchedulingCompaction(Option<String> inFl
* @param writeStatuses {@code WriteStatus} from the write operation
*/
   private HoodieData<HoodieRecord> 
getRecordIndexUpdates(HoodieData<WriteStatus> writeStatuses) {
-    return writeStatuses.flatMap(writeStatus -> {
-      List<HoodieRecord> recordList = new LinkedList<>();
-      for (HoodieRecordDelegate recordDelegate : 
writeStatus.getWrittenRecordDelegates()) {
-        if (!writeStatus.isErrored(recordDelegate.getHoodieKey())) {
+    // 1. List<Pair<writeStatus, recordDelegate>>
+    // 2. Reduce by key: accept keys only when new location is set
+    return writeStatuses.map(writeStatus -> 
writeStatus.getWrittenRecordDelegates().stream().map(recordDelegate -> 
Pair.of(writeStatus, recordDelegate)))
+        .flatMapToPair(Stream::iterator)

Review Comment:
   Let's do something like this. Might be simpler:
   
   HD<WriteStatus>
   
   HD<WriteStatus> map to HpairD<recordKey, recordDelegate>
   
   reduceByKey(combine) -> record (only one record for one record key)
   
   .values() should give us the final HD<recordDelegate>
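
   Read literally, that suggestion could look roughly like the sketch below 
(assumptions: the HoodieData/HoodiePairData operations shown in the diff 
above, a `parallelism` value per the other review thread, and the 
errored-record filter omitted for brevity):

       HoodieData<HoodieRecordDelegate> deduped = writeStatuses
           // flatten every write status into its written record delegates
           .flatMap(writeStatus -> writeStatus.getWrittenRecordDelegates().iterator())
           // key each delegate by its record key
           .mapToPair(recordDelegate -> Pair.of(recordDelegate.getRecordKey(), recordDelegate))
           // keep the delegate that carries the new location
           .reduceByKey((d1, d2) -> d1.getNewLocation().isPresent() ? d1 : d2, parallelism)
           // one delegate per record key
           .values();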
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


