svn commit: r69529 - in /release/hudi/0.15.0: ./ hudi-0.15.0.src.tgz hudi-0.15.0.src.tgz.asc hudi-0.15.0.src.tgz.sha512

2024-06-03 Thread yihua
Author: yihua
Date: Tue Jun  4 06:13:26 2024
New Revision: 69529

Log:
Add Apache Hudi 0.15.0 source release

Added:
release/hudi/0.15.0/
release/hudi/0.15.0/hudi-0.15.0.src.tgz   (with props)
release/hudi/0.15.0/hudi-0.15.0.src.tgz.asc
release/hudi/0.15.0/hudi-0.15.0.src.tgz.sha512

Added: release/hudi/0.15.0/hudi-0.15.0.src.tgz
==============================================================================
Binary file - no diff available.

Propchange: release/hudi/0.15.0/hudi-0.15.0.src.tgz
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream

Added: release/hudi/0.15.0/hudi-0.15.0.src.tgz.asc
==============================================================================
--- release/hudi/0.15.0/hudi-0.15.0.src.tgz.asc (added)
+++ release/hudi/0.15.0/hudi-0.15.0.src.tgz.asc Tue Jun  4 06:13:26 2024
@@ -0,0 +1,16 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQIzBAABCAAdFiEEiIqTQeYA64VQqs1e+xt1BPf3cMkFAmZerO0ACgkQ+xt1BPf3
+cMkCwQ/+KPVteMK7Q7gH5oCzOrGeafKkm9i4IySfSg+BkjyYwncTgyCnoOn+bQLm
+nQIiIPFpF+ROcnaD+iQqd+VPeuX/V5JS3MzeZddw/k1MueXuTGuqd2q84LNp3LYe
+hXmCmUCG6nQHWulcgRVIKrrfdBMUeZDTL3WR7JgirNJovOkrdP9k3prl5jeHes2v
+fEu1WwTq6NajPYtRM8csRMPVWIO5oDW9oJTF45OGQvjZTc0pb8AgudP6f7CRcutW
+QQz/EX4AFISe0azaHw7NHLJoR75h4Iz+Onzo520d5fKlowDKVEXVyYLgY9ThEFks
+GboqpO2LQDiGyzdwVM6KAfQsVOwNWJ+4VgItlWHlfe4NE/wZr61OfdU2fIonFGfu
+SN1Z3wKyJC5SmAeRsRRm9L791CbGab5D4ZYI+r2MCO0kvzOKG8yl+bPjCt94Qc62
+1TrnsaVz5k9CmIl3A8dxtiCtG/g/W/68qliKgXX8TivMJ8Gr2LFJCEygu9JplxUl
+R0sb6+4Bmftyu8NHF6j4LWcL6Ae3ySQf0oN8q3laekMjf4rrcqoGKzH/A6GAAdtO
+D17JDreky3ARU6aksbFTzoKM6nwKQTsva3gD6xjCmcaIMfoOTUa7QdCQgNQC2Afa
++EiQvGq7touqwlfxUwKLfyx0BMD1DdjPZ07a3oJln2odf5kI4TQ=
+=W9Sg
+-----END PGP SIGNATURE-----

Added: release/hudi/0.15.0/hudi-0.15.0.src.tgz.sha512
==============================================================================
--- release/hudi/0.15.0/hudi-0.15.0.src.tgz.sha512 (added)
+++ release/hudi/0.15.0/hudi-0.15.0.src.tgz.sha512 Tue Jun  4 06:13:26 2024
@@ -0,0 +1 @@
+5e8627c69b9c13c6e0b3849ca829d0a941ff55fd7fbc2697bc98bcde155ff28642ad6fc7f9f234e262acf13069667ba71f74437433edb59e7904bc4a5086f6bf  hudi-0.15.0.src.tgz
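The `.sha512` file above pairs the artifact with its SHA-512 digest in the usual `<hex digest>  <filename>` layout. A minimal sketch of how such a checksum file can be verified, using a stand-in payload rather than the real tarball (which is not reproduced here):

```python
import hashlib
from pathlib import Path

# Stand-in artifact: dummy bytes in place of the real hudi-0.15.0.src.tgz.
artifact = Path("hudi-0.15.0.src.tgz")
artifact.write_bytes(b"dummy payload")

# Write a checksum file in the same "<hex digest>  <filename>" layout.
digest = hashlib.sha512(artifact.read_bytes()).hexdigest()
Path("hudi-0.15.0.src.tgz.sha512").write_text(f"{digest}  {artifact.name}\n")

def verify(checksum_file: str) -> bool:
    """Recompute the digest of the named file and compare to the recorded one."""
    expected, name = Path(checksum_file).read_text().split()
    actual = hashlib.sha512(Path(name).read_bytes()).hexdigest()
    return actual == expected

print(verify("hudi-0.15.0.src.tgz.sha512"))  # → True
```

A real release check would additionally verify the detached PGP signature (the `.asc` file) against the release manager's public key.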




svn commit: r69528 - in /dev/hudi/hudi-0.15.0: ./ hudi-0.15.0.src.tgz hudi-0.15.0.src.tgz.asc hudi-0.15.0.src.tgz.sha512

2024-06-03 Thread yihua
Author: yihua
Date: Tue Jun  4 06:00:37 2024
New Revision: 69528

Log:
Add Apache Hudi 0.15.0 source release

Added:
dev/hudi/hudi-0.15.0/
dev/hudi/hudi-0.15.0/hudi-0.15.0.src.tgz   (with props)
dev/hudi/hudi-0.15.0/hudi-0.15.0.src.tgz.asc
dev/hudi/hudi-0.15.0/hudi-0.15.0.src.tgz.sha512

Added: dev/hudi/hudi-0.15.0/hudi-0.15.0.src.tgz
==============================================================================
Binary file - no diff available.

Propchange: dev/hudi/hudi-0.15.0/hudi-0.15.0.src.tgz
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream

Added: dev/hudi/hudi-0.15.0/hudi-0.15.0.src.tgz.asc
==============================================================================
--- dev/hudi/hudi-0.15.0/hudi-0.15.0.src.tgz.asc (added)
+++ dev/hudi/hudi-0.15.0/hudi-0.15.0.src.tgz.asc Tue Jun  4 06:00:37 2024
@@ -0,0 +1,16 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQIzBAABCAAdFiEEiIqTQeYA64VQqs1e+xt1BPf3cMkFAmZerO0ACgkQ+xt1BPf3
+cMkCwQ/+KPVteMK7Q7gH5oCzOrGeafKkm9i4IySfSg+BkjyYwncTgyCnoOn+bQLm
+nQIiIPFpF+ROcnaD+iQqd+VPeuX/V5JS3MzeZddw/k1MueXuTGuqd2q84LNp3LYe
+hXmCmUCG6nQHWulcgRVIKrrfdBMUeZDTL3WR7JgirNJovOkrdP9k3prl5jeHes2v
+fEu1WwTq6NajPYtRM8csRMPVWIO5oDW9oJTF45OGQvjZTc0pb8AgudP6f7CRcutW
+QQz/EX4AFISe0azaHw7NHLJoR75h4Iz+Onzo520d5fKlowDKVEXVyYLgY9ThEFks
+GboqpO2LQDiGyzdwVM6KAfQsVOwNWJ+4VgItlWHlfe4NE/wZr61OfdU2fIonFGfu
+SN1Z3wKyJC5SmAeRsRRm9L791CbGab5D4ZYI+r2MCO0kvzOKG8yl+bPjCt94Qc62
+1TrnsaVz5k9CmIl3A8dxtiCtG/g/W/68qliKgXX8TivMJ8Gr2LFJCEygu9JplxUl
+R0sb6+4Bmftyu8NHF6j4LWcL6Ae3ySQf0oN8q3laekMjf4rrcqoGKzH/A6GAAdtO
+D17JDreky3ARU6aksbFTzoKM6nwKQTsva3gD6xjCmcaIMfoOTUa7QdCQgNQC2Afa
++EiQvGq7touqwlfxUwKLfyx0BMD1DdjPZ07a3oJln2odf5kI4TQ=
+=W9Sg
+-----END PGP SIGNATURE-----

Added: dev/hudi/hudi-0.15.0/hudi-0.15.0.src.tgz.sha512
==============================================================================
--- dev/hudi/hudi-0.15.0/hudi-0.15.0.src.tgz.sha512 (added)
+++ dev/hudi/hudi-0.15.0/hudi-0.15.0.src.tgz.sha512 Tue Jun  4 06:00:37 2024
@@ -0,0 +1 @@
+5e8627c69b9c13c6e0b3849ca829d0a941ff55fd7fbc2697bc98bcde155ff28642ad6fc7f9f234e262acf13069667ba71f74437433edb59e7904bc4a5086f6bf  hudi-0.15.0.src.tgz




(hudi) annotated tag release-0.15.0 updated (38832854be3 -> 3b2205a3e49)

2024-06-03 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to annotated tag release-0.15.0
in repository https://gitbox.apache.org/repos/asf/hudi.git


*** WARNING: tag release-0.15.0 was modified! ***

from 38832854be3 (commit)
  to 3b2205a3e49 (tag)
 tagging 38832854be37cb78ad1edd87f515f01ca5ea6a8a (commit)
 replaces release-0.15.0-rc3
  by Y Ethan Guo
  on Mon Jun 3 22:54:32 2024 -0700

- Log -----------------------------------------------------------------
0.15.0
-----BEGIN PGP SIGNATURE-----

iQIzBAABCAAdFiEEDE0xZCfsqnGiCtlma+HUVMkPXqUFAmZerBgACgkQa+HUVMkP
XqW2kxAAlAQJ8w7WbKJCy4c/YY+lPOc2jX0b/Q4+lMoIV/aQJu8XDFtBNEin7GBE
b2g4iLEag0SDAu3dzpR5YqmPCrGPGfkP4ZHOeYWsuxXiHn/UKQGLZR3hBvjZQXSE
fpe2C0B/h7/U6u4In31cqAL4N9DNXJcQt+780R+SUJbRWbyqRZfU3ddHIOkOZNJg
lZ1UrJ7rFlF/VNUWpb6BDIHPm8+7p0jygt0YJKOtefL55tJSA2PNy/FhOAt6Fs2A
FXRTLQn7lbNb9DAX6xpu+wdgt/KGW1RzvPy4CnlfSC/3h4NZ7SDDOt3/nG7Sz0M+
5slho5iolnsMNGkuRebGH/V/zOPKhI7bLLrpAKtxrPtRyoOj+io0queqVR0fkuQZ
q315iUAGWRIzEZ9TbvR23dm2zitlUSgP0+2dETs7oVZ6c5jL1ojyZ1MlKcWMZS86
0xrv0vLNdnCHr0u+rCkhiaz/FqWxmnl6sQIRpicrHpGnZpYyNFauJ1TLS9OjOud7
86sMEK6atM8emXl+iYJfcEyJXpDHR0dncJrpDEk1XIDh/Bg2aZL1yvwbTFrDVp/x
PDrOC5JufvfY05XZHMU2HTaIquVfBlJ0GdsWHL45TJ5HtZ4hBPVL1rknjNjgkx4N
nF2b3zQuCzMr/+7D6VbDNIlWsAg8wvmY8bjKo4XtTEldnNUMR30=
=E0hr
-----END PGP SIGNATURE-----
-----------------------------------------------------------------------


No new revisions were added by this update.

Summary of changes:



(hudi) branch release-0.15.0 updated: [MINOR] Update release version to reflect published version 0.15.0

2024-06-03 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch release-0.15.0
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/release-0.15.0 by this push:
 new 38832854be3 [MINOR] Update release version to reflect published 
version 0.15.0
38832854be3 is described below

commit 38832854be37cb78ad1edd87f515f01ca5ea6a8a
Author: Y Ethan Guo 
AuthorDate: Mon Jun 3 22:49:24 2024 -0700

[MINOR] Update release version to reflect published version 0.15.0
---
 docker/hoodie/hadoop/base/pom.xml| 2 +-
 docker/hoodie/hadoop/base_java11/pom.xml | 2 +-
 docker/hoodie/hadoop/datanode/pom.xml| 2 +-
 docker/hoodie/hadoop/historyserver/pom.xml   | 2 +-
 docker/hoodie/hadoop/hive_base/pom.xml   | 2 +-
 docker/hoodie/hadoop/namenode/pom.xml| 2 +-
 docker/hoodie/hadoop/pom.xml | 2 +-
 docker/hoodie/hadoop/prestobase/pom.xml  | 2 +-
 docker/hoodie/hadoop/spark_base/pom.xml  | 2 +-
 docker/hoodie/hadoop/sparkadhoc/pom.xml  | 2 +-
 docker/hoodie/hadoop/sparkmaster/pom.xml | 2 +-
 docker/hoodie/hadoop/sparkworker/pom.xml | 2 +-
 docker/hoodie/hadoop/trinobase/pom.xml   | 2 +-
 docker/hoodie/hadoop/trinocoordinator/pom.xml| 2 +-
 docker/hoodie/hadoop/trinoworker/pom.xml | 2 +-
 hudi-aws/pom.xml | 4 ++--
 hudi-cli/pom.xml | 2 +-
 hudi-client/hudi-client-common/pom.xml   | 4 ++--
 hudi-client/hudi-flink-client/pom.xml| 4 ++--
 hudi-client/hudi-java-client/pom.xml | 4 ++--
 hudi-client/hudi-spark-client/pom.xml| 4 ++--
 hudi-client/pom.xml  | 2 +-
 hudi-common/pom.xml  | 2 +-
 hudi-examples/hudi-examples-common/pom.xml   | 2 +-
 hudi-examples/hudi-examples-flink/pom.xml| 2 +-
 hudi-examples/hudi-examples-java/pom.xml | 2 +-
 hudi-examples/hudi-examples-spark/pom.xml| 2 +-
 hudi-examples/pom.xml| 2 +-
 hudi-flink-datasource/hudi-flink/pom.xml | 4 ++--
 hudi-flink-datasource/hudi-flink1.14.x/pom.xml   | 4 ++--
 hudi-flink-datasource/hudi-flink1.15.x/pom.xml   | 4 ++--
 hudi-flink-datasource/hudi-flink1.16.x/pom.xml   | 4 ++--
 hudi-flink-datasource/hudi-flink1.17.x/pom.xml   | 4 ++--
 hudi-flink-datasource/hudi-flink1.18.x/pom.xml   | 4 ++--
 hudi-flink-datasource/pom.xml| 4 ++--
 hudi-gcp/pom.xml | 2 +-
 hudi-hadoop-common/pom.xml   | 2 +-
 hudi-hadoop-mr/pom.xml   | 2 +-
 hudi-integ-test/pom.xml  | 2 +-
 hudi-io/pom.xml  | 2 +-
 hudi-kafka-connect/pom.xml   | 4 ++--
 hudi-platform-service/hudi-metaserver/hudi-metaserver-client/pom.xml | 2 +-
 hudi-platform-service/hudi-metaserver/hudi-metaserver-server/pom.xml | 2 +-
 hudi-platform-service/hudi-metaserver/pom.xml| 4 ++--
 hudi-platform-service/pom.xml| 2 +-
 hudi-spark-datasource/hudi-spark-common/pom.xml  | 4 ++--
 hudi-spark-datasource/hudi-spark/pom.xml | 4 ++--
 hudi-spark-datasource/hudi-spark2-common/pom.xml | 2 +-
 hudi-spark-datasource/hudi-spark2/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3-common/pom.xml | 2 +-
 hudi-spark-datasource/hudi-spark3.0.x/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3.1.x/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3.2.x/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3.2plus-common/pom.xml   | 2 +-
 hudi-spark-datasource/hudi-spark3.3.x/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3.4.x/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3.5.x/pom.xml| 4 ++--
 

Re: [PR] [MINOR] Correct order of test services start in `UtilitiesTestBase` [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11387:
URL: https://github.com/apache/hudi/pull/11387#issuecomment-2146639225

   
   ## CI report:
   
   * 5c44161ef57a474b702de93418e0d601da490897 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24211)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7819] Fix OptionsResolver#allowCommitOnEmptyBatch default value bug [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11370:
URL: https://github.com/apache/hudi/pull/11370#issuecomment-2146639091

   
   ## CI report:
   
   * deba838f4432bea1bf8b5ca914cfebd272821f24 UNKNOWN
   * e1c37e6afc4869b7af4f5746ef5baf0512fba58f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24209)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Correct order of test services start in `UtilitiesTestBase` [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11387:
URL: https://github.com/apache/hudi/pull/11387#issuecomment-2146631915

   
   ## CI report:
   
   * 5c44161ef57a474b702de93418e0d601da490897 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]Trying to find org.apache.hudi.com.google.common.base.Preconditions when using ZookeeperBasedLockProvider [hudi]

2024-06-03 Thread via GitHub


Gatsby-Lee commented on issue #8723:
URL: https://github.com/apache/hudi/issues/8723#issuecomment-2146629737

   I have the same issue with the Hudi Jar in the Amazon EMR Image 7.1.0.
   hmmm


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7819] Fix OptionsResolver#allowCommitOnEmptyBatch default value bug [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11370:
URL: https://github.com/apache/hudi/pull/11370#issuecomment-2146624701

   
   ## CI report:
   
   * a7f8320f47046e457868ced0c56e98f2f35001e4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24187)
 
   * deba838f4432bea1bf8b5ca914cfebd272821f24 UNKNOWN
   * e1c37e6afc4869b7af4f5746ef5baf0512fba58f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24209)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [MINOR] Correct order of test services start in `UtilitiesTestBase` [hudi]

2024-06-03 Thread via GitHub


geserdugarov opened a new pull request, #11387:
URL: https://github.com/apache/hudi/pull/11387

   For now, all uses of `initTestServices()` pass `needsHdfs=false`, 
`needsHive=false`, and `needsZookeeper=false`, except 
`HoodieDeltaStreamerTestBase`, which uses `needsHive=true`, so 
`initTestServices()` currently causes no problems. But if we switch 
`needsZookeeper` to `true`, we will hit Hive-to-Zookeeper connection errors in 
`HoodieDeltaStreamerTestBase`.
   The correct order to start the services is:
   1. Zookeeper
   2. HDFS
   3. Hive
   
   Also fixed the ordering in `cleanUpUtilitiesTestServices()`, which now runs 
in reverse of the corresponding initialization.
   
   ### Impact
   
   No impact
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   No need
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [x] CI passed
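The start/stop discipline in this PR (start dependencies first, tear down in reverse) can be sketched generically. This is only an illustration of the ordering idea, not the actual `UtilitiesTestBase` API; the service names simply mirror the ones the PR mentions:

```python
from contextlib import ExitStack

events = []

class Service:
    """Toy stand-in for the Zookeeper/HDFS/Hive test services."""
    def __init__(self, name):
        self.name = name

    def start(self):
        events.append(f"start {self.name}")
        return self

    def stop(self):
        events.append(f"stop {self.name}")

def run_with_services():
    # Start in dependency order; ExitStack pops its callbacks LIFO,
    # so teardown automatically runs in reverse start order.
    with ExitStack() as stack:
        for name in ("zookeeper", "hdfs", "hive"):
            svc = Service(name).start()
            stack.callback(svc.stop)
        events.append("run tests")

run_with_services()
print(events)
# → ['start zookeeper', 'start hdfs', 'start hive', 'run tests',
#    'stop hive', 'stop hdfs', 'stop zookeeper']
```

Reversing teardown relative to initialization is what keeps Hive from losing its Zookeeper connection mid-shutdown in the scenario the PR describes.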
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Hive Sync tool fails to sync Hoodi table written using Flink 1.16 to HMS [hudi]

2024-06-03 Thread via GitHub


alberttwong commented on issue #8848:
URL: https://github.com/apache/hudi/issues/8848#issuecomment-2146593572

   @danny0405 I'm documenting my process at 
https://github.com/apache/incubator-xtable/discussions/457


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7819] Fix OptionsResolver#allowCommitOnEmptyBatch default value bug [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11370:
URL: https://github.com/apache/hudi/pull/11370#issuecomment-2146580024

   
   ## CI report:
   
   * a7f8320f47046e457868ced0c56e98f2f35001e4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24187)
 
   * deba838f4432bea1bf8b5ca914cfebd272821f24 UNKNOWN
   * e1c37e6afc4869b7af4f5746ef5baf0512fba58f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7819] Fix OptionsResolver#allowCommitOnEmptyBatch default value bug [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11370:
URL: https://github.com/apache/hudi/pull/11370#issuecomment-2146572269

   
   ## CI report:
   
   * a7f8320f47046e457868ced0c56e98f2f35001e4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24187)
 
   * deba838f4432bea1bf8b5ca914cfebd272821f24 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-7824) Fix incremental partitions fetch logic when savepoint is removed for Incr cleaner

2024-06-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-7824.
-----------------------------
Fix Version/s: 1.0.0
   0.16.0
   Resolution: Fixed

> Fix incremental partitions fetch logic when savepoint is removed for Incr 
> cleaner
> --------------------------------------------------------------------------
>
> Key: HUDI-7824
> URL: https://issues.apache.org/jira/browse/HUDI-7824
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.16.0
>
>
> With the incremental cleaner, if a savepoint is blocking the clean-up of a 
> commit and the cleaner has moved ahead w.r.t. the earliest commit to retain, 
> then when the savepoint is removed later, the cleaner should account for 
> cleaning up the commit of interest.
>  
> Let's ensure the clean planner accounts for all partitions when such a 
> savepoint removal is detected.
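A toy model of the behavior this ticket asks for (not the actual `CleanPlanner` API): a savepointed commit blocks cleaning of its partitions even after the cleaner's retention boundary moves past it, and those partitions must be re-included once the savepoint is removed.

```python
def partitions_to_clean(commits, earliest_to_retain, savepoints):
    """commits: mapping of commit timestamp -> set of partitions it touched.

    Returns partitions eligible for cleaning: those touched by commits older
    than earliest_to_retain and not protected by a savepoint."""
    eligible = set()
    for ts, parts in commits.items():
        if ts < earliest_to_retain and ts not in savepoints:
            eligible |= parts
    return eligible

commits = {1: {"p1"}, 2: {"p2"}, 3: {"p3"}}

# A savepoint on commit 2 blocks its partitions even though the cleaner
# has moved ahead (earliest commit to retain is 3).
print(sorted(partitions_to_clean(commits, 3, {2})))    # → ['p1']

# Once the savepoint is removed, the previously blocked commit's
# partitions must be accounted for again.
print(sorted(partitions_to_clean(commits, 3, set())))  # → ['p1', 'p2']
```

The bug class the ticket targets is an incremental planner that only looks at partitions touched since the last clean, and so misses `p2` in the second call above.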



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch master updated (d0c7de050a8 -> ffd4f52b9ab)

2024-06-03 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from d0c7de050a8 [HUDI-7822] Bump io.airlift:aircompressor from 0.25 to 
0.27 (#11380)
 add ffd4f52b9ab [HUDI-7824] Fixing incr cleaner with savepoint removal 
(#11375)

No new revisions were added by this update.

Summary of changes:
 .../hudi/table/action/clean/CleanPlanner.java  | 58 +-
 .../apache/hudi/table/action/TestCleanPlanner.java | 55 +++-
 2 files changed, 55 insertions(+), 58 deletions(-)



Re: [PR] [HUDI-7824] Fixing incr cleaner with savepoint removal [hudi]

2024-06-03 Thread via GitHub


codope merged PR #11375:
URL: https://github.com/apache/hudi/pull/11375


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]

2024-06-03 Thread via GitHub


wombatu-kun commented on PR #11385:
URL: https://github.com/apache/hudi/pull/11385#issuecomment-2146538109

   > Let me know if you prefer to address the `toString()` calls in this PR. 
Also, could you raise another PR against `branch-0.x` with the same changes?
   
   OK, I'll do it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-7782) Task not serializable due to DynamoDBBasedLockProvider and HiveMetastoreBasedLockProvider in clean action

2024-06-03 Thread Vova Kolmakov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vova Kolmakov reassigned HUDI-7782:
-----------------------------------

Assignee: Vova Kolmakov

> Task not serializable due to DynamoDBBasedLockProvider and 
> HiveMetastoreBasedLockProvider in clean action
> --------------------------------------------------------------------------
>
> Key: HUDI-7782
> URL: https://issues.apache.org/jira/browse/HUDI-7782
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: hector
>Assignee: Vova Kolmakov
>Priority: Major
>
> Caused by: java.io.NotSerializableException: 
> org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider
> Serialization stack:
>  - object not serializable (class: 
> org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider, value: 
> org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider@1355d2ca)
>  
> like HUDI-3638, only fixed the issue of ZookeeperbasedLockProvider.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]

2024-06-03 Thread via GitHub


yihua commented on code in PR #11385:
URL: https://github.com/apache/hudi/pull/11385#discussion_r1625298506


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/index/ScheduleIndexActionExecutor.java:
##
@@ -150,8 +150,8 @@ private void validateBeforeScheduling() {
   private void abort(HoodieInstant indexInstant) {
 // delete metadata partition
 partitionIndexTypes.forEach(partitionType -> {
-  if (metadataPartitionExists(table.getMetaClient().getBasePath(), 
context, partitionType.getPartitionPath())) {
-deleteMetadataPartition(table.getMetaClient().getBasePath(), context, 
partitionType.getPartitionPath());
+  if 
(metadataPartitionExists(table.getMetaClient().getBasePath().toString(), 
context, partitionType.getPartitionPath())) {

Review Comment:
   It would be great to get rid of most of the `toString()` calls.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]

2024-06-03 Thread via GitHub


yihua commented on code in PR #11385:
URL: https://github.com/apache/hudi/pull/11385#discussion_r1625295987


##
hudi-cli/src/main/java/org/apache/hudi/cli/commands/FileSystemViewCommand.java:
##
@@ -239,9 +239,9 @@ private HoodieTableFileSystemView 
buildFileSystemView(String globRegex, String m
 HoodieTableMetaClient client = HoodieCLI.getTableMetaClient();
 HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
 .setConf(client.getStorageConf().newInstance())
-
.setBasePath(client.getBasePath()).setLoadActiveTimelineOnLoad(true).build();
+
.setBasePath(client.getBasePath().toString()).setLoadActiveTimelineOnLoad(true).build();

Review Comment:
   nit: can some of the `toString()` calls be avoided by directly passing the 
`StoragePath` instance?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7704] Unify test client storage classes with duplicate code [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11152:
URL: https://github.com/apache/hudi/pull/11152#issuecomment-2146485218

   
   ## CI report:
   
   * dd052193e61243e0f2228fe8993851ac066dbdda Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24208)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-7827) Bump io.airlift:aircompressor from 0.25 to 0.27

2024-06-03 Thread Vova Kolmakov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vova Kolmakov reassigned HUDI-7827:
-----------------------------------

Assignee: (was: Vova Kolmakov)

> Bump io.airlift:aircompressor from 0.25 to 0.27
> -----------------------------------------------
>
> Key: HUDI-7827
> URL: https://issues.apache.org/jira/browse/HUDI-7827
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7827) Bump io.airlift:aircompressor from 0.25 to 0.27

2024-06-03 Thread Vova Kolmakov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vova Kolmakov reassigned HUDI-7827:
-----------------------------------

Assignee: Vova Kolmakov

> Bump io.airlift:aircompressor from 0.25 to 0.27
> -----------------------------------------------
>
> Key: HUDI-7827
> URL: https://issues.apache.org/jira/browse/HUDI-7827
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Vova Kolmakov
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11385:
URL: https://github.com/apache/hudi/pull/11385#issuecomment-2146435632

   
   ## CI report:
   
   * 064b5310f709e5886dd7e278d1ebf9cdcfbe70c7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24206)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7824] Fixing incr cleaner with savepoint removal [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11375:
URL: https://github.com/apache/hudi/pull/11375#issuecomment-2146435535

   
   ## CI report:
   
   * f9f468cff5a5cb05c822e3dc0c349b60217fb208 UNKNOWN
   * 8475945ae37ea4e76e388a89f1c0c908bc943508 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24207)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7825]Support Report pending clustering and compaction plan metric [hudi]

2024-06-03 Thread via GitHub


danny0405 commented on code in PR #11377:
URL: https://github.com/apache/hudi/pull/11377#discussion_r1625240199


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java:
##
@@ -272,6 +275,28 @@ public void notifyCheckpointComplete(long checkpointId) {
 );
   }
 
+  private void emitCompactionAndClusteringMetrics(Configuration conf,
+  HoodieTableMetaClient metaClient, HoodieFlinkWriteClient writeClient) {
+if (conf.getBoolean(FlinkOptions.CLUSTERING_SCHEDULE_ENABLED)
+&& !conf.getBoolean(FlinkOptions.CLUSTERING_ASYNC_ENABLED)) {
+  HoodieTimeline pendingReplaceTimeline = metaClient.getActiveTimeline()
+  .filterPendingReplaceTimeline();
+  HoodieMetrics metrics = writeClient.getMetrics();
+  if (metrics != null) {
+
metrics.setPendingClusteringCount(pendingReplaceTimeline.countInstants());
+  }
+}
+if (conf.getBoolean(FlinkOptions.COMPACTION_SCHEDULE_ENABLED)

Review Comment:
   Yeah, you are right, there are two sets of metrics for Flink now; we might 
need to unify them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7704] Unify test client storage classes with duplicate code [hudi]

2024-06-03 Thread via GitHub


the-other-tim-brown commented on code in PR #11152:
URL: https://github.com/apache/hudi/pull/11152#discussion_r1625236004


##
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/utils/HoodieWriterClientTestHarness.java:
##
@@ -165,71 +247,1183 @@ public HoodieWriteConfig.Builder 
getConfigBuilder(String schemaStr, HoodieIndex.
 return builder;
   }
 
-  public void assertPartitionMetadataForRecords(String basePath, 
List inputRecords,
-HoodieStorage storage) throws 
IOException {
-Set partitionPathSet = inputRecords.stream()
-.map(HoodieRecord::getPartitionPath)
-.collect(Collectors.toSet());
-assertPartitionMetadata(basePath, 
partitionPathSet.stream().toArray(String[]::new), storage);
+  // Functional Interfaces for passing lambda and Hoodie Write API contexts
+
+  @FunctionalInterface
+  public interface Function2 {

Review Comment:
   I'm fine with keeping it as is. I didn't realize the difference.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7824] Fixing incr cleaner with savepoint removal [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11375:
URL: https://github.com/apache/hudi/pull/11375#issuecomment-2146392324

   
   ## CI report:
   
   * f9f468cff5a5cb05c822e3dc0c349b60217fb208 UNKNOWN
   * eb047ef0d7f79002d87338b776e03923de161dee Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24185)
 
   * 8475945ae37ea4e76e388a89f1c0c908bc943508 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24207)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7704] Unify test client storage classes with duplicate code [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11152:
URL: https://github.com/apache/hudi/pull/11152#issuecomment-2146391979

   
   ## CI report:
   
   * 9946df9dd0aca2e4e8613b36265462d76397c8d8 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24194)
 
   * dd052193e61243e0f2228fe8993851ac066dbdda Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24208)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7824] Fixing incr cleaner with savepoint removal [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11375:
URL: https://github.com/apache/hudi/pull/11375#issuecomment-2146386048

   
   ## CI report:
   
   * f9f468cff5a5cb05c822e3dc0c349b60217fb208 UNKNOWN
   * eb047ef0d7f79002d87338b776e03923de161dee Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24185)
 
   * 8475945ae37ea4e76e388a89f1c0c908bc943508 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7704] Unify test client storage classes with duplicate code [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11152:
URL: https://github.com/apache/hudi/pull/11152#issuecomment-2146385756

   
   ## CI report:
   
   * 9946df9dd0aca2e4e8613b36265462d76397c8d8 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24194)
 
   * dd052193e61243e0f2228fe8993851ac066dbdda UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-5956] fix spark DAG ui when write [hudi]

2024-06-03 Thread via GitHub


danny0405 commented on code in PR #11376:
URL: https://github.com/apache/hudi/pull/11376#discussion_r1625210183


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -174,32 +174,43 @@ class HoodieSparkSqlWriterInternal {
 sourceDf: DataFrame,
 streamingWritesParamsOpt: Option[StreamingWriteParams] = 
Option.empty,
 hoodieWriteClient: Option[SparkRDDWriteClient[_]] = Option.empty):
-
   (Boolean, HOption[String], HOption[String], HOption[String], 
SparkRDDWriteClient[_], HoodieTableConfig) = {
-var succeeded = false
-var counter = 0
-val maxRetry: Integer = 
Integer.parseInt(optParams.getOrElse(HoodieWriteConfig.NUM_RETRIES_ON_CONFLICT_FAILURES.key(),
 HoodieWriteConfig.NUM_RETRIES_ON_CONFLICT_FAILURES.defaultValue().toString))
-var toReturn: (Boolean, HOption[String], HOption[String], HOption[String], 
SparkRDDWriteClient[_], HoodieTableConfig) = null
 
-while (counter <= maxRetry && !succeeded) {
-  try {
-toReturn = writeInternal(sqlContext, mode, optParams, sourceDf, 
streamingWritesParamsOpt, hoodieWriteClient)
-if (counter > 0) {
-  log.warn(s"Succeeded with attempt no $counter")
-}
-succeeded = true
-  } catch {
-case e: HoodieWriteConflictException =>
-  val writeConcurrencyMode = 
optParams.getOrElse(HoodieWriteConfig.WRITE_CONCURRENCY_MODE.key(), 
HoodieWriteConfig.WRITE_CONCURRENCY_MODE.defaultValue())
-  if (WriteConcurrencyMode.supportsMultiWriter(writeConcurrencyMode) 
&& counter < maxRetry) {
-counter += 1
-log.warn(s"Conflict found. Retrying again for attempt no $counter")
-  } else {
-throw e
+val retryWrite: () => (Boolean, HOption[String], HOption[String], 
HOption[String], SparkRDDWriteClient[_], HoodieTableConfig) = () => {
+  var succeeded = false
+  var counter = 0
+  val maxRetry: Integer = 
Integer.parseInt(optParams.getOrElse(HoodieWriteConfig.NUM_RETRIES_ON_CONFLICT_FAILURES.key(),
 HoodieWriteConfig.NUM_RETRIES_ON_CONFLICT_FAILURES.defaultValue().toString))
+  var toReturn: (Boolean, HOption[String], HOption[String], 
HOption[String], SparkRDDWriteClient[_], HoodieTableConfig) = null
+
+  while (counter <= maxRetry && !succeeded) {
+try {
+  toReturn = writeInternal(sqlContext, mode, optParams, sourceDf, 
streamingWritesParamsOpt, hoodieWriteClient)
+  if (counter > 0) {
+log.warn(s"Succeeded with attempt no $counter")
   }
+  succeeded = true
+} catch {
+  case e: HoodieWriteConflictException =>
+val writeConcurrencyMode = 
optParams.getOrElse(HoodieWriteConfig.WRITE_CONCURRENCY_MODE.key(), 
HoodieWriteConfig.WRITE_CONCURRENCY_MODE.defaultValue())
+if (WriteConcurrencyMode.supportsMultiWriter(writeConcurrencyMode) 
&& counter < maxRetry) {
+  counter += 1
+  log.warn(s"Conflict found. Retrying again for attempt no 
$counter")
+} else {
+  throw e
+}
+}
   }
+  toReturn
+}
+
+val executionId = getExecutionId(sqlContext.sparkContext, 
sourceDf.queryExecution)
+if (executionId.isEmpty) {
+  sparkAdapter.sqlExecutionWithNewExecutionId(sourceDf.sparkSession, 
sourceDf.queryExecution, Option("Hudi Command"))(

Review Comment:
   @jonvex do you have interest in reviewing this?
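   The diff above moves the existing conflict-retry loop into a `retryWrite` closure so it can run under a fresh SQL execution ID. The retry-on-conflict pattern itself can be sketched in isolation; this is a minimal single-file sketch with hypothetical names (`ConflictException` stands in for `HoodieWriteConflictException`), not Hudi's actual implementation:

   ```java
   import java.util.concurrent.atomic.AtomicInteger;

   public class RetrySketch {
     // Marker for a retryable conflict, standing in for HoodieWriteConflictException.
     static class ConflictException extends RuntimeException {}

     @FunctionalInterface
     interface Attempt<R> { R run(); }

     // Retry op up to maxRetry extra attempts when a ConflictException is thrown;
     // any other failure (or exhausting the budget) propagates to the caller.
     static <R> R retryOnConflict(Attempt<R> op, int maxRetry) {
       int counter = 0;
       while (true) {
         try {
           return op.run();
         } catch (ConflictException e) {
           if (counter < maxRetry) {
             counter++;  // conflict found: retry with attempt no `counter`
           } else {
             throw e;    // budget exhausted: surface the conflict
           }
         }
       }
     }

     public static void main(String[] args) {
       AtomicInteger calls = new AtomicInteger();
       // Fails twice with a conflict, then succeeds on the third attempt.
       String result = retryOnConflict(() -> {
         if (calls.incrementAndGet() < 3) {
           throw new ConflictException();
         }
         return "committed";
       }, 5);
       System.out.println(result + " after " + calls.get() + " attempts"); // committed after 3 attempts
     }
   }
   ```

   In the PR the whole loop becomes a no-argument closure precisely so it can be passed to `sqlExecutionWithNewExecutionId` when no execution ID exists yet.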






Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-03 Thread via GitHub


danny0405 commented on code in PR #11162:
URL: https://github.com/apache/hudi/pull/11162#discussion_r1625209429


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java:
##
@@ -229,6 +231,13 @@ public void cancelAllJobs() {
 javaSparkContext.cancelAllJobs();
   }
 
+  @Override
+  public  O aggregate(HoodieData data, O zeroValue, 
Functions.Function2 seqOp, Functions.Function2 combOp) {
+Function2 seqOpFunc = seqOp::apply;
+Function2 combOpFunc = combOp::apply;
+return HoodieJavaRDD.getJavaRDD(data).aggregate(zeroValue, seqOpFunc, 
combOpFunc);

Review Comment:
   I didn't see changes to `HoodieMetadataMergedLogRecordScanner` that switch from a string cache key to a serializable one, so how do we support non-string secondary index fields?
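   The `aggregate` override above delegates to Spark's `JavaRDD.aggregate(zeroValue, seqOp, combOp)`, where `seqOp` folds elements into a per-partition accumulator and `combOp` merges the partition accumulators. A minimal single-JVM sketch of those semantics (plain Java standing in for the Spark API, not Hudi code):

   ```java
   import java.util.Arrays;
   import java.util.List;
   import java.util.function.BiFunction;
   import java.util.function.BinaryOperator;

   public class AggregateSketch {
     // Emulates RDD.aggregate: seqOp folds values into a per-partition accumulator
     // starting from zeroValue; combOp then merges the partition accumulators.
     static <I, O> O aggregate(List<List<I>> partitions, O zeroValue,
                               BiFunction<O, I, O> seqOp, BinaryOperator<O> combOp) {
       O result = zeroValue;
       for (List<I> partition : partitions) {
         O acc = zeroValue;
         for (I value : partition) {
           acc = seqOp.apply(acc, value);
         }
         result = combOp.apply(result, acc);
       }
       return result;
     }

     public static void main(String[] args) {
       List<List<Integer>> partitions = Arrays.asList(
           Arrays.asList(1, 2, 3), Arrays.asList(4, 5));
       int sum = aggregate(partitions, 0, Integer::sum, Integer::sum);
       System.out.println(sum); // 15
     }
   }
   ```

   As in Spark, `zeroValue` must be neutral for the operation, since it seeds both the per-partition fold and the final combine.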






Re: [I] [SUPPORT] Hive Sync tool fails to sync Hoodi table written using Flink 1.16 to HMS [hudi]

2024-06-03 Thread via GitHub


danny0405 commented on issue #8848:
URL: https://github.com/apache/hudi/issues/8848#issuecomment-2146377029

   @alberttwong Did you package the jar manually with the hive profile?





Re: [PR] [HUDI-7704] Unify test client storage classes with duplicate code [hudi]

2024-06-03 Thread via GitHub


wombatu-kun commented on code in PR #11152:
URL: https://github.com/apache/hudi/pull/11152#discussion_r1625204089


##
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/utils/HoodieWriterClientTestHarness.java:
##
@@ -165,71 +247,1183 @@ public HoodieWriteConfig.Builder 
getConfigBuilder(String schemaStr, HoodieIndex.
 return builder;
   }
 
-  public void assertPartitionMetadataForRecords(String basePath, List<HoodieRecord> inputRecords,
-                                                HoodieStorage storage) throws IOException {
-    Set<String> partitionPathSet = inputRecords.stream()
-        .map(HoodieRecord::getPartitionPath)
-        .collect(Collectors.toSet());
-    assertPartitionMetadata(basePath, partitionPathSet.stream().toArray(String[]::new), storage);
+  // Functional Interfaces for passing lambda and Hoodie Write API contexts
+
+  @FunctionalInterface
+  public interface Function2<R, T1, T2> {

Review Comment:
   The order of type parameters differs from BiFunction: here the result type comes first, while in BiFunction it comes last. I did not add Function2; it already existed in the code before this refactoring and is used a lot (>50 usages). There is also Function3, with the result type in first place and >80 usages. If we replaced Function2 with BiFunction, we should also reorder the type parameters in the Function3 declaration and its usages for consistency.  
   Is it really necessary?
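   The two orderings (and the checked `throws IOException` on the harness interface, which `BiFunction.apply` cannot declare) can be put side by side; a minimal sketch mirroring the quoted declaration, with hypothetical helper names:

   ```java
   import java.io.IOException;
   import java.io.UncheckedIOException;
   import java.util.function.BiFunction;

   public class FunctionOrderSketch {
     // Mirror of the harness interface: result type R is declared first,
     // and apply may throw the checked IOException.
     @FunctionalInterface
     interface Function2<R, T1, T2> {
       R apply(T1 v1, T2 v2) throws IOException;
     }

     static int viaFunction2(String a, String b) {
       Function2<Integer, String, String> f = (x, y) -> (x + y).length();
       try {
         return f.apply(a, b);
       } catch (IOException e) {
         throw new UncheckedIOException(e);
       }
     }

     // java.util.function.BiFunction declares the result type last (<T, U, R>)
     // and its apply cannot throw checked exceptions.
     static int viaBiFunction(String a, String b) {
       BiFunction<String, String, Integer> f = (x, y) -> (x + y).length();
       return f.apply(a, b);
     }

     public static void main(String[] args) {
       System.out.println(viaFunction2("foo", "bar"));  // 6
       System.out.println(viaBiFunction("foo", "bar")); // 6
     }
   }
   ```

   The checked-exception difference is also why a drop-in swap to `BiFunction` would force try/catch wrapping at every lambda that does I/O.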






Re: [PR] [HUDI-7704] Unify test client storage classes with duplicate code [hudi]

2024-06-03 Thread via GitHub


wombatu-kun commented on code in PR #11152:
URL: https://github.com/apache/hudi/pull/11152#discussion_r1625200070


##
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/utils/HoodieWriterClientTestHarness.java:
##
@@ -165,71 +247,1183 @@ public HoodieWriteConfig.Builder 
getConfigBuilder(String schemaStr, HoodieIndex.
 return builder;
   }
 
-  public void assertPartitionMetadataForRecords(String basePath, List<HoodieRecord> inputRecords,
-                                                HoodieStorage storage) throws IOException {
-    Set<String> partitionPathSet = inputRecords.stream()
-        .map(HoodieRecord::getPartitionPath)
-        .collect(Collectors.toSet());
-    assertPartitionMetadata(basePath, partitionPathSet.stream().toArray(String[]::new), storage);
+  // Functional Interfaces for passing lambda and Hoodie Write API contexts
+
+  @FunctionalInterface
+  public interface Function2<R, T1, T2> {
+
+    R apply(T1 v1, T2 v2) throws IOException;
+  }
+
+  @FunctionalInterface
+  public interface Function3<R, T1, T2, T3> {
+
+    R apply(T1 v1, T2 v2, T3 v3) throws IOException;
+  }
+
+  /* Auxiliary methods for testing CopyOnWriteStorage with Spark and Java clients
+     to avoid code duplication in TestHoodieClientOnCopyOnWriteStorage and TestHoodieJavaClientOnCopyOnWriteStorage */
+
+  protected List<WriteStatus> writeAndVerifyBatch(BaseHoodieWriteClient client, List<HoodieRecord> inserts, String commitTime, boolean populateMetaFields, boolean autoCommitOff) throws IOException {
+    // override in subclasses if needed
+    return Collections.emptyList();

Review Comment:
   Ok, made it abstract, moved its implementations from TestHoodieJavaClientOnCopyOnWriteStorage to HoodieJavaClientTestHarness, and from TestHoodieClientOnCopyOnWriteStorage to HoodieSparkClientTestHarness.






Re: [PR] [HUDI-7704] Unify test client storage classes with duplicate code [hudi]

2024-06-03 Thread via GitHub


wombatu-kun commented on code in PR #11152:
URL: https://github.com/apache/hudi/pull/11152#discussion_r1625198786


##
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/testutils/Assertions.java:
##
@@ -51,4 +67,88 @@ public static void assertFileSizesEqual(List 
statuses, CheckedFunct
 assertEquals(fileSizeGetter.apply(status), 
status.getStat().getFileSizeInBytes(;
   }
 
+  public static void assertPartitionMetadataForRecords(String basePath, List<HoodieRecord> inputRecords,
+                                                       HoodieStorage storage) throws IOException {
+    Set<String> partitionPathSet = inputRecords.stream()
+        .map(HoodieRecord::getPartitionPath)
+        .collect(Collectors.toSet());
+    assertPartitionMetadata(basePath, partitionPathSet.stream().toArray(String[]::new), storage);

Review Comment:
   done






Re: [PR] [HUDI-7824] Fixing incr cleaner with savepoint removal [hudi]

2024-06-03 Thread via GitHub


nsivabalan commented on code in PR #11375:
URL: https://github.com/apache/hudi/pull/11375#discussion_r1625198742


##
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/table/action/TestCleanPlanner.java:
##
@@ -393,23 +417,23 @@ static Stream<Arguments> keepLatestByHoursOrCommitsArgsIncrCleanPartitions() {
Map<String, List<String>> latestSavepoints = new HashMap<>();
 latestSavepoints.put(savepoint2, Collections.singletonList(PARTITION1));
 latestSavepoints.put(savepoint3, Collections.singletonList(PARTITION1));
-
arguments.addAll(buildArgumentsForCleanByHoursAndCommitsIncrCleanPartitionsCases(
+
arguments.addAll(buildArgumentsForCleanByHoursAndCommitsIncrCleanPartitionsCases(true,
 earliestInstant, lastCompletedInLastClean, lastCleanInstant, 
earliestInstantInLastClean, Collections.singletonList(PARTITION1),
 Collections.singletonMap(savepoint2, 
Collections.singletonList(PARTITION1)),
 activeInstantsPartitionsMap2, latestSavepoints, 
twoPartitionsInActiveTimeline, false));
 
 // 2 savepoints were tracked in previous clean. one of them is removed in 
latest. A partition which was part of the removed savepoint should be added in 
final
 // list of partitions to clean
Map<String, List<String>> previousSavepoints = new HashMap<>();
-latestSavepoints.put(savepoint2, Collections.singletonList(PARTITION1));
-latestSavepoints.put(savepoint3, Collections.singletonList(PARTITION2));
-
arguments.addAll(buildArgumentsForCleanByHoursAndCommitsIncrCleanPartitionsCases(
+previousSavepoints.put(savepoint2, Collections.singletonList(PARTITION1));
+previousSavepoints.put(savepoint3, Collections.singletonList(PARTITION2));
+
arguments.addAll(buildArgumentsForCleanByHoursAndCommitsIncrCleanPartitionsCases(true,
 earliestInstant, lastCompletedInLastClean, lastCleanInstant, 
earliestInstantInLastClean, Collections.singletonList(PARTITION1),
-previousSavepoints, activeInstantsPartitionsMap2, 
Collections.singletonMap(savepoint3, Collections.singletonList(PARTITION2)), 
twoPartitionsInActiveTimeline, false));
+previousSavepoints, activeInstantsPartitionsMap2, 
Collections.singletonMap(savepoint3, Collections.singletonList(PARTITION2)), 
threePartitionsInActiveTimeline, false));

Review Comment:
   sure. makes sense






Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11385:
URL: https://github.com/apache/hudi/pull/11385#issuecomment-2146337860

   
   ## CI report:
   
   * 8605f0fd0fa5bc1c82a26eac8147fc521040f53a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24201)
 
   * 064b5310f709e5886dd7e278d1ebf9cdcfbe70c7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24206)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11385:
URL: https://github.com/apache/hudi/pull/11385#issuecomment-2146330844

   
   ## CI report:
   
   * 8605f0fd0fa5bc1c82a26eac8147fc521040f53a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24201)
 
   * 064b5310f709e5886dd7e278d1ebf9cdcfbe70c7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7704] Unify test client storage classes with duplicate code [hudi]

2024-06-03 Thread via GitHub


the-other-tim-brown commented on PR #11152:
URL: https://github.com/apache/hudi/pull/11152#issuecomment-2146293807

   Just a couple minor nitpicks but the refactor looks good to me





Re: [PR] [HUDI-7704] Unify test client storage classes with duplicate code [hudi]

2024-06-03 Thread via GitHub


the-other-tim-brown commented on code in PR #11152:
URL: https://github.com/apache/hudi/pull/11152#discussion_r1625155568


##
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/testutils/Assertions.java:
##
@@ -51,4 +67,88 @@ public static void assertFileSizesEqual(List 
statuses, CheckedFunct
 assertEquals(fileSizeGetter.apply(status), 
status.getStat().getFileSizeInBytes(;
   }
 
+  public static void assertPartitionMetadataForRecords(String basePath, List<HoodieRecord> inputRecords,
+                                                       HoodieStorage storage) throws IOException {
+    Set<String> partitionPathSet = inputRecords.stream()
+        .map(HoodieRecord::getPartitionPath)
+        .collect(Collectors.toSet());
+    assertPartitionMetadata(basePath, partitionPathSet.stream().toArray(String[]::new), storage);

Review Comment:
   For this line and line 83, you can simplify this by not collecting to a set and just using `distinct()` on the stream
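   The suggested simplification replaces the intermediate `Set` with `distinct()` directly on the stream; a minimal sketch of both variants (hypothetical data, not the harness code):

   ```java
   import java.util.Arrays;
   import java.util.List;
   import java.util.Set;
   import java.util.stream.Collectors;

   public class DistinctSketch {
     // Original shape: collect to a Set, then stream the Set into an array.
     static String[] viaSet(List<String> partitionPaths) {
       Set<String> set = partitionPaths.stream().collect(Collectors.toSet());
       return set.stream().toArray(String[]::new);
     }

     // Suggested shape: distinct() de-duplicates in one pass and, unlike a
     // HashSet, preserves first-seen encounter order on an ordered stream.
     static String[] viaDistinct(List<String> partitionPaths) {
       return partitionPaths.stream().distinct().toArray(String[]::new);
     }

     public static void main(String[] args) {
       List<String> paths = Arrays.asList("2024/06/01", "2024/06/02", "2024/06/01");
       System.out.println(Arrays.toString(viaDistinct(paths))); // [2024/06/01, 2024/06/02]
     }
   }
   ```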



##
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/utils/HoodieWriterClientTestHarness.java:
##
@@ -165,71 +247,1183 @@ public HoodieWriteConfig.Builder 
getConfigBuilder(String schemaStr, HoodieIndex.
 return builder;
   }
 
-  public void assertPartitionMetadataForRecords(String basePath, List<HoodieRecord> inputRecords,
-                                                HoodieStorage storage) throws IOException {
-    Set<String> partitionPathSet = inputRecords.stream()
-        .map(HoodieRecord::getPartitionPath)
-        .collect(Collectors.toSet());
-    assertPartitionMetadata(basePath, partitionPathSet.stream().toArray(String[]::new), storage);
+  // Functional Interfaces for passing lambda and Hoodie Write API contexts
+
+  @FunctionalInterface
+  public interface Function2<R, T1, T2> {
+
+    R apply(T1 v1, T2 v2) throws IOException;
+  }
+
+  @FunctionalInterface
+  public interface Function3<R, T1, T2, T3> {
+
+    R apply(T1 v1, T2 v2, T3 v3) throws IOException;
+  }
+
+  /* Auxiliary methods for testing CopyOnWriteStorage with Spark and Java clients
+     to avoid code duplication in TestHoodieClientOnCopyOnWriteStorage and TestHoodieJavaClientOnCopyOnWriteStorage */
+
+  protected List<WriteStatus> writeAndVerifyBatch(BaseHoodieWriteClient client, List<HoodieRecord> inserts, String commitTime, boolean populateMetaFields, boolean autoCommitOff) throws IOException {
+    // override in subclasses if needed
+    return Collections.emptyList();

Review Comment:
   Should this just be abstract? Returning an empty list by default may be misleading to other developers who extend this class in the future.



##
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/utils/HoodieWriterClientTestHarness.java:
##
@@ -165,71 +247,1183 @@ public HoodieWriteConfig.Builder 
getConfigBuilder(String schemaStr, HoodieIndex.
 return builder;
   }
 
-  public void assertPartitionMetadataForRecords(String basePath, List<HoodieRecord> inputRecords,
-                                                HoodieStorage storage) throws IOException {
-    Set<String> partitionPathSet = inputRecords.stream()
-        .map(HoodieRecord::getPartitionPath)
-        .collect(Collectors.toSet());
-    assertPartitionMetadata(basePath, partitionPathSet.stream().toArray(String[]::new), storage);
+  // Functional Interfaces for passing lambda and Hoodie Write API contexts
+
+  @FunctionalInterface
+  public interface Function2<R, T1, T2> {

Review Comment:
   There is already a BiFunction in java that does the same thing, can we just 
use that?






Re: [PR] [HUDI-7823] Simplify dependency management on exclusions [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11374:
URL: https://github.com/apache/hudi/pull/11374#issuecomment-2146288529

   
   ## CI report:
   
   * 05fab0df29530420f0a77abf46be996b70c1bc25 UNKNOWN
   * 1e6955bbac8cc18f6774360c7b3ef4e307c1c397 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24205)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7823] Simplify dependency management on exclusions [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11374:
URL: https://github.com/apache/hudi/pull/11374#issuecomment-2146281871

   
   ## CI report:
   
   * 05fab0df29530420f0a77abf46be996b70c1bc25 UNKNOWN
   * 70cb1fe3bf55810cb26a89147fad92594537388c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24204)
 
   * 1e6955bbac8cc18f6774360c7b3ef4e307c1c397 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7823] Simplify dependency management on exclusions [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11374:
URL: https://github.com/apache/hudi/pull/11374#issuecomment-2146274832

   
   ## CI report:
   
   * 05fab0df29530420f0a77abf46be996b70c1bc25 UNKNOWN
   * 6abd40f1b77feb86cdc95d58cd2285c546a1f63e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24180)
 
   * 70cb1fe3bf55810cb26a89147fad92594537388c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7824] Fixing incr cleaner with savepoint removal [hudi]

2024-06-03 Thread via GitHub


the-other-tim-brown commented on code in PR #11375:
URL: https://github.com/apache/hudi/pull/11375#discussion_r1625111858


##
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/table/action/TestCleanPlanner.java:
##
@@ -160,6 +178,12 @@ void testPartitionsForIncrCleaning(HoodieWriteConfig 
config, String earliestInst
 mockLastCleanCommit(mockHoodieTable, lastCleanInstant, 
earliestInstantsInLastClean, activeTimeline, cleanMetadataOptionPair);
 mockFewActiveInstants(mockHoodieTable, activeInstantsPartitions, 
savepointsTrackedInLastClean, areCommitsForSavepointsRemoved);
 
+// mock getAllPartitions
+HoodieStorage storage = mock(HoodieStorage.class);
+when(mockHoodieTable.getStorage()).thenReturn(storage);
+mockedStatic.when(() -> FSUtils.getAllPartitionPaths(context, storage, 
config.getMetadataConfig(), config.getBasePath()))

Review Comment:
   we could also update the CleanPlanner to use 
`hoodieTable.getMetadataTable().getAllPartitionPaths()` which could make the 
test setup cleaner as well



##
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/table/action/TestCleanPlanner.java:
##
@@ -393,23 +417,23 @@ static Stream<Arguments> keepLatestByHoursOrCommitsArgsIncrCleanPartitions() {
Map<String, List<String>> latestSavepoints = new HashMap<>();
 latestSavepoints.put(savepoint2, Collections.singletonList(PARTITION1));
 latestSavepoints.put(savepoint3, Collections.singletonList(PARTITION1));
-
arguments.addAll(buildArgumentsForCleanByHoursAndCommitsIncrCleanPartitionsCases(
+
arguments.addAll(buildArgumentsForCleanByHoursAndCommitsIncrCleanPartitionsCases(true,
 earliestInstant, lastCompletedInLastClean, lastCleanInstant, 
earliestInstantInLastClean, Collections.singletonList(PARTITION1),
 Collections.singletonMap(savepoint2, 
Collections.singletonList(PARTITION1)),
 activeInstantsPartitionsMap2, latestSavepoints, 
twoPartitionsInActiveTimeline, false));
 
 // 2 savepoints were tracked in previous clean. one of them is removed in 
latest. A partition which was part of the removed savepoint should be added in 
final
 // list of partitions to clean
Map<String, List<String>> previousSavepoints = new HashMap<>();
-latestSavepoints.put(savepoint2, Collections.singletonList(PARTITION1));
-latestSavepoints.put(savepoint3, Collections.singletonList(PARTITION2));
-
arguments.addAll(buildArgumentsForCleanByHoursAndCommitsIncrCleanPartitionsCases(
+previousSavepoints.put(savepoint2, Collections.singletonList(PARTITION1));
+previousSavepoints.put(savepoint3, Collections.singletonList(PARTITION2));
+
arguments.addAll(buildArgumentsForCleanByHoursAndCommitsIncrCleanPartitionsCases(true,
 earliestInstant, lastCompletedInLastClean, lastCleanInstant, 
earliestInstantInLastClean, Collections.singletonList(PARTITION1),
-previousSavepoints, activeInstantsPartitionsMap2, 
Collections.singletonMap(savepoint3, Collections.singletonList(PARTITION2)), 
twoPartitionsInActiveTimeline, false));
+previousSavepoints, activeInstantsPartitionsMap2, 
Collections.singletonMap(savepoint3, Collections.singletonList(PARTITION2)), 
threePartitionsInActiveTimeline, false));

Review Comment:
   Should the descriptions in the comments be updated to match the changes in 
the expected partitions?






Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11162:
URL: https://github.com/apache/hudi/pull/11162#issuecomment-2145850157

   
   ## CI report:
   
   * b342d8f8e10f77419bf1bd0bc9f626a596ad65f9 UNKNOWN
   * ab2875d506fbb642636ca10d044fa9b9e5c951ae Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24203)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] Implement Support for Ursa Frame ingestion pipelines [hudi]

2024-06-03 Thread via GitHub


balaji-varadarajan closed pull request #11386: Implement Support for Ursa Frame 
ingestion pipelines
URL: https://github.com/apache/hudi/pull/11386





Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11162:
URL: https://github.com/apache/hudi/pull/11162#issuecomment-2145748432

   
   ## CI report:
   
   * b342d8f8e10f77419bf1bd0bc9f626a596ad65f9 UNKNOWN
   * b00831e0a0506714d27bc2a64e58084b357a83cc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24196)
 
   * ab2875d506fbb642636ca10d044fa9b9e5c951ae Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24203)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] Hive Sync tool fails to sync Hoodi table written using Flink 1.16 to HMS [hudi]

2024-06-03 Thread via GitHub


alberttwong commented on issue #8848:
URL: https://github.com/apache/hudi/issues/8848#issuecomment-2145746488

   adding in https://mvnrepository.com/artifact/org.apache.thrift/libfb303
   
   ```
   Running Command : java -cp 
/hive/lib/hive-metastore-3.1.3.jar::/hive/lib/hive-service-3.1.3.jar::/hive/lib/hive-exec-3.1.3.jar::/hive/lib/hive-jdbc-3.1.3.jar:/hive/lib/hive-jdbc-handler-3.1.3.jar::/hive/lib/jackson-annotations-2.12.0.jar:/hive/lib/jackson-core-2.12.0.jar:/hive/lib/jackson-core-asl-1.9.13.jar:/hive/lib/jackson-databind-2.12.0.jar:/hive/lib/jackson-dataformat-smile-2.12.0.jar:/hive/lib/jackson-mapper-asl-1.9.13.jar:/hive/lib/jackson-module-scala_2.11-2.12.0.jar::/hadoop/share/hadoop/common/*:/hadoop/share/hadoop/mapreduce/*:/hadoop/share/hadoop/hdfs/*:/hadoop/share/hadoop/common/lib/*:/hadoop/share/hadoop/hdfs/lib/*:/root/.ivy2/jars/*:/hadoop/etc/hadoop:/opt/hudi/hudi-sync/hudi-hive-sync/../../packaging/hudi-hive-sync-bundle/target/hudi-hive-sync-bundle-1.0.0-SNAPSHOT.jar
 org.apache.hudi.hive.HiveSyncTool --metastore-uris 
thrift://hive-metastore:9083 --partitioned-by city --base-path 
s3a://warehouse/people --database hudi_db --table people --sync-mode hms
   2024-06-03 17:15:25,270 INFO  [main] conf.HiveConf 
(HiveConf.java:findConfigFile(187)) - Found configuration file null
   2024-06-03 17:15:25,444 WARN  [main] util.NativeCodeLoader 
(NativeCodeLoader.java:(60)) - Unable to load native-hadoop library for 
your platform... using builtin-java classes where applicable
   2024-06-03 17:15:25,550 INFO  [main] impl.MetricsConfig 
(MetricsConfig.java:loadFirst(120)) - Loaded properties from 
hadoop-metrics2.properties
   2024-06-03 17:15:25,581 INFO  [main] impl.MetricsSystemImpl 
(MetricsSystemImpl.java:startTimer(378)) - Scheduled Metric snapshot period at 
10 second(s).
   2024-06-03 17:15:25,581 INFO  [main] impl.MetricsSystemImpl 
(MetricsSystemImpl.java:start(191)) - s3a-file-system metrics system started
   2024-06-03 17:15:26,025 INFO  [main] table.HoodieTableMetaClient 
(HoodieTableMetaClient.java:(148)) - Loading HoodieTableMetaClient from 
s3a://warehouse/people
   2024-06-03 17:15:26,120 INFO  [main] table.HoodieTableConfig 
(HoodieTableConfig.java:(309)) - Loading table properties from 
s3a://warehouse/people/.hoodie/hoodie.properties
   2024-06-03 17:15:26,140 INFO  [main] table.HoodieTableMetaClient 
(HoodieTableMetaClient.java:(169)) - Finished Loading Table of type 
COPY_ON_WRITE(version=1) from s3a://warehouse/people
   2024-06-03 17:15:26,140 INFO  [main] table.HoodieTableMetaClient 
(HoodieTableMetaClient.java:(171)) - Loading Active commit timeline for 
s3a://warehouse/people
   2024-06-03 17:15:26,159 INFO  [main] timeline.HoodieActiveTimeline 
(HoodieActiveTimeline.java:(177)) - Loaded instants upto : 
Option{val=[20240603170053432__commit__COMPLETED]}
   2024-06-03 17:15:26,229 ERROR [main] utils.MetaStoreUtils 
(MetaStoreUtils.java:logAndThrowMetaException(166)) - Got exception: 
java.net.URISyntaxException Illegal character in hostname at index 35: 
thrift://demo-hive-metastore-1.demo_default:9083
   java.net.URISyntaxException: Illegal character in hostname at index 35: 
thrift://demo-hive-metastore-1.demo_default:9083
   ```
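For context on that final error: `java.net.URI` accepts an underscore in a registry-based authority, but rejects it when the authority is validated as a server (hostname) authority — which is what trips up `demo_default`, likely a Docker Compose default network name. A minimal sketch reproducing the rejection:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UnderscoreHostname {
    public static void main(String[] args) throws URISyntaxException {
        // Constructing the URI succeeds: '_' is legal in a registry-based authority.
        URI uri = new URI("thrift://demo-hive-metastore-1.demo_default:9083");
        try {
            // Validating it as a server authority applies hostname rules,
            // which forbid '_' -- mirroring the sync tool's failure above.
            uri.parseServerAuthority();
            System.out.println("valid hostname");
        } catch (URISyntaxException e) {
            System.out.println("rejected: " + e.getReason());
        }
    }
}
```

The usual workaround is to give the metastore a hostname without underscores (e.g. a dash-only network alias in Compose).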





Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11162:
URL: https://github.com/apache/hudi/pull/11162#issuecomment-2145735114

   
   ## CI report:
   
   * b342d8f8e10f77419bf1bd0bc9f626a596ad65f9 UNKNOWN
   * b00831e0a0506714d27bc2a64e58084b357a83cc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24196)
 
   * ab2875d506fbb642636ca10d044fa9b9e5c951ae UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] Hive Sync tool fails to sync Hoodi table written using Flink 1.16 to HMS [hudi]

2024-06-03 Thread via GitHub


alberttwong commented on issue #8848:
URL: https://github.com/apache/hudi/issues/8848#issuecomment-2145731084

   after adding 
https://mvnrepository.com/artifact/org.apache.calcite/calcite-core, I ran into
   
   ```
   root@spark:/opt/hudi/hudi-sync/hudi-hive-sync# ./run_sync_tool.sh  
--metastore-uris 'thrift://hive-metastore:9083' --partitioned-by city 
--base-path 's3a://warehouse/people' --database hudi_db --table people 
--sync-mode hms 
   setting hadoop conf dir
   Running Command : java -cp 
/hive/lib/hive-metastore-3.1.3.jar::/hive/lib/hive-service-3.1.3.jar::/hive/lib/hive-exec-3.1.3.jar::/hive/lib/hive-jdbc-3.1.3.jar:/hive/lib/hive-jdbc-handler-3.1.3.jar::/hive/lib/jackson-annotations-2.12.0.jar:/hive/lib/jackson-core-2.12.0.jar:/hive/lib/jackson-core-asl-1.9.13.jar:/hive/lib/jackson-databind-2.12.0.jar:/hive/lib/jackson-dataformat-smile-2.12.0.jar:/hive/lib/jackson-mapper-asl-1.9.13.jar:/hive/lib/jackson-module-scala_2.11-2.12.0.jar::/hadoop/share/hadoop/common/*:/hadoop/share/hadoop/mapreduce/*:/hadoop/share/hadoop/hdfs/*:/hadoop/share/hadoop/common/lib/*:/hadoop/share/hadoop/hdfs/lib/*:/root/.ivy2/jars/*:/hadoop/etc/hadoop:/opt/hudi/hudi-sync/hudi-hive-sync/../../packaging/hudi-hive-sync-bundle/target/hudi-hive-sync-bundle-1.0.0-SNAPSHOT.jar
 org.apache.hudi.hive.HiveSyncTool --metastore-uris 
thrift://hive-metastore:9083 --partitioned-by city --base-path 
s3a://warehouse/people --database hudi_db --table people --sync-mode hms
   2024-06-03 17:10:42,515 INFO  [main] conf.HiveConf 
(HiveConf.java:findConfigFile(187)) - Found configuration file null
   2024-06-03 17:10:42,707 WARN  [main] util.NativeCodeLoader 
(NativeCodeLoader.java:(60)) - Unable to load native-hadoop library for 
your platform... using builtin-java classes where applicable
   2024-06-03 17:10:42,824 INFO  [main] impl.MetricsConfig 
(MetricsConfig.java:loadFirst(120)) - Loaded properties from 
hadoop-metrics2.properties
   2024-06-03 17:10:42,858 INFO  [main] impl.MetricsSystemImpl 
(MetricsSystemImpl.java:startTimer(378)) - Scheduled Metric snapshot period at 
10 second(s).
   2024-06-03 17:10:42,858 INFO  [main] impl.MetricsSystemImpl 
(MetricsSystemImpl.java:start(191)) - s3a-file-system metrics system started
   2024-06-03 17:10:43,304 INFO  [main] table.HoodieTableMetaClient 
(HoodieTableMetaClient.java:(148)) - Loading HoodieTableMetaClient from 
s3a://warehouse/people
   2024-06-03 17:10:43,395 INFO  [main] table.HoodieTableConfig 
(HoodieTableConfig.java:(309)) - Loading table properties from 
s3a://warehouse/people/.hoodie/hoodie.properties
   2024-06-03 17:10:43,413 INFO  [main] table.HoodieTableMetaClient 
(HoodieTableMetaClient.java:(169)) - Finished Loading Table of type 
COPY_ON_WRITE(version=1) from s3a://warehouse/people
   2024-06-03 17:10:43,413 INFO  [main] table.HoodieTableMetaClient 
(HoodieTableMetaClient.java:(171)) - Loading Active commit timeline for 
s3a://warehouse/people
   2024-06-03 17:10:43,431 INFO  [main] timeline.HoodieActiveTimeline 
(HoodieActiveTimeline.java:(177)) - Loaded instants upto : 
Option{val=[20240603170053432__commit__COMPLETED]}
   Exception in thread "main" java.lang.NoClassDefFoundError: 
com/facebook/fb303/FacebookService$Iface
   ```
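The missing `com.facebook.fb303.FacebookService$Iface` class ships in Thrift's `libfb303` artifact (the one linked in the follow-up comment). A sketch of staging it alongside the other Hive jars — the version and directory are assumptions, and the jar still has to end up on the classpath that `run_sync_tool.sh` prints:

```shell
# Version and paths are assumptions for illustration; adjust to your layout.
HIVE_LIB="${HIVE_LIB:-/hive/lib}"
FB303_VERSION="0.9.3"
JAR_URL="https://repo1.maven.org/maven2/org/apache/thrift/libfb303/${FB303_VERSION}/libfb303-${FB303_VERSION}.jar"
# Dry run: print the fetch command instead of executing it.
echo "curl -fLo ${HIVE_LIB}/libfb303-${FB303_VERSION}.jar ${JAR_URL}"
```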





Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11385:
URL: https://github.com/apache/hudi/pull/11385#issuecomment-2145719815

   
   ## CI report:
   
   * 8605f0fd0fa5bc1c82a26eac8147fc521040f53a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24201)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-03 Thread via GitHub


codope commented on code in PR #11162:
URL: https://github.com/apache/hudi/pull/11162#discussion_r1624781058


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java:
##
@@ -229,6 +231,13 @@ public void cancelAllJobs() {
 javaSparkContext.cancelAllJobs();
   }
 
+  @Override
+  public <I, O> O aggregate(HoodieData<I> data, O zeroValue, Functions.Function2<O, I, O> seqOp, Functions.Function2<O, O, O> combOp) {
+    Function2<O, I, O> seqOpFunc = seqOp::apply;
+    Function2<O, O, O> combOpFunc = combOp::apply;
+    return HoodieJavaRDD.getJavaRDD(data).aggregate(zeroValue, seqOpFunc, combOpFunc);

Review Comment:
   This is based on 
https://github.com/apache/spark/blob/7e8b60b5ae7d6453bc1ce51b5112c975f9aa8757/core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala#L426
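The method-reference bridging in that hunk (`seqOp::apply` becoming a Spark `Function2`) works because a method reference can satisfy any functional interface with a compatible signature. A standalone sketch with simplified stand-in interfaces (not the real Hudi `Functions.Function2` or Spark `Function2` types):

```java
public class SamAdapterDemo {
    // Simplified stand-in for Hudi's Functions.Function2 (assumption: same SAM shape).
    interface HudiFn2<T1, T2, R> { R apply(T1 a, T2 b); }
    // Simplified stand-in for Spark's Function2, with a differently named method.
    interface SparkFn2<T1, T2, R> { R call(T1 a, T2 b); }

    static <T1, T2, R> R runWithSpark(SparkFn2<T1, T2, R> fn, T1 a, T2 b) {
        return fn.call(a, b);
    }

    public static void main(String[] args) {
        HudiFn2<Integer, Integer, Integer> seqOp = Integer::sum;
        // 'seqOp::apply' compiles to a SparkFn2 because the signatures line up,
        // so no hand-written adapter class is needed.
        SparkFn2<Integer, Integer, Integer> adapted = seqOp::apply;
        System.out.println(runWithSpark(adapted, 2, 3)); // prints 5
    }
}
```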






[PR] Implement Support for Ursa Frame ingestion pipelines [hudi]

2024-06-03 Thread via GitHub


balaji-varadarajan opened a new pull request, #11386:
URL: https://github.com/apache/hudi/pull/11386

   ### Change Logs
   
   As described in 
https://docs.google.com/document/d/1sY1Kimyom_qL9-a5Z7lVf43SkZDZCi_wlwxs7J95WMU/edit,
 Implement pipelines to ingest Ursa frame-gen and other process output to 
lakehouse.
   
   ### Impact
   
   
   As described in 
https://docs.google.com/document/d/1sY1Kimyom_qL9-a5Z7lVf43SkZDZCi_wlwxs7J95WMU/edit,
 Implement pipelines to ingest Ursa frame-gen and other process output to 
lakehouse.
   
   ### Risk level (write none, low medium or high below)
   none
   
   ### Documentation Update
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-03 Thread via GitHub


codope commented on code in PR #11162:
URL: https://github.com/apache/hudi/pull/11162#discussion_r1624771184


##
hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java:
##
@@ -312,6 +312,33 @@ public Map> 
readRecordIndex(List
+   * If the Metadata Table is not enabled, an exception is thrown to 
distinguish this from the absence of the key.
+   *
+   * @param secondaryKeys The list of secondary keys to read
+   */
+  @Override
+  public Map> 
readSecondaryIndex(List secondaryKeys) {
+
ValidationUtils.checkState(dataMetaClient.getTableConfig().isMetadataPartitionAvailable(MetadataPartitionType.RECORD_INDEX),
+"Record index is not initialized in MDT");
+ValidationUtils.checkState(
+
dataMetaClient.getTableConfig().getMetadataPartitions().stream().anyMatch(partitionName
 -> 
partitionName.startsWith(MetadataPartitionType.SECONDARY_INDEX.getPartitionPath())),
+"Secondary index is not initialized in MDT");
+// Fetch secondary-index records
+Map>> secondaryKeyRecords 
= getSecondaryIndexRecords(secondaryKeys, 
MetadataPartitionType.SECONDARY_INDEX.getPartitionPath());
+// Now collect the record-keys and fetch the RLI records

Review Comment:
   No, here it is the RLI. The secondary index maps a secondary key to primary 
keys, so we then need to look up the RLI to get the files containing those 
matching primary keys.
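That two-step resolution can be sketched in memory (the maps and keys below are hypothetical stand-ins; Hudi actually reads these mappings from the secondary-index and record-index partitions of the metadata table):

```java
import java.util.*;

public class TwoStepIndexLookup {
    public static void main(String[] args) {
        // Hypothetical index contents.
        Map<String, List<String>> secondaryIndex = new HashMap<>(); // secondary key -> record keys
        Map<String, String> recordLevelIndex = new HashMap<>();     // record key -> file
        secondaryIndex.put("city=SF", Arrays.asList("r1", "r2"));
        recordLevelIndex.put("r1", "file-0");
        recordLevelIndex.put("r2", "file-1");

        // Step 1: the secondary index maps the queried value to primary record keys.
        List<String> recordKeys = secondaryIndex.getOrDefault("city=SF", Collections.emptyList());

        // Step 2: the record-level index maps those keys to the files to scan.
        Set<String> filesToScan = new TreeSet<>();
        for (String recordKey : recordKeys) {
            filesToScan.add(recordLevelIndex.get(recordKey));
        }
        System.out.println(filesToScan); // prints [file-0, file-1]
    }
}
```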






Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-03 Thread via GitHub


codope commented on code in PR #11162:
URL: https://github.com/apache/hudi/pull/11162#discussion_r1624769566


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:
##
@@ -851,6 +852,158 @@ private Map 
reverseLookupSecondaryKeys(String partitionName, Lis
 return recordKeyMap;
   }
 
+  @Override
+  protected Map>> 
getSecondaryIndexRecords(List keys, String partitionName) {
+if (keys.isEmpty()) {
+  return Collections.emptyMap();
+}
+
+Map>> result = new 
HashMap<>();
+
+// Load the file slices for the partition. Each file slice is a shard 
which saves a portion of the keys.
+List partitionFileSlices = 
partitionFileSliceMap.computeIfAbsent(partitionName,
+k -> 
HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices(metadataMetaClient, 
metadataFileSystemView, partitionName));
+final int numFileSlices = partitionFileSlices.size();
+ValidationUtils.checkState(numFileSlices > 0, "Number of file slices for 
partition " + partitionName + " should be > 0");
+
+// Lookup keys from each file slice
+// TODO: parallelize this loop
+for (FileSlice partition : partitionFileSlices) {
+  Map>> 
currentFileSliceResult = lookupSecondaryKeysFromFileSlice(partitionName, keys, 
partition);
+
+  currentFileSliceResult.forEach((secondaryKey, secondaryRecords) -> {
+result.merge(secondaryKey, secondaryRecords, (oldRecords, newRecords) 
-> {
+  newRecords.addAll(oldRecords);
+  return newRecords;
+});
+  });
+}
+
+return result;
+  }
+
+  /**
+   * Lookup list of keys from a single file slice.
+   *
+   * @param partitionName Name of the partition
+   * @param secondaryKeys The list of secondary keys to lookup
+   * @param fileSlice The file slice to read
+   * @return A {@code Map} of secondary-key to list of {@code HoodieRecord} 
for the secondary-keys which were found in the file slice
+   */
+  private Map>> 
lookupSecondaryKeysFromFileSlice(String partitionName, List 
secondaryKeys, FileSlice fileSlice) {
+Map> logRecordsMap = new HashMap<>();
+
+Pair, HoodieMetadataLogRecordReader> readers = 
getOrCreateReaders(partitionName, fileSlice);
+try {
+  List timings = new ArrayList<>(1);
+  HoodieSeekingFileReader baseFileReader = readers.getKey();
+  HoodieMetadataLogRecordReader logRecordScanner = readers.getRight();
+  if (baseFileReader == null && logRecordScanner == null) {
+return Collections.emptyMap();
+  }
+
+  // Sort it here once so that we don't need to sort individually for base 
file and for each individual log files.
+  Set secondaryKeySet = new HashSet<>(secondaryKeys.size());
+  List sortedSecondaryKeys = new ArrayList<>(secondaryKeys);
+  Collections.sort(sortedSecondaryKeys);

Review Comment:
   Good point! I have now parallelized the lookup through the engineContext, so 
this sorting is limited to a single partition of data and should not spill to 
disk. Even if it does, Spark will handle it.






Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-03 Thread via GitHub


codope commented on code in PR #11162:
URL: https://github.com/apache/hudi/pull/11162#discussion_r1624767426


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSecondaryIndexWithSql.scala:
##
@@ -95,4 +97,39 @@ class TestSecondaryIndexWithSql extends 
SecondaryIndexTestBase {
   private def checkAnswer(sql: String)(expects: Seq[Any]*): Unit = {
 assertResult(expects.map(row => Row(row: 
_*)).toArray.sortBy(_.toString()))(spark.sql(sql).collect().sortBy(_.toString()))
   }
+
+  @Test
+  def testSecondaryIndexWithInFilter(): Unit = {
+if (HoodieSparkUtils.gteqSpark3_2) {
+  var hudiOpts = commonOpts
+  hudiOpts = hudiOpts + (
+DataSourceWriteOptions.TABLE_TYPE.key -> 
HoodieTableType.COPY_ON_WRITE.name(),
+DataSourceReadOptions.ENABLE_DATA_SKIPPING.key -> "true")
+
+  spark.sql(
+s"""
+   |create table $tableName (
+   |  record_key_col string,
+   |  not_record_key_col string,
+   |  partition_key_col string
+   |) using hudi
+   | options (
+   |  primaryKey ='record_key_col',
+   |  hoodie.metadata.enable = 'true',
+   |  hoodie.metadata.record.index.enable = 'true',
+   |  hoodie.datasource.write.recordkey.field = 'record_key_col',
+   |  hoodie.enable.data.skipping = 'true'
+   | )
+   | partitioned by(partition_key_col)
+   | location '$basePath'
+   """.stripMargin)
+  spark.sql(s"insert into $tableName values('row1', 'abc', 'p1')")

Review Comment:
   I've added it now, but I discovered one issue while testing that I am still 
fixing.



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SecondaryIndexSupport.scala:
##
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.RecordLevelIndexSupport.filterQueryWithRecordKey
+import org.apache.hudi.SecondaryIndexSupport.filterQueriesWithSecondaryKey
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.FileSlice
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import 
org.apache.hudi.metadata.HoodieTableMetadataUtil.PARTITION_NAME_SECONDARY_INDEX
+import org.apache.hudi.storage.StoragePath
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.expressions.Expression
+
+import scala.collection.JavaConverters._
+import scala.collection.{JavaConverters, mutable}
+
+class SecondaryIndexSupport(spark: SparkSession,
+metadataConfig: HoodieMetadataConfig,
+metaClient: HoodieTableMetaClient) extends 
RecordLevelIndexSupport(spark, metadataConfig, metaClient) {
+  override def getIndexName: String = SecondaryIndexSupport.INDEX_NAME

Review Comment:
   yes






Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11385:
URL: https://github.com/apache/hudi/pull/11385#issuecomment-2145635715

   
   ## CI report:
   
   * 8605f0fd0fa5bc1c82a26eac8147fc521040f53a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24201)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11385:
URL: https://github.com/apache/hudi/pull/11385#issuecomment-2145618615

   
   ## CI report:
   
   * 8605f0fd0fa5bc1c82a26eac8147fc521040f53a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





(hudi) branch branch-0.x updated: [HUDI-7816] Provide SourceProfileSupplier option into the SnapshotLoadQuerySplitter (branch-0.x) (#11379)

2024-06-03 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/branch-0.x by this push:
 new 63773c58018 [HUDI-7816] Provide SourceProfileSupplier option into the 
SnapshotLoadQuerySplitter (branch-0.x) (#11379)
63773c58018 is described below

commit 63773c58018efe3414c941d65ed78958fcf6d32f
Author: Matthew Wong 
AuthorDate: Mon Jun 3 09:07:56 2024 -0700

[HUDI-7816] Provide SourceProfileSupplier option into the 
SnapshotLoadQuerySplitter (branch-0.x) (#11379)
---
 .../apache/hudi/utilities/sources/HoodieIncrSource.java  |  2 +-
 .../utilities/sources/SnapshotLoadQuerySplitter.java | 16 
 .../hudi/utilities/sources/helpers/QueryRunner.java  |  2 +-
 .../sources/helpers/TestSnapshotQuerySplitterImpl.java   |  3 ++-
 4 files changed, 16 insertions(+), 7 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java
index 768e4c3c3fc..79264c6fd6e 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java
@@ -203,7 +203,7 @@ public class HoodieIncrSource extends RowSource {
   .option(DataSourceReadOptions.QUERY_TYPE().key(), 
DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL())
   .load(srcPath);
   if (snapshotLoadQuerySplitter.isPresent()) {
-queryInfo = 
snapshotLoadQuerySplitter.get().getNextCheckpoint(snapshot, queryInfo);
+queryInfo = 
snapshotLoadQuerySplitter.get().getNextCheckpoint(snapshot, queryInfo, 
sourceProfileSupplier);
   }
   source = snapshot
   // add filtering so that only interested records are returned.
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/SnapshotLoadQuerySplitter.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/SnapshotLoadQuerySplitter.java
index ca299122ec7..f0fd1fed904 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/SnapshotLoadQuerySplitter.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/SnapshotLoadQuerySplitter.java
@@ -18,10 +18,14 @@
 
 package org.apache.hudi.utilities.sources;
 
+import org.apache.hudi.ApiMaturityLevel;
+import org.apache.hudi.PublicAPIClass;
+import org.apache.hudi.PublicAPIMethod;
 import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.ReflectionUtils;
 import org.apache.hudi.utilities.sources.helpers.QueryInfo;
+import org.apache.hudi.utilities.streamer.SourceProfileSupplier;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 
@@ -30,6 +34,7 @@ import static 
org.apache.hudi.utilities.sources.SnapshotLoadQuerySplitter.Config
 /**
  * Abstract splitter responsible for managing the snapshot load query 
operations.
  */
+@PublicAPIClass(maturity = ApiMaturityLevel.EVOLVING)
 public abstract class SnapshotLoadQuerySplitter {
 
   /**
@@ -61,20 +66,23 @@ public abstract class SnapshotLoadQuerySplitter {
*
* @param df The dataset to process.
* @param beginCheckpointStr The starting checkpoint string.
+   * @param sourceProfileSupplier An Option of a SourceProfileSupplier to use 
in load splitting implementation
* @return The next checkpoint as an Option.
*/
-  public abstract Option getNextCheckpoint(Dataset df, String 
beginCheckpointStr);
+  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
+  public abstract Option getNextCheckpoint(Dataset df, String 
beginCheckpointStr, Option sourceProfileSupplier);
 
   /**
-   * Retrieves the next checkpoint based on query information.
+   * Retrieves the next checkpoint based on query information and a 
SourceProfileSupplier.
*
* @param df The dataset to process.
* @param queryInfo The query information object.
+   * @param sourceProfileSupplier An Option of a SourceProfileSupplier to use 
in load splitting implementation
* @return Updated query information with the next checkpoint, in case of 
empty checkpoint,
* returning endPoint same as queryInfo.getEndInstant().
*/
-  public QueryInfo getNextCheckpoint(Dataset df, QueryInfo queryInfo) {
-return getNextCheckpoint(df, queryInfo.getStartInstant())
+  public QueryInfo getNextCheckpoint(Dataset df, QueryInfo queryInfo, 
Option sourceProfileSupplier) {
+return getNextCheckpoint(df, queryInfo.getStartInstant(), 
sourceProfileSupplier)
 .map(checkpoint -> queryInfo.withUpdatedEndInstant(checkpoint))
 .orElse(queryInfo);
   }
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/QueryRunner.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilitie

Re: [PR] [HUDI-7816] Provide SourceProfileSupplier option into the SnapshotLoadQuerySplitter (branch-0.x) [hudi]

2024-06-03 Thread via GitHub


yihua merged PR #11379:
URL: https://github.com/apache/hudi/pull/11379





(hudi) branch dependabot/maven/io.airlift-aircompressor-0.27 deleted (was 5042e73eb65)

2024-06-03 Thread github-bot
This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a change to branch 
dependabot/maven/io.airlift-aircompressor-0.27
in repository https://gitbox.apache.org/repos/asf/hudi.git


 was 5042e73eb65 Bump io.airlift:aircompressor from 0.25 to 0.27

The revisions that were on this branch are still contained in
other references; therefore, this change does not discard any commits
from the repository.



(hudi) branch master updated: [HUDI-7822] Bump io.airlift:aircompressor from 0.25 to 0.27 (#11380)

2024-06-03 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new d0c7de050a8 [HUDI-7822] Bump io.airlift:aircompressor from 0.25 to 
0.27 (#11380)
d0c7de050a8 is described below

commit d0c7de050a8900a29f5d127093b378b96f9c5158
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
AuthorDate: Mon Jun 3 09:07:28 2024 -0700

[HUDI-7822] Bump io.airlift:aircompressor from 0.25 to 0.27 (#11380)

Signed-off-by: dependabot[bot] 
Co-authored-by: dependabot[bot] 
<49699333+dependabot[bot]@users.noreply.github.com>
---
 pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pom.xml b/pom.xml
index 8e86c154e86..b3eade21971 100644
--- a/pom.xml
+++ b/pom.xml
@@ -130,7 +130,7 @@
 1.6.0
 1.5.6
 0.9.47
-0.25
+0.27
 0.13.0
 0.8.0
 4.5.13



Re: [PR] [HUDI-7822] Bump io.airlift:aircompressor from 0.25 to 0.27 [hudi]

2024-06-03 Thread via GitHub


yihua merged PR #11380:
URL: https://github.com/apache/hudi/pull/11380





[jira] [Assigned] (HUDI-7822) Resolve the conflicts between mixed hdfs and local path in Flink tests

2024-06-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7822:
---

Assignee: Ethan Guo

> Resolve the conflicts between mixed hdfs and local path in Flink tests
> --
>
> Key: HUDI-7822
> URL: https://issues.apache.org/jira/browse/HUDI-7822
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Bumps the dependency to mitigate the vulnerability.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7827) Bump io.airlift:aircompressor from 0.25 to 0.27

2024-06-03 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7827:
---

 Summary: Bump io.airlift:aircompressor from 0.25 to 0.27
 Key: HUDI-7827
 URL: https://issues.apache.org/jira/browse/HUDI-7827
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Ethan Guo








[jira] [Updated] (HUDI-7822) Resolve the conflicts between mixed hdfs and local path in Flink tests

2024-06-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7822:

Fix Version/s: 0.16.0

> Resolve the conflicts between mixed hdfs and local path in Flink tests
> --
>
> Key: HUDI-7822
> URL: https://issues.apache.org/jira/browse/HUDI-7822
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.16.0
>
>
> Bumps the dependency to mitigate the vulnerability.





[jira] [Updated] (HUDI-7822) Resolve the conflicts between mixed hdfs and local path in Flink tests

2024-06-03 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7822:

Description: Bumps the dependency to mitigate the vulnerability.

> Resolve the conflicts between mixed hdfs and local path in Flink tests
> --
>
> Key: HUDI-7822
> URL: https://issues.apache.org/jira/browse/HUDI-7822
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Bumps the dependency to mitigate the vulnerability.





(hudi) branch branch-0.x updated: [MINOR] Avoid logging full commit metadata at info level (#11382)

2024-06-03 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/branch-0.x by this push:
 new 4684082406c [MINOR] Avoid logging full commit metadata at info level  
(#11382)
4684082406c is described below

commit 4684082406c4d23c97b25e96297b7c05fd653208
Author: Tim Brown 
AuthorDate: Mon Jun 3 11:01:42 2024 -0500

[MINOR] Avoid logging full commit metadata at info level  (#11382)
---
 .../org/apache/hudi/client/BaseHoodieTableServiceClient.java | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
index 7dcff3bd6f2..ff0f635b06e 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
@@ -327,7 +327,8 @@ public abstract class BaseHoodieTableServiceClient 
extends BaseHoodieCl
   finalizeWrite(table, compactionCommitTime, writeStats);
   // commit to data table after committing to metadata table.
   writeTableMetadata(table, compactionCommitTime, metadata, 
context.emptyHoodieData());
-  LOG.info("Committing Compaction {}. Finished with result {}", 
compactionCommitTime, metadata);
+  LOG.info("Committing Compaction {}", compactionCommitTime);
+  LOG.debug("Compaction {} finished with result: {}", 
compactionCommitTime, metadata);
   CompactHelpers.getInstance().completeInflightCompaction(table, 
compactionCommitTime, metadata);
 } finally {
   this.txnManager.endTransaction(Option.of(compactionInstant));
@@ -388,7 +389,8 @@ public abstract class BaseHoodieTableServiceClient 
extends BaseHoodieCl
   finalizeWrite(table, logCompactionCommitTime, writeStats);
   // commit to data table after committing to metadata table.
   writeTableMetadata(table, logCompactionCommitTime, metadata, 
context.emptyHoodieData());
-  LOG.info("Committing Log Compaction {}. Finished with result {}", 
logCompactionCommitTime, metadata);
+  LOG.info("Committing Log Compaction {}", logCompactionCommitTime);
+  LOG.debug("Log Compaction {} finished with result {}", 
logCompactionCommitTime, metadata);
   CompactHelpers.getInstance().completeInflightLogCompaction(table, 
logCompactionCommitTime, metadata);
 } finally {
   this.txnManager.endTransaction(Option.of(logCompactionInstant));
@@ -513,7 +515,8 @@ public abstract class BaseHoodieTableServiceClient 
extends BaseHoodieCl
   // Update table's metadata (table)
   writeTableMetadata(table, clusteringInstant.getTimestamp(), metadata, 
writeStatuses.orElseGet(context::emptyHoodieData));
 
-  LOG.info("Committing Clustering {}. Finished with result {}", 
clusteringCommitTime, metadata);
+  LOG.info("Committing Clustering {}", clusteringCommitTime);
+  LOG.debug("Clustering {} finished with result {}", clusteringCommitTime, 
metadata);
 
   table.getActiveTimeline().transitionReplaceInflightToComplete(
   clusteringInstant,
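
The commit above demotes the full commit metadata from INFO to DEBUG while keeping a short INFO line. A minimal standalone sketch of the same split-level logging pattern, using java.util.logging rather than Hudi's SLF4J setup (class and method names here are illustrative, not Hudi's):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class CommitLogging {
    // Short, always-visible summary line for INFO level.
    static String infoLine(String action, String commitTime) {
        return String.format("Committing %s %s", action, commitTime);
    }

    // Verbose line carrying the full metadata, intended for debug level only.
    static String debugLine(String action, String commitTime, String metadata) {
        return String.format("%s %s finished with result: %s", action, commitTime, metadata);
    }

    public static void main(String[] args) {
        Logger log = Logger.getLogger(CommitLogging.class.getName());
        String commitTime = "20240603110142";
        // One short line at INFO; the potentially huge metadata only at FINE (debug).
        log.info(infoLine("Compaction", commitTime));
        if (log.isLoggable(Level.FINE)) {
            log.fine(debugLine("Compaction", commitTime, "{...}"));
        }
    }
}
```

This keeps production logs compact while the full metadata remains recoverable by raising the logger's verbosity.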



Re: [PR] [MINOR] Avoid logging full commit metadata at info level [hudi]

2024-06-03 Thread via GitHub


yihua merged PR #11382:
URL: https://github.com/apache/hudi/pull/11382


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [DISCUSSION] Deltastreamer - Reading commit checkpoint from Kafka instead of latest Hoodie commit [hudi]

2024-06-03 Thread via GitHub


KishanFairmatic commented on issue #11268:
URL: https://github.com/apache/hudi/issues/11268#issuecomment-2145507599

   @danny0405: In master, i.e. Hudi version 1.0.0, there is a flag 
`--ignore-checkpoint` that does the same thing. We are on version 0.13.0, so 
we will use this approach until 1.0.0 is stable and we are ready to upgrade.
   
   However, by default, when `auto.offset.reset = group` and there are no 
commits in Kafka on the first attempt, the checkpoint falls back to `latest`, 
which can mean data loss. Either the fallback should be `earliest`, or there 
should be an option to choose `earliest` so no data is missed.
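
For context, plain Kafka consumers expose the same earliest-vs-latest trade-off through the standard `auto.offset.reset` property; a hedged sketch of configuring the safer fallback (property names are from the stock Kafka consumer API, not Hudi's KafkaOffsetGen, and the broker address is a placeholder):

```java
import java.util.Properties;

public class OffsetResetConfig {
    // Build consumer properties; "earliest" replays from the beginning of the
    // topic when no committed offset exists for the group, avoiding silent
    // data loss on a first run, while "latest" skips all existing records.
    static Properties consumerProps(String groupId, String reset) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.setProperty("group.id", groupId);
        props.setProperty("auto.offset.reset", reset); // "earliest" or "latest"
        return props;
    }

    public static void main(String[] args) {
        Properties props = consumerProps("hudi-deltastreamer", "earliest");
        System.out.println(props.getProperty("auto.offset.reset"));
    }
}
```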





Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2145498112

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * 22846139475031d663fc6bb2b1a554dd1b2e637e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24200)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11385:
URL: https://github.com/apache/hudi/pull/11385#issuecomment-2145441835

   
   ## CI report:
   
   * 11cd96c4d0e7727918907e231c3eef8c997f0476 UNKNOWN
   * 8605f0fd0fa5bc1c82a26eac8147fc521040f53a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24201)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11385:
URL: https://github.com/apache/hudi/pull/11385#issuecomment-2145334253

   
   ## CI report:
   
   * 11cd96c4d0e7727918907e231c3eef8c997f0476 UNKNOWN
   * 8605f0fd0fa5bc1c82a26eac8147fc521040f53a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24201)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2145332448

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * ec6fa62945094d548dce7d7e8e6ef2363ba0d05f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24179)
 
   * 22846139475031d663fc6bb2b1a554dd1b2e637e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24200)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7825]Support Report pending clustering and compaction plan metric [hudi]

2024-06-03 Thread via GitHub


LXin96 commented on code in PR #11377:
URL: https://github.com/apache/hudi/pull/11377#discussion_r1624539082


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java:
##
@@ -272,6 +275,28 @@ public void notifyCheckpointComplete(long checkpointId) {
 );
   }
 
+  private void emitCompactionAndClusteringMetrics(Configuration conf,
+  HoodieTableMetaClient metaClient, HoodieFlinkWriteClient writeClient) {
+if (conf.getBoolean(FlinkOptions.CLUSTERING_SCHEDULE_ENABLED)
+&& !conf.getBoolean(FlinkOptions.CLUSTERING_ASYNC_ENABLED)) {
+  HoodieTimeline pendingReplaceTimeline = metaClient.getActiveTimeline()
+  .filterPendingReplaceTimeline();
+  HoodieMetrics metrics = writeClient.getMetrics();
+  if (metrics != null) {
+
metrics.setPendingClusteringCount(pendingReplaceTimeline.countInstants());
+  }
+}
+if (conf.getBoolean(FlinkOptions.COMPACTION_SCHEDULE_ENABLED)

Review Comment:
   @danny0405 I see your point, but this is a different situation: when 
FlinkOptions.COMPACTION_SCHEDULE_ENABLED is set to true and 
FlinkOptions.COMPACTION_ASYNC_ENABLED is set to false, the 
CompactionPlanOperator is not added to the pipeline, so the pending compaction 
plan is never reported; the same situation applies to clustering. This setup 
is used to offload clustering or compaction to a separate job.






Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11385:
URL: https://github.com/apache/hudi/pull/11385#issuecomment-2145315579

   
   ## CI report:
   
   * 11cd96c4d0e7727918907e231c3eef8c997f0476 UNKNOWN
   * 8605f0fd0fa5bc1c82a26eac8147fc521040f53a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2145313418

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * ec6fa62945094d548dce7d7e8e6ef2363ba0d05f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24179)
 
   * 22846139475031d663fc6bb2b1a554dd1b2e637e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11385:
URL: https://github.com/apache/hudi/pull/11385#issuecomment-2145294738

   
   ## CI report:
   
   * 11cd96c4d0e7727918907e231c3eef8c997f0476 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-06-03 Thread via GitHub


jonvex commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1624489092


##
hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java:
##
@@ -231,7 +231,13 @@ private static Option findNestedField(Schema 
schema, String[] fiel
 if (!nestedPart.isPresent()) {
   return Option.empty();
 }
-return nestedPart;
+boolean isUnion = false;

Review Comment:
   I uncommented the test in 
hudi-common/src/test/java/org/apache/hudi/avro/TestAvroSchemaUtils.java in this 
PR.






[jira] [Updated] (HUDI-7747) In MetaClient remove getBasePathV2() and return StoragePath from getBasePath()

2024-06-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7747:
-
Labels: pull-request-available  (was: )

> In MetaClient remove getBasePathV2() and return StoragePath from getBasePath()
> --
>
> Key: HUDI-7747
> URL: https://issues.apache.org/jira/browse/HUDI-7747
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Vova Kolmakov
>Priority: Major
>  Labels: pull-request-available
>
> In HoodieTableMetaClient remove getBasePathV2() and return StoragePath from 
> getBasePath().





[PR] [HUDI-7747] In MetaClient remove getBasePathV2() and return StoragePath from getBasePath() [hudi]

2024-06-03 Thread via GitHub


wombatu-kun opened a new pull request, #11385:
URL: https://github.com/apache/hudi/pull/11385

   ### Change Logs
   
   In HoodieTableMetaClient remove getBasePathV2() and return StoragePath from 
getBasePath().
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   none
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-5956] fix spark DAG ui when write [hudi]

2024-06-03 Thread via GitHub


KnightChess commented on code in PR #11376:
URL: https://github.com/apache/hudi/pull/11376#discussion_r1624306145


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -174,32 +174,43 @@ class HoodieSparkSqlWriterInternal {
 sourceDf: DataFrame,
 streamingWritesParamsOpt: Option[StreamingWriteParams] = 
Option.empty,
 hoodieWriteClient: Option[SparkRDDWriteClient[_]] = Option.empty):
-
   (Boolean, HOption[String], HOption[String], HOption[String], 
SparkRDDWriteClient[_], HoodieTableConfig) = {
-var succeeded = false
-var counter = 0
-val maxRetry: Integer = 
Integer.parseInt(optParams.getOrElse(HoodieWriteConfig.NUM_RETRIES_ON_CONFLICT_FAILURES.key(),
 HoodieWriteConfig.NUM_RETRIES_ON_CONFLICT_FAILURES.defaultValue().toString))
-var toReturn: (Boolean, HOption[String], HOption[String], HOption[String], 
SparkRDDWriteClient[_], HoodieTableConfig) = null
 
-while (counter <= maxRetry && !succeeded) {
-  try {
-toReturn = writeInternal(sqlContext, mode, optParams, sourceDf, 
streamingWritesParamsOpt, hoodieWriteClient)
-if (counter > 0) {
-  log.warn(s"Succeeded with attempt no $counter")
-}
-succeeded = true
-  } catch {
-case e: HoodieWriteConflictException =>
-  val writeConcurrencyMode = 
optParams.getOrElse(HoodieWriteConfig.WRITE_CONCURRENCY_MODE.key(), 
HoodieWriteConfig.WRITE_CONCURRENCY_MODE.defaultValue())
-  if (WriteConcurrencyMode.supportsMultiWriter(writeConcurrencyMode) 
&& counter < maxRetry) {
-counter += 1
-log.warn(s"Conflict found. Retrying again for attempt no $counter")
-  } else {
-throw e
+val retryWrite: () => (Boolean, HOption[String], HOption[String], 
HOption[String], SparkRDDWriteClient[_], HoodieTableConfig) = () => {
+  var succeeded = false
+  var counter = 0
+  val maxRetry: Integer = 
Integer.parseInt(optParams.getOrElse(HoodieWriteConfig.NUM_RETRIES_ON_CONFLICT_FAILURES.key(),
 HoodieWriteConfig.NUM_RETRIES_ON_CONFLICT_FAILURES.defaultValue().toString))
+  var toReturn: (Boolean, HOption[String], HOption[String], 
HOption[String], SparkRDDWriteClient[_], HoodieTableConfig) = null
+
+  while (counter <= maxRetry && !succeeded) {
+try {
+  toReturn = writeInternal(sqlContext, mode, optParams, sourceDf, 
streamingWritesParamsOpt, hoodieWriteClient)
+  if (counter > 0) {
+log.warn(s"Succeeded with attempt no $counter")
   }
+  succeeded = true
+} catch {
+  case e: HoodieWriteConflictException =>
+val writeConcurrencyMode = 
optParams.getOrElse(HoodieWriteConfig.WRITE_CONCURRENCY_MODE.key(), 
HoodieWriteConfig.WRITE_CONCURRENCY_MODE.defaultValue())
+if (WriteConcurrencyMode.supportsMultiWriter(writeConcurrencyMode) 
&& counter < maxRetry) {
+  counter += 1
+  log.warn(s"Conflict found. Retrying again for attempt no 
$counter")
+} else {
+  throw e
+}
+}
   }
+  toReturn
+}
+
+val executionId = getExecutionId(sqlContext.sparkContext, 
sourceDf.queryExecution)
+if (executionId.isEmpty) {
+  sparkAdapter.sqlExecutionWithNewExecutionId(sourceDf.sparkSession, 
sourceDf.queryExecution, Option("Hudi Command"))(

Review Comment:
   this executionId will be sub-list in rootExecutionId after this pr 
https://github.com/apache/spark/pull/40403, so ignore this `TODO` 
https://github.com/apache/hudi/pull/8233#discussion_r1298071684, cc @codope 
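
The refactor under review wraps the write in a bounded retry-on-conflict loop. A minimal standalone sketch of that pattern in Java (the exception type and limits here are illustrative, not Hudi's actual `HoodieWriteConflictException` or config keys):

```java
import java.util.function.Supplier;

public class BoundedRetry {
    static class ConflictException extends RuntimeException {}

    // Retry `op` up to maxRetry additional times when a conflict is thrown;
    // any other failure, or exhausting the retries, propagates to the caller.
    static <T> T retryOnConflict(Supplier<T> op, int maxRetry) {
        int counter = 0;
        while (true) {
            try {
                return op.get();
            } catch (ConflictException e) {
                if (counter < maxRetry) {
                    counter++; // conflict found; retry with the next attempt
                } else {
                    throw e;   // retries exhausted
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        String result = retryOnConflict(() -> {
            attempts[0]++;
            if (attempts[0] < 3) {
                throw new ConflictException(); // first two attempts conflict
            }
            return "committed";
        }, 5);
        System.out.println(result + " after " + attempts[0] + " attempts"); // prints "committed after 3 attempts"
    }
}
```

Extracting the loop into a function like this is what allows the PR to wrap it in a Spark SQL execution scope without duplicating the retry logic.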






Re: [PR] [MINOR] Fix LoggerName for BaseHoodieTableServiceClient [hudi]

2024-06-03 Thread via GitHub


danny0405 merged PR #11384:
URL: https://github.com/apache/hudi/pull/11384





(hudi) branch master updated: [MINOR] Fix LoggerName for BaseHoodieTableServiceClient (#11384)

2024-06-03 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new e131a049de2 [MINOR] Fix LoggerName for BaseHoodieTableServiceClient 
(#11384)
e131a049de2 is described below

commit e131a049de25ebc08d86eb3148e49bd2c1f87b54
Author: wuzhenhua <102498303+wuzhenhu...@users.noreply.github.com>
AuthorDate: Mon Jun 3 18:36:26 2024 +0800

[MINOR] Fix LoggerName for BaseHoodieTableServiceClient (#11384)

Co-authored-by: Admin 
---
 .../main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java  | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
index a05a236f31d..23dfec7dee3 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
@@ -97,7 +97,7 @@ import static 
org.apache.hudi.metadata.HoodieTableMetadataUtil.isIndexingCommit;
  */
 public abstract class BaseHoodieTableServiceClient extends 
BaseHoodieClient implements RunsTableService {
 
-  private static final Logger LOG = 
LoggerFactory.getLogger(BaseHoodieWriteClient.class);
+  private static final Logger LOG = 
LoggerFactory.getLogger(BaseHoodieTableServiceClient.class);
 
   protected transient Timer.Context compactionTimer;
   protected transient Timer.Context clusteringTimer;



Re: [PR] [MINOR] Fix LoggerName for BaseHoodieTableServiceClient [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11384:
URL: https://github.com/apache/hudi/pull/11384#issuecomment-2144815626

   
   ## CI report:
   
   * b54b7755e458b8d4da262febd4d2cf9f0607ada8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24197)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [MINOR] Fix LoggerName for BaseHoodieTableServiceClient [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11384:
URL: https://github.com/apache/hudi/pull/11384#issuecomment-2144794744

   
   ## CI report:
   
   * b54b7755e458b8d4da262febd4d2cf9f0607ada8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11162:
URL: https://github.com/apache/hudi/pull/11162#issuecomment-2144793701

   
   ## CI report:
   
   * b342d8f8e10f77419bf1bd0bc9f626a596ad65f9 UNKNOWN
   * b00831e0a0506714d27bc2a64e58084b357a83cc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24196)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[PR] [MINOR] Fix LoggerName for BaseHoodieTableServiceClient [hudi]

2024-06-03 Thread via GitHub


wuzhenhua01 opened a new pull request, #11384:
URL: https://github.com/apache/hudi/pull/11384

   ### Change Logs
   
   Fix LoggerName for BaseHoodieTableServiceClient
   
   ### Impact
   
   No
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [x] CI passed
   





Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11162:
URL: https://github.com/apache/hudi/pull/11162#issuecomment-2144669166

   
   ## CI report:
   
   * df12fa59cbba5b14bb98d66dffb510f5b1659177 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24195)
 
   * b342d8f8e10f77419bf1bd0bc9f626a596ad65f9 UNKNOWN
   * b00831e0a0506714d27bc2a64e58084b357a83cc Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24196)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7704] Unify test client storage classes with duplicate code [hudi]

2024-06-03 Thread via GitHub


wombatu-kun commented on PR #11152:
URL: https://github.com/apache/hudi/pull/11152#issuecomment-2144664083

   I'm tired of resolving conflicts for this PR again and again. Somebody 
review it and merge, please!





Re: [I] [DISCUSSION] Deltastreamer - Reading commit checkpoint from Kafka instead of latest Hoodie commit [hudi]

2024-06-03 Thread via GitHub


danny0405 commented on issue #11268:
URL: https://github.com/apache/hudi/issues/11268#issuecomment-2144599040

   Usually we do not create PRs against a released tag; can you open a new one 
against master?





Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-03 Thread via GitHub


danny0405 commented on code in PR #11162:
URL: https://github.com/apache/hudi/pull/11162#discussion_r1624007332


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSecondaryIndexWithSql.scala:
##
@@ -95,4 +97,39 @@ class TestSecondaryIndexWithSql extends 
SecondaryIndexTestBase {
   private def checkAnswer(sql: String)(expects: Seq[Any]*): Unit = {
 assertResult(expects.map(row => Row(row: 
_*)).toArray.sortBy(_.toString()))(spark.sql(sql).collect().sortBy(_.toString()))
   }
+
+  @Test
+  def testSecondaryIndexWithInFilter(): Unit = {
+if (HoodieSparkUtils.gteqSpark3_2) {
+  var hudiOpts = commonOpts
+  hudiOpts = hudiOpts + (
+DataSourceWriteOptions.TABLE_TYPE.key -> 
HoodieTableType.COPY_ON_WRITE.name(),
+DataSourceReadOptions.ENABLE_DATA_SKIPPING.key -> "true")
+
+  spark.sql(
+s"""
+   |create table $tableName (
+   |  record_key_col string,
+   |  not_record_key_col string,
+   |  partition_key_col string
+   |) using hudi
+   | options (
+   |  primaryKey ='record_key_col',
+   |  hoodie.metadata.enable = 'true',
+   |  hoodie.metadata.record.index.enable = 'true',
+   |  hoodie.datasource.write.recordkey.field = 'record_key_col',
+   |  hoodie.enable.data.skipping = 'true'
+   | )
+   | partitioned by(partition_key_col)
+   | location '$basePath'
+   """.stripMargin)
+  spark.sql(s"insert into $tableName values('row1', 'abc', 'p1')")

Review Comment:
   Do we have a test case for non-string values?






Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-03 Thread via GitHub


danny0405 commented on code in PR #11162:
URL: https://github.com/apache/hudi/pull/11162#discussion_r1624004588


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SecondaryIndexSupport.scala:
##
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.RecordLevelIndexSupport.filterQueryWithRecordKey
+import org.apache.hudi.SecondaryIndexSupport.filterQueriesWithSecondaryKey
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.FileSlice
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import 
org.apache.hudi.metadata.HoodieTableMetadataUtil.PARTITION_NAME_SECONDARY_INDEX
+import org.apache.hudi.storage.StoragePath
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.expressions.Expression
+
+import scala.collection.JavaConverters._
+import scala.collection.{JavaConverters, mutable}
+
+class SecondaryIndexSupport(spark: SparkSession,
+metadataConfig: HoodieMetadataConfig,
+metaClient: HoodieTableMetaClient) extends 
RecordLevelIndexSupport(spark, metadataConfig, metaClient) {
+  override def getIndexName: String = SecondaryIndexSupport.INDEX_NAME

Review Comment:
   This is also code reused from RLI, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-03 Thread via GitHub


danny0405 commented on code in PR #11162:
URL: https://github.com/apache/hudi/pull/11162#discussion_r1624001312


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:
##
@@ -851,6 +852,158 @@ private Map reverseLookupSecondaryKeys(String partitionName, Lis
 return recordKeyMap;
   }
 
+  @Override
+  protected Map<String, List<HoodieRecord<HoodieMetadataPayload>>> getSecondaryIndexRecords(List<String> keys, String partitionName) {
+    if (keys.isEmpty()) {
+      return Collections.emptyMap();
+    }
+
+    Map<String, List<HoodieRecord<HoodieMetadataPayload>>> result = new HashMap<>();
+
+    // Load the file slices for the partition. Each file slice is a shard which saves a portion of the keys.
+    List<FileSlice> partitionFileSlices = partitionFileSliceMap.computeIfAbsent(partitionName,
+        k -> HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices(metadataMetaClient, metadataFileSystemView, partitionName));
+    final int numFileSlices = partitionFileSlices.size();
+    ValidationUtils.checkState(numFileSlices > 0, "Number of file slices for partition " + partitionName + " should be > 0");
+
+    // Lookup keys from each file slice
+    // TODO: parallelize this loop
+    for (FileSlice partition : partitionFileSlices) {
+      Map<String, List<HoodieRecord<HoodieMetadataPayload>>> currentFileSliceResult = lookupSecondaryKeysFromFileSlice(partitionName, keys, partition);
+
+      currentFileSliceResult.forEach((secondaryKey, secondaryRecords) -> {
+        result.merge(secondaryKey, secondaryRecords, (oldRecords, newRecords) -> {
+          newRecords.addAll(oldRecords);
+          return newRecords;
+        });
+      });
+    }
+
+    return result;
+  }
+
+  /**
+   * Lookup list of keys from a single file slice.
+   *
+   * @param partitionName Name of the partition
+   * @param secondaryKeys The list of secondary keys to lookup
+   * @param fileSlice The file slice to read
+   * @return A {@code Map} of secondary-key to list of {@code HoodieRecord} for the secondary-keys which were found in the file slice
+   */
+  private Map<String, List<HoodieRecord<HoodieMetadataPayload>>> lookupSecondaryKeysFromFileSlice(String partitionName, List<String> secondaryKeys, FileSlice fileSlice) {
+    Map<String, HoodieRecord<HoodieMetadataPayload>> logRecordsMap = new HashMap<>();
+
+    Pair<HoodieSeekingFileReader<?>, HoodieMetadataLogRecordReader> readers = getOrCreateReaders(partitionName, fileSlice);
+    try {
+      List<Long> timings = new ArrayList<>(1);
+      HoodieSeekingFileReader<?> baseFileReader = readers.getKey();
+      HoodieMetadataLogRecordReader logRecordScanner = readers.getRight();
+      if (baseFileReader == null && logRecordScanner == null) {
+        return Collections.emptyMap();
+      }
+
+      // Sort once here so that we don't need to sort individually for the base file and for each log file.
+      Set<String> secondaryKeySet = new HashSet<>(secondaryKeys.size());
+      List<String> sortedSecondaryKeys = new ArrayList<>(secondaryKeys);
+      Collections.sort(sortedSecondaryKeys);
Review Comment:
   Wondering if we have a general sort solution that supports spilling to disk; the in-memory sort is slow and also has a risk of OOM.
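For context, the per-slice merge in `getSecondaryIndexRecords` quoted above relies on `Map.merge` to concatenate record lists whenever the same secondary key is found in more than one file slice. A minimal standalone sketch of that merge step, with plain `String`s standing in for `HoodieRecord`s and all names hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SliceMergeDemo {
  // Combine per-file-slice lookup results into one map. When a secondary key
  // appears in several slices, the record lists are concatenated via Map.merge.
  public static Map<String, List<String>> mergeSliceResults(List<Map<String, List<String>>> perSlice) {
    Map<String, List<String>> result = new HashMap<>();
    for (Map<String, List<String>> sliceResult : perSlice) {
      sliceResult.forEach((secondaryKey, records) ->
          result.merge(secondaryKey, new ArrayList<>(records), (oldRecords, newRecords) -> {
            // Same shape as the quoted diff: fold the earlier records into the newer list.
            newRecords.addAll(oldRecords);
            return newRecords;
          }));
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, List<String>> slice1 = Map.of("sec-key-1", List.of("rec-1"));
    Map<String, List<String>> slice2 = Map.of("sec-key-1", List.of("rec-2"), "sec-key-2", List.of("rec-3"));
    // sec-key-1 ends up with the records from both slices
    System.out.println(mergeSliceResults(List.of(slice1, slice2)));
  }
}
```

One difference from the diff: the sketch copies each incoming list defensively (`new ArrayList<>(records)`) so the inputs are never mutated; the real code can merge the per-slice lists in place because each one is freshly built.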






Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11162:
URL: https://github.com/apache/hudi/pull/11162#issuecomment-2144576528

   
   ## CI report:
   
   * df12fa59cbba5b14bb98d66dffb510f5b1659177 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24195)
 
   * b342d8f8e10f77419bf1bd0bc9f626a596ad65f9 UNKNOWN
   * b00831e0a0506714d27bc2a64e58084b357a83cc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-03 Thread via GitHub


danny0405 commented on code in PR #11162:
URL: https://github.com/apache/hudi/pull/11162#discussion_r1623995661


##
hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java:
##
@@ -312,6 +312,33 @@ public Map> readRecordIndex(List
+   * If the Metadata Table is not enabled, an exception is thrown to distinguish this from the absence of the key.
+   *
+   * @param secondaryKeys The list of secondary keys to read
+   */
+  @Override
+  public Map<String, List<HoodieRecordGlobalLocation>> readSecondaryIndex(List<String> secondaryKeys) {
+    ValidationUtils.checkState(dataMetaClient.getTableConfig().isMetadataPartitionAvailable(MetadataPartitionType.RECORD_INDEX),
+        "Record index is not initialized in MDT");
+    ValidationUtils.checkState(
+        dataMetaClient.getTableConfig().getMetadataPartitions().stream().anyMatch(partitionName -> partitionName.startsWith(MetadataPartitionType.SECONDARY_INDEX.getPartitionPath())),
+        "Secondary index is not initialized in MDT");
+    // Fetch secondary-index records
+    Map<String, List<HoodieRecord<HoodieMetadataPayload>>> secondaryKeyRecords = getSecondaryIndexRecords(secondaryKeys, MetadataPartitionType.SECONDARY_INDEX.getPartitionPath());
+    // Now collect the record-keys and fetch the RLI records

Review Comment:
   do you mean secondary index records?
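For context, `readSecondaryIndex` above is a two-phase lookup: the secondary index partition maps each secondary key to the record keys bearing that value, and the record-level index (RLI) then maps those record keys to file locations. A rough sketch, where plain maps stand in for the two MDT partitions and every name is hypothetical:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SecondaryLookupSketch {
  // Phase 1: the secondary index maps a secondary key to matching record keys.
  // Phase 2: the record-level index (RLI) maps each record key to its location.
  public static Map<String, String> lookupLocations(
      List<String> secondaryKeys,
      Map<String, List<String>> secondaryIndex,
      Map<String, String> recordLevelIndex) {
    Map<String, String> locations = new HashMap<>();
    for (String secondaryKey : secondaryKeys) {
      for (String recordKey : secondaryIndex.getOrDefault(secondaryKey, Collections.emptyList())) {
        String location = recordLevelIndex.get(recordKey);
        if (location != null) {
          // Only record keys present in the RLI contribute a location.
          locations.put(recordKey, location);
        }
      }
    }
    return locations;
  }

  public static void main(String[] args) {
    Map<String, List<String>> secondaryIndex = Map.of("city=SF", List.of("uuid-1", "uuid-2"));
    Map<String, String> rli = Map.of("uuid-1", "file-group-7", "uuid-2", "file-group-9");
    System.out.println(lookupLocations(List.of("city=SF"), secondaryIndex, rli));
  }
}
```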






Re: [PR] [HUDI-7146] Integrate secondary index on reader path [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11162:
URL: https://github.com/apache/hudi/pull/11162#issuecomment-2144549129

   
   ## CI report:
   
   * 3c52961bdbcb210e4c7140f5939143cfda7adb50 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24151)
 
   * df12fa59cbba5b14bb98d66dffb510f5b1659177 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24195)
 
   * b342d8f8e10f77419bf1bd0bc9f626a596ad65f9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7704] Unify test client storage classes with duplicate code [hudi]

2024-06-03 Thread via GitHub


hudi-bot commented on PR #11152:
URL: https://github.com/apache/hudi/pull/11152#issuecomment-2144531002

   
   ## CI report:
   
   * 9946df9dd0aca2e4e8613b36265462d76397c8d8 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24194)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   




