[hudi] 01/01: [MINOR] Update DOAP with 0.14.0 Release

2023-09-28 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch prashantwason-update-doap-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 051cd1f6f78893025d6d55d65d86553e426f900b
Author: Prashant Wason 
AuthorDate: Thu Sep 28 12:40:01 2023 -0700

[MINOR] Update DOAP with 0.14.0 Release
---
 doap_HUDI.rdf | 5 +
 1 file changed, 5 insertions(+)

diff --git a/doap_HUDI.rdf b/doap_HUDI.rdf
index 259c776a7e7..9a5eb593a3f 100644
--- a/doap_HUDI.rdf
+++ b/doap_HUDI.rdf
@@ -126,6 +126,11 @@
     <created>2023-05-25</created>
     <revision>0.13.1</revision>
   </Version>
+   <Version>
+     <name>Apache Hudi 0.14.0</name>
+     <created>2023-09-28</created>
+     <revision>0.14.0</revision>
+   </Version>
 
 
   



[hudi] branch prashantwason-update-doap-0.14.0 created (now 051cd1f6f78)

2023-09-28 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a change to branch prashantwason-update-doap-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git


  at 051cd1f6f78 [MINOR] Update DOAP with 0.14.0 Release

This branch includes the following new commits:

 new 051cd1f6f78 [MINOR] Update DOAP with 0.14.0 Release

The 1 revision listed above as "new" is entirely new to this
repository and will be described in a separate email.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.




svn commit: r64222 - in /dev/hudi/hudi-0.14.0: ./ hudi-0.14.0.src.tgz hudi-0.14.0.src.tgz.asc hudi-0.14.0.src.tgz.sha512

2023-09-27 Thread pwason
Author: pwason
Date: Wed Sep 27 17:48:49 2023
New Revision: 64222

Log:
Add Apache Hudi 0.14.0 source release


Added:
dev/hudi/hudi-0.14.0/
dev/hudi/hudi-0.14.0/hudi-0.14.0.src.tgz   (with props)
dev/hudi/hudi-0.14.0/hudi-0.14.0.src.tgz.asc
dev/hudi/hudi-0.14.0/hudi-0.14.0.src.tgz.sha512

Added: dev/hudi/hudi-0.14.0/hudi-0.14.0.src.tgz
==
Binary file - no diff available.

Propchange: dev/hudi/hudi-0.14.0/hudi-0.14.0.src.tgz
--
svn:mime-type = application/octet-stream

Added: dev/hudi/hudi-0.14.0/hudi-0.14.0.src.tgz.asc
==
--- dev/hudi/hudi-0.14.0/hudi-0.14.0.src.tgz.asc (added)
+++ dev/hudi/hudi-0.14.0/hudi-0.14.0.src.tgz.asc Wed Sep 27 17:48:49 2023
@@ -0,0 +1,16 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQIzBAABCAAdFiEEdcV0Tp5c1cSOGcCCxNhY1zudsbgFAmUUad4ACgkQxNhY1zud
+sbhQWg/9HZ0+uTz9+mlg1mj+rqcw1p3ogki3+bQud4VtYD9WcTxdSzkp2BQVYwqO
+ZeiJrafD4MSNCuaz0HWCCtvRR3EAeYuMp46Ct9VL5Cf8dZ7pI4UdpvDuygke6y6E
+yRA8vDyaVznhwPxLMaKip+h0lesve7wHQ4bi+NBMz8yFoL2gYxIvDnKa/CJ3TZwv
+RassW9NYrVvnS5Qdz8rTCIHd0fFQRXJ+rCP9n+uUMacAfThnxNZ7kqaO0PqyVlMo
+8jS7Q/vJN/pd3T1cryGZWn0xzOBbg21l7hEKs1aqwGUh5/4FutaiT35Z3OAtIONT
+vLicEfoF9oX9+aTtvgW3Ydei7HKnMmgnx022ANePS+D4a4YX2aJP2pIW73mdQEeL
+SLdrGCDeu6UiVZj/22xubfb+QVWjiHybpARskAPZlkEJbd89W20IwAx+V+8kCaWS
+qI5LiOfavrVZyEVgDjZXuucCtHkH5HiM/YOaZzcCpiEiSgN9PGfUXp1GVkBl/soL
+lcbPaXyuaOXdegwB6TnpqGPylS70Lvx87+3POclWgd/onKmcsSZiJxff9KCWefUF
+/ICE0BU7YAv1wKjJHr/PrPiRu1jiaxcwrDDoqHnXWAmnO5SkaxAEaZoO/CEAXgcC
+1qCx7qnImsXgnrdUw8T5m4NfjPqMXIYsGQaef5HKOV1V1/1vbbs=
+=aim4
+-----END PGP SIGNATURE-----

Added: dev/hudi/hudi-0.14.0/hudi-0.14.0.src.tgz.sha512
==
--- dev/hudi/hudi-0.14.0/hudi-0.14.0.src.tgz.sha512 (added)
+++ dev/hudi/hudi-0.14.0/hudi-0.14.0.src.tgz.sha512 Wed Sep 27 17:48:49 2023
@@ -0,0 +1 @@
+fe40b171a76514cc0f94ae6ee8180bbfd81c5dc1bec8867ad0ebec0a07dbd6d87de90bf3bfee35cfcf6659e14960af4aa1fcf0709019dfbac044c4c390362ad0
  hudi-0.14.0.src.tgz




[hudi] annotated tag release-0.14.0 updated (47bdc270956 -> bf5b2585abb)

2023-09-27 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a change to annotated tag release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git


*** WARNING: tag release-0.14.0 was modified! ***

from 47bdc270956 (commit)
  to bf5b2585abb (tag)
 tagging 47bdc2709566f726fa503919c87004ec26f14817 (commit)
 replaces release-0.14.0-rc2
  by Prashant Wason
  on Wed Sep 27 10:45:12 2023 -0700

- Log -
0.14.0
-----BEGIN PGP SIGNATURE-----

iQIzBAABCAAdFiEEdcV0Tp5c1cSOGcCCxNhY1zudsbgFAmUUaigACgkQxNhY1zud
sbiACA//asd7F30JvrcfuBDcGTIU02yyQR5qI9Ltok3cYkdB423H/qjPMI0sumh6
qhl9PmDDpS2pYDECJy4NiR+YbLtpKZUWJdJIgDCDda4YjOMwkG+h/iCTA7fMxnFv
WWEhBG2BCpSgBGuM5GpqY8dwj+Kxte4v6B0CH2e5oWRWkaVgaGOsy+IjRhBqfY5w
E0mbvfkBRKpM7mMmuHLMFDAhhqlL1Jex/WOd3vzFqq00TKTXwWNFgXp+0XV7JLD6
uq3C+BG7ZQteNaoQ3j8lK21/JlXKKXaMcl+J1WIcDAjtOlkVOrX0r5dcqkLTm94y
SCFU81kgzl49kh3ISoFOkXw9IDyihZhw+V0XGmeu3J/fXrj1bRMINBI1J1Yj0ExL
/DqbC93dMBckJ/zkVJtIpxWikvk86LnfgbUxRlTiTStOoyf9JchoqgOJMtAwCWHs
vfzDptv2XdSHy5K2fG5+ypw5XN0Q5Bt6JSy60h4KY4JlyXnXSjjQ0G5f9mvo5Jxc
Ow51BfuNMy+9tn5u6HJCMguq7dvSI0weLoUC8BaPB8xmaFjYY6yuMX6vI3fZGbWD
kfMR7faIM6yni11bNn8e5os++8eosVdnCwcgzsN5A8NfGO8KXWnq6oOv+44zdDl2
bND9GNwQZWnGkPut7mUGtfI8IG5ZYCfcLW5jUkgOy8oZWNhQdfo=
=zZcK
-----END PGP SIGNATURE-----
---


No new revisions were added by this update.

Summary of changes:



[hudi] branch release-0.14.0 updated: [MINOR] Update release version to reflect published version 0.14.0

2023-09-27 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/release-0.14.0 by this push:
 new 47bdc270956 [MINOR] Update release version to reflect published version 0.14.0
47bdc270956 is described below

commit 47bdc2709566f726fa503919c87004ec26f14817
Author: Prashant Wason 
AuthorDate: Wed Sep 27 10:40:09 2023 -0700

[MINOR] Update release version to reflect published version 0.14.0
---
 docker/hoodie/hadoop/base/pom.xml| 2 +-
 docker/hoodie/hadoop/base_java11/pom.xml | 2 +-
 docker/hoodie/hadoop/datanode/pom.xml| 2 +-
 docker/hoodie/hadoop/historyserver/pom.xml   | 2 +-
 docker/hoodie/hadoop/hive_base/pom.xml   | 2 +-
 docker/hoodie/hadoop/namenode/pom.xml| 2 +-
 docker/hoodie/hadoop/pom.xml | 2 +-
 docker/hoodie/hadoop/prestobase/pom.xml  | 2 +-
 docker/hoodie/hadoop/spark_base/pom.xml  | 2 +-
 docker/hoodie/hadoop/sparkadhoc/pom.xml  | 2 +-
 docker/hoodie/hadoop/sparkmaster/pom.xml | 2 +-
 docker/hoodie/hadoop/sparkworker/pom.xml | 2 +-
 docker/hoodie/hadoop/trinobase/pom.xml   | 2 +-
 docker/hoodie/hadoop/trinocoordinator/pom.xml| 2 +-
 docker/hoodie/hadoop/trinoworker/pom.xml | 2 +-
 hudi-aws/pom.xml | 4 ++--
 hudi-cli/pom.xml | 2 +-
 hudi-client/hudi-client-common/pom.xml   | 4 ++--
 hudi-client/hudi-flink-client/pom.xml| 4 ++--
 hudi-client/hudi-java-client/pom.xml | 4 ++--
 hudi-client/hudi-spark-client/pom.xml| 4 ++--
 hudi-client/pom.xml  | 2 +-
 hudi-common/pom.xml  | 2 +-
 hudi-examples/hudi-examples-common/pom.xml   | 2 +-
 hudi-examples/hudi-examples-flink/pom.xml| 2 +-
 hudi-examples/hudi-examples-java/pom.xml | 2 +-
 hudi-examples/hudi-examples-spark/pom.xml| 2 +-
 hudi-examples/pom.xml| 2 +-
 hudi-flink-datasource/hudi-flink/pom.xml | 4 ++--
 hudi-flink-datasource/hudi-flink1.13.x/pom.xml   | 4 ++--
 hudi-flink-datasource/hudi-flink1.14.x/pom.xml   | 4 ++--
 hudi-flink-datasource/hudi-flink1.15.x/pom.xml   | 4 ++--
 hudi-flink-datasource/hudi-flink1.16.x/pom.xml   | 4 ++--
 hudi-flink-datasource/hudi-flink1.17.x/pom.xml   | 4 ++--
 hudi-flink-datasource/pom.xml| 4 ++--
 hudi-gcp/pom.xml | 2 +-
 hudi-hadoop-mr/pom.xml   | 2 +-
 hudi-integ-test/pom.xml  | 2 +-
 hudi-kafka-connect/pom.xml   | 4 ++--
 hudi-platform-service/hudi-metaserver/hudi-metaserver-client/pom.xml | 2 +-
 hudi-platform-service/hudi-metaserver/hudi-metaserver-server/pom.xml | 2 +-
 hudi-platform-service/hudi-metaserver/pom.xml| 4 ++--
 hudi-platform-service/pom.xml| 2 +-
 hudi-spark-datasource/hudi-spark-common/pom.xml  | 4 ++--
 hudi-spark-datasource/hudi-spark/pom.xml | 4 ++--
 hudi-spark-datasource/hudi-spark2-common/pom.xml | 2 +-
 hudi-spark-datasource/hudi-spark2/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3-common/pom.xml | 2 +-
 hudi-spark-datasource/hudi-spark3.0.x/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3.1.x/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3.2.x/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3.2plus-common/pom.xml   | 2 +-
 hudi-spark-datasource/hudi-spark3.3.x/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3.4.x/pom.xml| 4 ++--
 hudi-spark-datasource/pom.xml| 2 +-
 hudi-sync/hudi-adb-sync/pom.xml  | 2 +-
 hudi-sync/hudi-datahub-sync/pom.xml  | 2

svn commit: r64107 - in /dev/hudi/hudi-0.14.0-rc3: ./ hudi-0.14.0-rc3.src.tgz hudi-0.14.0-rc3.src.tgz.asc hudi-0.14.0-rc3.src.tgz.sha512

2023-09-19 Thread pwason
Author: pwason
Date: Tue Sep 19 06:10:39 2023
New Revision: 64107

Log:
Add Apache Hudi 0.14.0 source release candidate 3


Added:
dev/hudi/hudi-0.14.0-rc3/
dev/hudi/hudi-0.14.0-rc3/hudi-0.14.0-rc3.src.tgz   (with props)
dev/hudi/hudi-0.14.0-rc3/hudi-0.14.0-rc3.src.tgz.asc
dev/hudi/hudi-0.14.0-rc3/hudi-0.14.0-rc3.src.tgz.sha512

Added: dev/hudi/hudi-0.14.0-rc3/hudi-0.14.0-rc3.src.tgz
==
Binary file - no diff available.

Propchange: dev/hudi/hudi-0.14.0-rc3/hudi-0.14.0-rc3.src.tgz
--
svn:mime-type = application/octet-stream

Added: dev/hudi/hudi-0.14.0-rc3/hudi-0.14.0-rc3.src.tgz.asc
==
--- dev/hudi/hudi-0.14.0-rc3/hudi-0.14.0-rc3.src.tgz.asc (added)
+++ dev/hudi/hudi-0.14.0-rc3/hudi-0.14.0-rc3.src.tgz.asc Tue Sep 19 06:10:39 
2023
@@ -0,0 +1,16 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQIzBAABCAAdFiEEdcV0Tp5c1cSOGcCCxNhY1zudsbgFAmUJOhkACgkQxNhY1zud
+sbjK0w/9G8SkvhRtZLgl7BTytjOqh+3ryl0O86xt86iF0YMQEH8qT/RMbVWzNF3P
+c45ScKv0WEZ+SI1mdQdlxNiwRp9+G6GvPH+xVlABkIf00qG5UtMSj+FxpmjSfL14
+TYsnXo9Rsknwsbz5Ze9Wr8pBcgq0jsAEZME7tBWI5xOIMA/y3jCNSOyallQNv3Y6
+NgJgHxcjmomzn3GX6uQuhGY+KkegBdqcoOkhnvPmtMFS4P2YR7miaJA7Sb4SFtF5
+Q0nISThEjEoHc3XfxPonSZDUEnQ5PoIGM7PNRvunun1o2I458rXNWslN0avgU67D
+Q/kE47ntWcI/RJTbAvHDvq0e1bKNr5mDgdUZfy7PTugNMoA2vvLQI2ptqujiaLCf
+fhk0o0JaGn0/u9KvsU6qDo7PP1zNyUabrFLuk4QHC4aI0m2PvA4rvIUUCaj70BI/
+pVVI4dsUejJWpMkibNXtIlj4pUdydEIHYTbKu2h5n6tRYjuzLuCOQl6RasnJ3ILn
+dN+rmkoKAgZ85j3VlOqPTcyQ9XXhNrNBE8ZUuAh/zcRkeYnMCwgzlXJl3bqU0MnT
+DQU+S230cpxPHzQoKRlXzlebSBQNAbCAQH8/ddzGBd0Eqeo1jaFXJEBSBrqCp+35
+G95DiPKRKZ7IqI3jJqv/bVBqpC+XNhMx6HMplQug0N00Sd+Pg78=
+=r4Ss
+-----END PGP SIGNATURE-----

Added: dev/hudi/hudi-0.14.0-rc3/hudi-0.14.0-rc3.src.tgz.sha512
==
--- dev/hudi/hudi-0.14.0-rc3/hudi-0.14.0-rc3.src.tgz.sha512 (added)
+++ dev/hudi/hudi-0.14.0-rc3/hudi-0.14.0-rc3.src.tgz.sha512 Tue Sep 19 06:10:39 
2023
@@ -0,0 +1 @@
+4342a8cad955fa4c3e510ac6d65e768643b61e86daf839fa656f8a824bd1d8fcb052ba0bdb4778854348b3a91193e4ec2263c9770d8fe086fe0428f0fc7ed81a
  hudi-0.14.0-rc3.src.tgz




[hudi] annotated tag release-0.14.0-rc3 updated (3477d8b8519 -> 4ce150507c9)

2023-09-19 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a change to annotated tag release-0.14.0-rc3
in repository https://gitbox.apache.org/repos/asf/hudi.git


*** WARNING: tag release-0.14.0-rc3 was modified! ***

from 3477d8b8519 (commit)
  to 4ce150507c9 (tag)
 tagging 3477d8b8519568b9f239ebd6f2518486fe53d32f (commit)
 replaces release-0.14.0-rc2
  by Prashant Wason
  on Mon Sep 18 23:07:26 2023 -0700

- Log -
0.14.0
-----BEGIN PGP SIGNATURE-----

iQIzBAABCAAdFiEEdcV0Tp5c1cSOGcCCxNhY1zudsbgFAmUJOp4ACgkQxNhY1zud
sbgZuhAAtI8n0gdcIuJ+g2L+JK8XN93De9XcTct2uU0FvJBpVC9gJo/efWXDYOLr
qAo785Rzu0uB+DWVIo4NU0+QomNBr2B8boBxvlsx7vy9sp8/GFqJBQ2jKPn8srgU
wfVlFdlir0z1pgRLoqdDZXBTrP0cFRc/GabK5ju/ZaBJjnHPNUFJxHc4LyIF3lQV
Fa714BFu7b38BFdt33SDY/39rCPwNc+wrEaFC8S4MSRpLlBzvOtBoX6RlHeb1Z9H
A4pqOhh+EORceszQ5pFw0SLJAHvQVpyYy0p48VFjLV5hiqDmThCFZNUO8rMUpflK
SwdZ0omIfRkJKIXQEbJYQOwn25q0LP+m4qjqatmShxXOSqG/7rXdBMsmsGetcKh0
4x9kYq6joGWPOA4dGNt9FEMtUuNmimotQ8v8Osl18ikD8bJzfpUx4znQLmzwA3lS
+PMA7WHP2pJbWCUfQHDr7k09I4FbajKPgRR9dvqm2mS3jCfxWGAEU4eoA5JtK7mK
Ig1vXBa4lMCe5ph2xyDc7bv3A3MLc1fuv3qiRx0wj2UEzbTexKCnYSxmZJW4Q/7g
JPkfH13iheJRaibB84R85/QPCnp3GBLgdq1GiENeDCR9Tp5rBRwTTO6Oxj+LRYJW
AwLbc0+4XBezVxXzVkCQ3ipCsHSMUGHlrJU+NRn0K2u7j3aIhEE=
=Ij2i
-----END PGP SIGNATURE-----
---


No new revisions were added by this update.

Summary of changes:



[hudi] 03/04: [HUDI-6863] Revert auto-tuning of dedup parallelism (#9722)

2023-09-19 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit d5d2956a4df70202ef356db1bbd86e0640a19476
Author: Y Ethan Guo 
AuthorDate: Fri Sep 15 18:18:20 2023 -0700

[HUDI-6863] Revert auto-tuning of dedup parallelism (#9722)

Before this PR, the auto-tuning logic for dedup parallelism dictated the
write parallelism, so the user-configured `hoodie.upsert.shuffle.parallelism`
was ignored. This commit reverts #6802 to fix the issue.
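
For illustration, here is a minimal standalone sketch (not the Hudi source; the
numbers and variable names are made up) of why the reverted auto-tuning was
surprising: the removed expression capped the reduce parallelism at the number
of incoming partitions, silently overriding a larger configured value.

```java
// Illustrative only: mirrors the auto-tuning expression removed in the diff
// below, with made-up numbers standing in for real job settings.
public class DedupParallelismSketch {
  public static void main(String[] args) {
    int configuredParallelism = 200; // e.g. hoodie.upsert.shuffle.parallelism
    int inputPartitions = 8;         // partitions of the incoming records

    // Before the revert: capped by the partition count, so 200 silently became 8.
    int autoTuned = Math.max(1, Math.min(inputPartitions, configuredParallelism));

    // After the revert: the user-configured value is used for the reduce step.
    int effective = configuredParallelism;

    System.out.println("auto-tuned = " + autoTuned + ", configured = " + effective);
  }
}
```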
---
 .../org/apache/hudi/table/action/commit/HoodieWriteHelper.java | 7 ++-
 .../client/functional/TestHoodieClientOnCopyOnWriteStorage.java| 6 +++---
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieWriteHelper.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieWriteHelper.java
index d7640c28e50..b56ac08e16f 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieWriteHelper.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieWriteHelper.java
@@ -60,9 +60,6 @@ public class HoodieWriteHelper extends 
BaseWriteHelper> records, HoodieIndex index, int 
parallelism, String schemaStr, TypedProperties props, HoodieRecordMerger 
merger) {
 boolean isIndexingGlobal = index.isGlobal();
 final SerializableSchema schema = new SerializableSchema(schemaStr);
-// Auto-tunes the parallelism for reduce transformation based on the 
number of data partitions
-// in engine-specific representation
-int reduceParallelism = Math.max(1, Math.min(records.getNumPartitions(), 
parallelism));
 return records.mapToPair(record -> {
   HoodieKey hoodieKey = record.getKey();
   // If index used is global, then records are expected to differ in their 
partitionPath
@@ -74,7 +71,7 @@ public class HoodieWriteHelper extends 
BaseWriteHelper {
   HoodieRecord reducedRecord;
   try {
-reducedRecord =  merger.merge(rec1, schema.get(), rec2, schema.get(), 
props).get().getLeft();
+reducedRecord = merger.merge(rec1, schema.get(), rec2, schema.get(), 
props).get().getLeft();
   } catch (IOException e) {
 throw new HoodieException(String.format("Error to merge two records, 
%s, %s", rec1, rec2), e);
   }
@@ -82,6 +79,6 @@ public class HoodieWriteHelper extends 
BaseWriteHelper> dedupedRecsRdd =
 (HoodieData>) 
HoodieWriteHelper.newInstance()
 .deduplicateRecords(records, index, dedupParallelism, 
writeConfig.getSchema(), writeConfig.getProps(), 
HoodiePreCombineAvroRecordMerger.INSTANCE);
 List> dedupedRecs = 
dedupedRecsRdd.collectAsList();
-assertEquals(records.getNumPartitions(), 
dedupedRecsRdd.getNumPartitions());
+assertEquals(dedupParallelism, dedupedRecsRdd.getNumPartitions());
 assertEquals(1, dedupedRecs.size());
 assertEquals(dedupedRecs.get(0).getPartitionPath(), 
recordThree.getPartitionPath());
 assertNodupesWithinPartition(dedupedRecs);
@@ -498,7 +498,7 @@ public class TestHoodieClientOnCopyOnWriteStorage extends 
HoodieClientTestBase {
 (HoodieData>) 
HoodieWriteHelper.newInstance()
 .deduplicateRecords(records, index, dedupParallelism, 
writeConfig.getSchema(), writeConfig.getProps(), 
HoodiePreCombineAvroRecordMerger.INSTANCE);
 dedupedRecs = dedupedRecsRdd.collectAsList();
-assertEquals(records.getNumPartitions(), 
dedupedRecsRdd.getNumPartitions());
+assertEquals(dedupParallelism, dedupedRecsRdd.getNumPartitions());
 assertEquals(2, dedupedRecs.size());
 assertNodupesWithinPartition(dedupedRecs);
 



[hudi] 02/04: [HUDI-6550] Add Hadoop conf to HiveConf for HiveSyncConfig (#9221)

2023-09-19 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit c0907b50079f02bb41a3cc5f97bf7aff77ebda8e
Author: Shawn Chang <42792772+c...@users.noreply.github.com>
AuthorDate: Wed Sep 13 18:26:34 2023 -0700

[HUDI-6550] Add Hadoop conf to HiveConf for HiveSyncConfig (#9221)

This commit fixes the Hive sync config by creating a new HiveConf object every
time HiveSyncConfig is initialized and adding the hadoopConf as a resource. The
Hadoop conf has to be loaded, otherwise properties like `--conf 
spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory`
 cannot be passed in through a Spark Hudi job.

Co-authored-by: Shawn Chang 
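
As a rough sketch of the pattern the fix applies (assuming the Hadoop and Hive
client libraries are on the classpath; the property value is copied from the
example above), the idea is to build a fresh HiveConf and then layer the Hadoop
configuration on top of it so that `spark.hadoop.*` properties become visible
to the Hive client:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.conf.HiveConf;

public class HiveConfLoadingSketch {
  public static void main(String[] args) {
    // Stand-in for the Hadoop conf that Spark builds from spark.hadoop.* properties.
    Configuration hadoopConf = new Configuration();
    hadoopConf.set("hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory");

    // The fixed construction order: start from a plain HiveConf, then add the
    // Hadoop conf as a resource so its properties are consulted on lookups.
    HiveConf hiveConf = new HiveConf();
    hiveConf.addResource(hadoopConf);

    System.out.println(hiveConf.get("hive.metastore.client.factory.class"));
  }
}
```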
---
 .../src/main/java/org/apache/hudi/hive/HiveSyncConfig.java   | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git 
a/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java
 
b/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java
index cf9274d6910..73f25b1615f 100644
--- 
a/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java
+++ 
b/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java
@@ -98,8 +98,9 @@ public class HiveSyncConfig extends HoodieSyncConfig {
 
   public HiveSyncConfig(Properties props, Configuration hadoopConf) {
 super(props, hadoopConf);
-HiveConf hiveConf = hadoopConf instanceof HiveConf
-? (HiveConf) hadoopConf : new HiveConf(hadoopConf, HiveConf.class);
+HiveConf hiveConf = new HiveConf();
+// HiveConf needs to load Hadoop conf to allow instantiation via 
AWSGlueClientFactory
+hiveConf.addResource(hadoopConf);
 setHadoopConf(hiveConf);
 validateParameters();
   }



[hudi] 04/04: Bumping release candidate number 3

2023-09-19 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 794cfe488a4d68667778a91aa407661405e0e195
Author: Prashant Wason 
AuthorDate: Mon Sep 18 22:59:22 2023 -0700

Bumping release candidate number 3
---
 docker/hoodie/hadoop/base/pom.xml| 2 +-
 docker/hoodie/hadoop/base_java11/pom.xml | 2 +-
 docker/hoodie/hadoop/datanode/pom.xml| 2 +-
 docker/hoodie/hadoop/historyserver/pom.xml   | 2 +-
 docker/hoodie/hadoop/hive_base/pom.xml   | 2 +-
 docker/hoodie/hadoop/namenode/pom.xml| 2 +-
 docker/hoodie/hadoop/pom.xml | 2 +-
 docker/hoodie/hadoop/prestobase/pom.xml  | 2 +-
 docker/hoodie/hadoop/spark_base/pom.xml  | 2 +-
 docker/hoodie/hadoop/sparkadhoc/pom.xml  | 2 +-
 docker/hoodie/hadoop/sparkmaster/pom.xml | 2 +-
 docker/hoodie/hadoop/sparkworker/pom.xml | 2 +-
 docker/hoodie/hadoop/trinobase/pom.xml   | 2 +-
 docker/hoodie/hadoop/trinocoordinator/pom.xml| 2 +-
 docker/hoodie/hadoop/trinoworker/pom.xml | 2 +-
 hudi-aws/pom.xml | 4 ++--
 hudi-cli/pom.xml | 2 +-
 hudi-client/hudi-client-common/pom.xml   | 4 ++--
 hudi-client/hudi-flink-client/pom.xml| 4 ++--
 hudi-client/hudi-java-client/pom.xml | 4 ++--
 hudi-client/hudi-spark-client/pom.xml| 4 ++--
 hudi-client/pom.xml  | 2 +-
 hudi-common/pom.xml  | 2 +-
 hudi-examples/hudi-examples-common/pom.xml   | 2 +-
 hudi-examples/hudi-examples-flink/pom.xml| 2 +-
 hudi-examples/hudi-examples-java/pom.xml | 2 +-
 hudi-examples/hudi-examples-spark/pom.xml| 2 +-
 hudi-examples/pom.xml| 2 +-
 hudi-flink-datasource/hudi-flink/pom.xml | 4 ++--
 hudi-flink-datasource/hudi-flink1.13.x/pom.xml   | 4 ++--
 hudi-flink-datasource/hudi-flink1.14.x/pom.xml   | 4 ++--
 hudi-flink-datasource/hudi-flink1.15.x/pom.xml   | 4 ++--
 hudi-flink-datasource/hudi-flink1.16.x/pom.xml   | 4 ++--
 hudi-flink-datasource/hudi-flink1.17.x/pom.xml   | 4 ++--
 hudi-flink-datasource/pom.xml| 4 ++--
 hudi-gcp/pom.xml | 2 +-
 hudi-hadoop-mr/pom.xml   | 2 +-
 hudi-integ-test/pom.xml  | 2 +-
 hudi-kafka-connect/pom.xml   | 4 ++--
 hudi-platform-service/hudi-metaserver/hudi-metaserver-client/pom.xml | 2 +-
 hudi-platform-service/hudi-metaserver/hudi-metaserver-server/pom.xml | 2 +-
 hudi-platform-service/hudi-metaserver/pom.xml| 4 ++--
 hudi-platform-service/pom.xml| 2 +-
 hudi-spark-datasource/hudi-spark-common/pom.xml  | 4 ++--
 hudi-spark-datasource/hudi-spark/pom.xml | 4 ++--
 hudi-spark-datasource/hudi-spark2-common/pom.xml | 2 +-
 hudi-spark-datasource/hudi-spark2/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3-common/pom.xml | 2 +-
 hudi-spark-datasource/hudi-spark3.0.x/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3.1.x/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3.2.x/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3.2plus-common/pom.xml   | 2 +-
 hudi-spark-datasource/hudi-spark3.3.x/pom.xml| 4 ++--
 hudi-spark-datasource/hudi-spark3.4.x/pom.xml| 4 ++--
 hudi-spark-datasource/pom.xml| 2 +-
 hudi-sync/hudi-adb-sync/pom.xml  | 2 +-
 hudi-sync/hudi-datahub-sync/pom.xml  | 2 +-
 hudi-sync/hudi-hive-sync/pom.xml | 2 +-
 hudi-sync/hudi-sync-common/pom.xml   | 2 +-
 hudi-sync/pom.xml| 2

[hudi] branch release-0.14.0 updated (bc3dc019202 -> 794cfe488a4)

2023-09-19 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a change to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git


from bc3dc019202 Resetting the thrift.home property to the default for linux
 new 9f14d507c61 [HUDI-6858] Fix checkpoint reading in Spark structured streaming (#9711)
 new c0907b50079 [HUDI-6550] Add Hadoop conf to HiveConf for HiveSyncConfig (#9221)
 new d5d2956a4df [HUDI-6863] Revert auto-tuning of dedup parallelism (#9722)
 new 794cfe488a4 Bumping release candidate number 3

The 4 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 docker/hoodie/hadoop/base/pom.xml  |   2 +-
 docker/hoodie/hadoop/base_java11/pom.xml   |   2 +-
 docker/hoodie/hadoop/datanode/pom.xml  |   2 +-
 docker/hoodie/hadoop/historyserver/pom.xml |   2 +-
 docker/hoodie/hadoop/hive_base/pom.xml |   2 +-
 docker/hoodie/hadoop/namenode/pom.xml  |   2 +-
 docker/hoodie/hadoop/pom.xml   |   2 +-
 docker/hoodie/hadoop/prestobase/pom.xml|   2 +-
 docker/hoodie/hadoop/spark_base/pom.xml|   2 +-
 docker/hoodie/hadoop/sparkadhoc/pom.xml|   2 +-
 docker/hoodie/hadoop/sparkmaster/pom.xml   |   2 +-
 docker/hoodie/hadoop/sparkworker/pom.xml   |   2 +-
 docker/hoodie/hadoop/trinobase/pom.xml |   2 +-
 docker/hoodie/hadoop/trinocoordinator/pom.xml  |   2 +-
 docker/hoodie/hadoop/trinoworker/pom.xml   |   2 +-
 hudi-aws/pom.xml   |   4 +-
 hudi-cli/pom.xml   |   2 +-
 hudi-client/hudi-client-common/pom.xml |   4 +-
 .../table/action/commit/HoodieWriteHelper.java |   7 +-
 hudi-client/hudi-flink-client/pom.xml  |   4 +-
 hudi-client/hudi-java-client/pom.xml   |   4 +-
 hudi-client/hudi-spark-client/pom.xml  |   4 +-
 .../TestHoodieClientOnCopyOnWriteStorage.java  |   6 +-
 hudi-client/pom.xml|   2 +-
 hudi-common/pom.xml|   2 +-
 .../org/apache/hudi/common/util/CommitUtils.java   |  33 +++---
 .../org/apache/hudi/common/util/StringUtils.java   |   5 +
 .../apache/hudi/common/util/TestCommitUtils.java   | 118 -
 hudi-examples/hudi-examples-common/pom.xml |   2 +-
 hudi-examples/hudi-examples-flink/pom.xml  |   2 +-
 hudi-examples/hudi-examples-java/pom.xml   |   2 +-
 hudi-examples/hudi-examples-spark/pom.xml  |   2 +-
 hudi-examples/pom.xml  |   2 +-
 hudi-flink-datasource/hudi-flink/pom.xml   |   4 +-
 hudi-flink-datasource/hudi-flink1.13.x/pom.xml |   4 +-
 hudi-flink-datasource/hudi-flink1.14.x/pom.xml |   4 +-
 hudi-flink-datasource/hudi-flink1.15.x/pom.xml |   4 +-
 hudi-flink-datasource/hudi-flink1.16.x/pom.xml |   4 +-
 hudi-flink-datasource/hudi-flink1.17.x/pom.xml |   4 +-
 hudi-flink-datasource/pom.xml  |   4 +-
 hudi-gcp/pom.xml   |   2 +-
 hudi-hadoop-mr/pom.xml |   2 +-
 hudi-integ-test/pom.xml|   2 +-
 hudi-kafka-connect/pom.xml |   4 +-
 .../hudi-metaserver/hudi-metaserver-client/pom.xml |   2 +-
 .../hudi-metaserver/hudi-metaserver-server/pom.xml |   2 +-
 hudi-platform-service/hudi-metaserver/pom.xml  |   4 +-
 hudi-platform-service/pom.xml  |   2 +-
 hudi-spark-datasource/hudi-spark-common/pom.xml|   4 +-
 hudi-spark-datasource/hudi-spark/pom.xml   |   4 +-
 hudi-spark-datasource/hudi-spark2-common/pom.xml   |   2 +-
 hudi-spark-datasource/hudi-spark2/pom.xml  |   4 +-
 hudi-spark-datasource/hudi-spark3-common/pom.xml   |   2 +-
 hudi-spark-datasource/hudi-spark3.0.x/pom.xml  |   4 +-
 hudi-spark-datasource/hudi-spark3.1.x/pom.xml  |   4 +-
 hudi-spark-datasource/hudi-spark3.2.x/pom.xml  |   4 +-
 .../hudi-spark3.2plus-common/pom.xml   |   2 +-
 hudi-spark-datasource/hudi-spark3.3.x/pom.xml  |   4 +-
 hudi-spark-datasource/hudi-spark3.4.x/pom.xml  |   4 +-
 hudi-spark-datasource/pom.xml  |   2 +-
 hudi-sync/hudi-adb-sync/pom.xml|   2 +-
 hudi-sync/hudi-datahub-sync/pom.xml|   2 +-
 hudi-sync/hudi-hive-sync/pom.xml   |   2 +-
 .../java/org/apache/hudi/hive/HiveSyncConfig.java  |   5 +-
 hudi-sync/hudi-sync-common/pom.xml |   2 +-
 hudi-sync/pom.xml  |   2 +-
 hudi-tests-common/pom.xml  |   2 +-
 hudi-time

[hudi] 01/04: [HUDI-6858] Fix checkpoint reading in Spark structured streaming (#9711)

2023-09-19 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 9f14d507c6195366c827ed2a7b5609e894841a96
Author: Y Ethan Guo 
AuthorDate: Wed Sep 13 22:45:52 2023 -0700

[HUDI-6858] Fix checkpoint reading in Spark structured streaming (#9711)
---
 .../org/apache/hudi/common/util/CommitUtils.java   |  33 +++---
 .../org/apache/hudi/common/util/StringUtils.java   |   5 +
 .../apache/hudi/common/util/TestCommitUtils.java   | 118 -
 3 files changed, 139 insertions(+), 17 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/util/CommitUtils.java 
b/hudi-common/src/main/java/org/apache/hudi/common/util/CommitUtils.java
index ed31f79e518..07901d14b6b 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/util/CommitUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/util/CommitUtils.java
@@ -164,22 +164,23 @@ public class CommitUtils {
*/
   public static Option 
getValidCheckpointForCurrentWriter(HoodieTimeline timeline, String 
checkpointKey,
   String 
keyToLookup) {
-return (Option) 
timeline.getWriteTimeline().getReverseOrderedInstants().map(instant -> {
-  try {
-HoodieCommitMetadata commitMetadata = HoodieCommitMetadata
-.fromBytes(timeline.getInstantDetails(instant).get(), 
HoodieCommitMetadata.class);
-// process commits only with checkpoint entries
-String checkpointValue = commitMetadata.getMetadata(checkpointKey);
-if (StringUtils.nonEmpty(checkpointValue)) {
-  // return if checkpoint for "keyForLookup" exists.
-  return readCheckpointValue(checkpointValue, keyToLookup);
-} else {
-  return Option.empty();
-}
-  } catch (IOException e) {
-throw new HoodieIOException("Failed to parse HoodieCommitMetadata for 
" + instant.toString(), e);
-  }
-}).filter(Option::isPresent).findFirst().orElse(Option.empty());
+return (Option) 
timeline.getWriteTimeline().filterCompletedInstants().getReverseOrderedInstants()
+.map(instant -> {
+  try {
+HoodieCommitMetadata commitMetadata = HoodieCommitMetadata
+.fromBytes(timeline.getInstantDetails(instant).get(), 
HoodieCommitMetadata.class);
+// process commits only with checkpoint entries
+String checkpointValue = commitMetadata.getMetadata(checkpointKey);
+if (StringUtils.nonEmpty(checkpointValue)) {
+  // return if checkpoint for "keyForLookup" exists.
+  return readCheckpointValue(checkpointValue, keyToLookup);
+} else {
+  return Option.empty();
+}
+  } catch (IOException e) {
+throw new HoodieIOException("Failed to parse HoodieCommitMetadata 
for " + instant.toString(), e);
+  }
+}).filter(Option::isPresent).findFirst().orElse(Option.empty());
   }
 
   public static Option readCheckpointValue(String value, String id) {
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/util/StringUtils.java 
b/hudi-common/src/main/java/org/apache/hudi/common/util/StringUtils.java
index 24200a7a261..d7d79796aec 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/util/StringUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/util/StringUtils.java
@@ -21,6 +21,7 @@ package org.apache.hudi.common.util;
 import javax.annotation.Nullable;
 
 import java.nio.ByteBuffer;
+import java.nio.charset.StandardCharsets;
 import java.util.Collections;
 import java.util.List;
 import java.util.stream.Collectors;
@@ -103,6 +104,10 @@ public class StringUtils {
 return out;
   }
 
+  public static byte[] getUTF8Bytes(String str) {
+return str.getBytes(StandardCharsets.UTF_8);
+  }
+
   public static boolean isNullOrEmpty(String str) {
 return str == null || str.length() == 0;
   }
diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/util/TestCommitUtils.java 
b/hudi-common/src/test/java/org/apache/hudi/common/util/TestCommitUtils.java
index 6d0b2738b3c..e524f298129 100644
--- a/hudi-common/src/test/java/org/apache/hudi/common/util/TestCommitUtils.java
+++ b/hudi-common/src/test/java/org/apache/hudi/common/util/TestCommitUtils.java
@@ -18,20 +18,37 @@
 
 package org.apache.hudi.common.util;
 
+import org.apache.hudi.avro.model.HoodieClusteringPlan;
+import org.apache.hudi.avro.model.HoodieCompactionPlan;
+import org.apache.hudi.avro.model.HoodieCompactionStrategy;
+import org.apache.hudi.avro.model.HoodieRequestedReplaceMetadata;
 import org.apache.hudi.common.model.HoodieCommitMetadata;
 import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
+import org.apache.hudi.common.model.HoodieTableType;
 import org.apa

[hudi] annotated tag release-0.14.0-rc2 updated (bc3dc019202 -> 179756bdcbc)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a change to annotated tag release-0.14.0-rc2
in repository https://gitbox.apache.org/repos/asf/hudi.git


*** WARNING: tag release-0.14.0-rc2 was modified! ***

from bc3dc019202 (commit)
  to 179756bdcbc (tag)
 tagging bc3dc019202d9ca78908cf841a912350f73e7da6 (commit)
 replaces release-0.14.0-rc1
  by Prashant Wason
  on Wed Sep 13 01:23:43 2023 -0700

- Log -
0.14.0
-----BEGIN PGP SIGNATURE-----

iQIzBAABCAAdFiEEdcV0Tp5c1cSOGcCCxNhY1zudsbgFAmUBcY8ACgkQxNhY1zud
sbg8ohAAusQH0ccD7qKUphOqisTpfL6dV04xafwJmUSkF2ocWIC59OMIt9Vz+d5X
3O6geJ7HNmlEkY6rVVT+i4Y9YsZveqJPmIdqbg4Eyc/eAwqg6GcOrKP59OXlkO2N
A7W6KHD0qNds3x0QzJAGTXVTmQu9g/aa0vCEqDEuv44yClJTo6tSma4nJekv8xPQ
BtyoJ7+hECKuoOuMFBCHEgyE9+aWETwqsQUkpvZMYddjI+wUX0uyzvyxooxZ2mk0
mbiA/sMvDRTN+USrWmEJ/HhlzDSUwAtQ47XwJzNE94+4xZcay6kXy4yesUPPTlv4
3xzyJLH7iOhBY99LINNmKaED9PAgQ7RwZ/EpVtnMnUwjFLoW9jAuJq3tNr/OJ/li
ZlkHpcNp3aip4ZQGb3+VrzjrI2wdLwvFE7DloEIy4kijBHF+lHyFjKBBlG6bJjgJ
JjRr5ogZdGgqcT2Xb07YyqLTnHUEG+30+SXJc4NvArDVzvbu+At1Nf2ZdOl6aGP6
sptjMex7LYXyYCied8LfRu7MWDVuv6mmXMIUWQ3ElHA5dBzYyIk9HWWEXDMzaYgz
HXTq6jRz2rfdquFdAHkSIANAGnEyiH6p7KpNPDIbAmsXoAC13FBgBU7P3T0Q+1x7
Zt79w261GxvCiKampxLEHLNIuF6hGtWxKI48/TZPjjCmtAwH7wM=
=Uu8B
-----END PGP SIGNATURE-----
---


No new revisions were added by this update.

Summary of changes:



[hudi] annotated tag release-0.14.0-rc2 deleted (was 1d36c554ed8)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a change to annotated tag release-0.14.0-rc2
in repository https://gitbox.apache.org/repos/asf/hudi.git


*** WARNING: tag release-0.14.0-rc2 was deleted! ***

   tag was  1d36c554ed8

The revisions that were on this annotated tag are still contained in
other references; therefore, this change does not discard any commits
from the repository.



[hudi] branch release-0.14.0 updated: Resetting the thrift.home property to the default for linux

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/release-0.14.0 by this push:
 new bc3dc019202 Resetting the thrift.home property to the default for linux
bc3dc019202 is described below

commit bc3dc019202d9ca78908cf841a912350f73e7da6
Author: Prashant Wason 
AuthorDate: Wed Sep 13 01:20:38 2023 -0700

Resetting the thrift.home property to the default for linux
---
 hudi-platform-service/hudi-metaserver/pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hudi-platform-service/hudi-metaserver/pom.xml 
b/hudi-platform-service/hudi-metaserver/pom.xml
index 15d22f0bc1d..57fb3caac66 100644
--- a/hudi-platform-service/hudi-metaserver/pom.xml
+++ b/hudi-platform-service/hudi-metaserver/pom.xml
@@ -34,7 +34,7 @@
 ${project.parent.basedir}
 1.4.200
 
-/opt/homebrew/
+/usr/local
 docker
 0.1.11
 



svn commit: r63965 - in /dev/hudi/hudi-0.14.0-rc2: ./ hudi-0.14.0-rc2.src.tgz hudi-0.14.0-rc2.src.tgz.asc hudi-0.14.0-rc2.src.tgz.sha512

2023-09-13 Thread pwason
Author: pwason
Date: Wed Sep 13 08:15:49 2023
New Revision: 63965

Log:
Add hudi-0.14.0-rc2 source release


Added:
dev/hudi/hudi-0.14.0-rc2/
dev/hudi/hudi-0.14.0-rc2/hudi-0.14.0-rc2.src.tgz   (with props)
dev/hudi/hudi-0.14.0-rc2/hudi-0.14.0-rc2.src.tgz.asc
dev/hudi/hudi-0.14.0-rc2/hudi-0.14.0-rc2.src.tgz.sha512

Added: dev/hudi/hudi-0.14.0-rc2/hudi-0.14.0-rc2.src.tgz
==
Binary file - no diff available.

Propchange: dev/hudi/hudi-0.14.0-rc2/hudi-0.14.0-rc2.src.tgz
--
svn:mime-type = application/octet-stream

Added: dev/hudi/hudi-0.14.0-rc2/hudi-0.14.0-rc2.src.tgz.asc
==
--- dev/hudi/hudi-0.14.0-rc2/hudi-0.14.0-rc2.src.tgz.asc (added)
+++ dev/hudi/hudi-0.14.0-rc2/hudi-0.14.0-rc2.src.tgz.asc Wed Sep 13 08:15:49 
2023
@@ -0,0 +1,16 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQIzBAABCAAdFiEEdcV0Tp5c1cSOGcCCxNhY1zudsbgFAmUBaZ8ACgkQxNhY1zud
+sbgxRhAAkh70RSMbAu3DRLgK0kW51TPVVZ1BzLnAxrHK+RXFLk8CE8eIMXdidulZ
+olWKx5FccVLAzPuIYMWQKIxNgB4KZeAra30RSgZVKB5oqGS/ugVzfKHL8fkBmxhv
+kkM3dw8b2sPCH5DTCt5H7nC2kPZsSfa1alysv+TGrUgu1uQmGyhwPH8bFgy/4yNG
+xVEptveajNbwwqavcAvXoW4rEZsIjJ507x9pCkdij2Clnr/0sPZg+xhk4AtEhuVx
+Mvap9i97XnUQl5Fid8D5ouesAxUJFB4t3R+uYAbrw0mUOovmcgDmB0d9PvaVv41b
+wEzQzYkjxqZiwWAfttFtWOL3DfSruLnjuV9inscXbNkxM+2QDm74hJZxIr3zvw8U
+rfmb843xyx6CoZOa9QDhYgSqzjs+TAptSGTjF/D/xYQNhb967ybZkHS2Pxqo72R3
+K/j4EfMylgtmisQpa6Y9X48s3E8hNxq95n2lKw71w0nng4S0WEB0z1Jm9tkjFtwM
+qOBIC54SuS+4o3SFkTegDjddHBF9CPEcDVGmc7zqkYKFF3It+g/NkHP5M6CHddNQ
+xKE5ORqhmSTx+EVQWJJ1xiFfpzRNmoZj4o35AXsUFN38QfiBTImPbACOsTruwASu
+uoLsAtZPBHcqred3mj0mjixtxhFrYy3SOE5xAoR98wAU9BuIDJs=
+=88EM
+-----END PGP SIGNATURE-----

Added: dev/hudi/hudi-0.14.0-rc2/hudi-0.14.0-rc2.src.tgz.sha512
==
--- dev/hudi/hudi-0.14.0-rc2/hudi-0.14.0-rc2.src.tgz.sha512 (added)
+++ dev/hudi/hudi-0.14.0-rc2/hudi-0.14.0-rc2.src.tgz.sha512 Wed Sep 13 08:15:49 
2023
@@ -0,0 +1 @@
+98abaa2d25fc3b6d72b27b95560e33edbabb05324f18bce99b780a69f560a163c49ed42fedc2b241dafea64d233ba62f998eed890d1bf7df4416767f27c66769
  hudi-0.14.0-rc2.src.tgz




[hudi] annotated tag release-0.14.0-rc2 updated (3598818dcdc -> 1d36c554ed8)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a change to annotated tag release-0.14.0-rc2
in repository https://gitbox.apache.org/repos/asf/hudi.git


*** WARNING: tag release-0.14.0-rc2 was modified! ***

from 3598818dcdc (commit)
  to 1d36c554ed8 (tag)
 tagging 3598818dcdc78de7cb9811eb18917b832d923798 (commit)
 replaces release-0.14.0-rc1
  by Prashant Wason
  on Wed Sep 13 01:11:36 2023 -0700

- Log -
0.14.0
-----BEGIN PGP SIGNATURE-----

iQIzBAABCAAdFiEEdcV0Tp5c1cSOGcCCxNhY1zudsbgFAmUBbrgACgkQxNhY1zud
sbg1ehAAuXko+FkXjr6A2Cn6DUM+82310Jzkns6UDXpBB65MeARKKroc7kffho0d
W0UeITbBStjsj2G6FewWHORl503HkEjh74Ci0ROY9RRYNvsfq+C5qJ5j0M1QcvZ1
eR7cKZXa72w1LKLuPQYiN1xv1Bp8t5OWRmMDviiglUhfyb0mGuwq5Pbd+g2H8YIl
IAdfPRuKltx1TYl2b1PDh2oSvCAMbAfOBmFMmJ76bfHqFeauNF4Uh45NyNzEJKkv
ELx0dHZ7k2HokeyUiagqP0/xtgP7K+I3RCh47YF5Q8xXAZiPyDS4CTGKIgGPqbrr
zovAld6IYeFmtBo2BzghOVf3NGge4Fz+AC7lqLF4dq/6Nnq0qfhJaItFjKL4u2Pe
cjn84tngPXYacrB/Zq4Zz9ceAA/NyUSz4fiMQ7CNKJZRLrFSDVfZN4auZz7f4Xhn
yi4MZ5jtRvETTLEWadBP6TaXl/0Tc9ZyZ2GfTmwW7+svwXA0eFlSvZVBgHFXNJfH
xA9U/c9+iJ+HsP0EtiK/8Bte+yF2GYuxw586Fy7SHLyH+HIegQhPWv9yrjSLpfEK
O3jGT/YQ+yT96NlwcAq5gY9JCe5/qqKshLOsPnf/CA+s3fKlyF8BWW5/8loNxcbt
v2bpzxlVXtc3KYbrVZi1vtBrjhOGf4kt7nfpw1HCiWlDDASYsRk=
=/xYm
-----END PGP SIGNATURE-----
---


No new revisions were added by this update.

Summary of changes:



[hudi] branch release-0.14.0 updated: Bumping release candidate number 2

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/release-0.14.0 by this push:
 new 3598818dcdc Bumping release candidate number 2
3598818dcdc is described below

commit 3598818dcdc78de7cb9811eb18917b832d923798
Author: Prashant Wason 
AuthorDate: Wed Sep 13 00:46:57 2023 -0700

Bumping release candidate number 2
---
 docker/hoodie/hadoop/base/pom.xml  |  2 +-
 docker/hoodie/hadoop/base_java11/pom.xml   |  2 +-
 docker/hoodie/hadoop/datanode/pom.xml  |  2 +-
 docker/hoodie/hadoop/historyserver/pom.xml |  2 +-
 docker/hoodie/hadoop/hive_base/pom.xml |  2 +-
 docker/hoodie/hadoop/namenode/pom.xml  |  2 +-
 docker/hoodie/hadoop/pom.xml   |  2 +-
 docker/hoodie/hadoop/prestobase/pom.xml|  2 +-
 docker/hoodie/hadoop/spark_base/pom.xml|  2 +-
 docker/hoodie/hadoop/sparkadhoc/pom.xml|  2 +-
 docker/hoodie/hadoop/sparkmaster/pom.xml   |  2 +-
 docker/hoodie/hadoop/sparkworker/pom.xml   |  2 +-
 docker/hoodie/hadoop/trinobase/pom.xml |  2 +-
 docker/hoodie/hadoop/trinocoordinator/pom.xml  |  2 +-
 docker/hoodie/hadoop/trinoworker/pom.xml   |  2 +-
 hudi-aws/pom.xml   |  4 ++--
 hudi-cli/pom.xml   |  2 +-
 hudi-client/hudi-client-common/pom.xml |  4 ++--
 hudi-client/hudi-flink-client/pom.xml  |  4 ++--
 hudi-client/hudi-java-client/pom.xml   |  4 ++--
 hudi-client/hudi-spark-client/pom.xml  |  4 ++--
 hudi-client/pom.xml|  2 +-
 hudi-common/pom.xml|  2 +-
 hudi-examples/hudi-examples-common/pom.xml |  2 +-
 hudi-examples/hudi-examples-flink/pom.xml  |  2 +-
 hudi-examples/hudi-examples-java/pom.xml   |  2 +-
 hudi-examples/hudi-examples-spark/pom.xml  |  2 +-
 hudi-examples/pom.xml  |  2 +-
 hudi-flink-datasource/hudi-flink/pom.xml   |  4 ++--
 hudi-flink-datasource/hudi-flink1.13.x/pom.xml |  4 ++--
 hudi-flink-datasource/hudi-flink1.14.x/pom.xml |  4 ++--
 hudi-flink-datasource/hudi-flink1.15.x/pom.xml |  4 ++--
 hudi-flink-datasource/hudi-flink1.16.x/pom.xml |  4 ++--
 hudi-flink-datasource/hudi-flink1.17.x/pom.xml |  4 ++--
 hudi-flink-datasource/pom.xml  |  4 ++--
 hudi-gcp/pom.xml   |  2 +-
 hudi-hadoop-mr/pom.xml |  2 +-
 hudi-integ-test/pom.xml|  2 +-
 hudi-kafka-connect/pom.xml |  4 ++--
 .../hudi-metaserver/hudi-metaserver-client/pom.xml |  2 +-
 .../hudi-metaserver/hudi-metaserver-server/pom.xml |  2 +-
 hudi-platform-service/hudi-metaserver/pom.xml  |  6 +++---
 hudi-platform-service/pom.xml  |  2 +-
 hudi-spark-datasource/hudi-spark-common/pom.xml|  4 ++--
 hudi-spark-datasource/hudi-spark/pom.xml   |  4 ++--
 hudi-spark-datasource/hudi-spark2-common/pom.xml   |  2 +-
 hudi-spark-datasource/hudi-spark2/pom.xml  |  4 ++--
 hudi-spark-datasource/hudi-spark3-common/pom.xml   |  2 +-
 hudi-spark-datasource/hudi-spark3.0.x/pom.xml  |  4 ++--
 hudi-spark-datasource/hudi-spark3.1.x/pom.xml  |  4 ++--
 hudi-spark-datasource/hudi-spark3.2.x/pom.xml  |  4 ++--
 .../hudi-spark3.2plus-common/pom.xml   |  2 +-
 hudi-spark-datasource/hudi-spark3.3.x/pom.xml  |  4 ++--
 hudi-spark-datasource/hudi-spark3.4.x/pom.xml  |  4 ++--
 hudi-spark-datasource/pom.xml  |  2 +-
 hudi-sync/hudi-adb-sync/pom.xml|  2 +-
 hudi-sync/hudi-datahub-sync/pom.xml|  2 +-
 hudi-sync/hudi-hive-sync/pom.xml   |  2 +-
 hudi-sync/hudi-sync-common/pom.xml |  2 +-
 hudi-sync/pom.xml  |  2 +-
 hudi-tests-common/pom.xml  |  2 +-
 hudi-timeline-service/pom.xml  |  2 +-
 hudi-utilities/pom.xml |  2 +-
 packaging/hudi-aws-bundle/pom.xml  |  2 +-
 packaging/hudi-cli-bundle/pom.xml  |  2 +-
 packaging/hudi-datahub-sync-bundle/pom.xml |  2 +-
 packaging/hudi-flink-bundle/pom.xml|  2 +-
 packaging/hudi-gcp-bundle/pom.xml  |  2 +-
 packaging/hudi-hadoop-mr-bundle/pom.xml|  2 +-
 packaging/hudi-hive-sync-bundle/pom.xml|  2 +-
 packaging/hudi-integ-test-bundle/pom.xml   |  2 +-
 packaging/hudi-kafka-connect-bundle/pom.xml|  2 +-
 packaging/hudi-metaserver-server-bundle/pom.xml|  2 +-
 packaging/hudi-presto-bundle/pom.xml   |  2

[hudi] 36/37: [HUDI-6478] Deduce op as upsert for INSERT INTO (#9665)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 5af6d70399496ff7b11d574e34b3691f3ab3d034
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Tue Sep 12 05:52:20 2023 -0500

[HUDI-6478] Deduce op as upsert for INSERT INTO (#9665)

When users explicitly define primaryKey and preCombineField at CREATE TABLE
time, subsequent INSERT INTO statements deduce the write operation as UPSERT.

-

Co-authored-by: sivabalan 
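
As a hedged illustration of the behaviour described above (the table and column
names are hypothetical, and the Hudi Spark bundle plus SQL extension are assumed
to be on the classpath), the upsert deduction looks roughly like this from
Spark SQL:

```java
import org.apache.spark.sql.SparkSession;

public class InsertIntoUpsertSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("insert-into-upsert-sketch")
        .master("local[2]")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.sql.extensions",
            "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
        // On Spark 3.2+ the Hudi catalog is typically configured as well.
        .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
        .getOrCreate();

    // primaryKey and preCombineField are declared up front at CREATE TABLE time.
    spark.sql("CREATE TABLE IF NOT EXISTS hudi_demo (id INT, name STRING, ts BIGINT) "
        + "USING hudi TBLPROPERTIES (primaryKey = 'id', preCombineField = 'ts')");

    // Because of that, these INSERT INTO statements are deduced as UPSERTs: the
    // second write with the same key replaces the first instead of duplicating it.
    spark.sql("INSERT INTO hudi_demo VALUES (1, 'first', 1000)");
    spark.sql("INSERT INTO hudi_demo VALUES (1, 'second', 2000)");

    spark.sql("SELECT id, name, ts FROM hudi_demo").show();
    spark.stop();
  }
}
```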
---
 .../apache/hudi/AutoRecordKeyGenerationUtils.scala |  11 +-
 .../scala/org/apache/hudi/HoodieWriterUtils.scala  |  31 ++--
 .../spark/sql/hudi/ProvidesHoodieConfig.scala  |  48 +++---
 .../sql/hudi/TestAlterTableDropPartition.scala |   1 -
 .../apache/spark/sql/hudi/TestInsertTable.scala| 161 -
 .../spark/sql/hudi/TestTimeTravelTable.scala   |  22 ++-
 6 files changed, 177 insertions(+), 97 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala
index 6c1b828f3be..f5bbfbf7fef 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala
@@ -20,7 +20,6 @@
 package org.apache.hudi
 
 import org.apache.hudi.DataSourceWriteOptions.{INSERT_DROP_DUPS, 
PRECOMBINE_FIELD}
-import org.apache.hudi.HoodieSparkSqlWriter.getClass
 import org.apache.hudi.common.config.HoodieConfig
 import org.apache.hudi.common.table.HoodieTableConfig
 import org.apache.hudi.config.HoodieWriteConfig
@@ -32,9 +31,7 @@ object AutoRecordKeyGenerationUtils {
   private val log = LoggerFactory.getLogger(getClass)
 
   def mayBeValidateParamsForAutoGenerationOfRecordKeys(parameters: Map[String, 
String], hoodieConfig: HoodieConfig): Unit = {
-val autoGenerateRecordKeys = isAutoGenerateRecordKeys(parameters)
-// hudi will auto generate.
-if (autoGenerateRecordKeys) {
+if (shouldAutoGenerateRecordKeys(parameters)) {
   // de-dup is not supported with auto generation of record keys
   if (parameters.getOrElse(HoodieWriteConfig.COMBINE_BEFORE_INSERT.key(),
 HoodieWriteConfig.COMBINE_BEFORE_INSERT.defaultValue()).toBoolean) {
@@ -54,7 +51,9 @@ object AutoRecordKeyGenerationUtils {
 }
   }
 
-  def isAutoGenerateRecordKeys(parameters: Map[String, String]): Boolean = {
-!parameters.contains(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key()) // if 
record key is not configured,
+  def shouldAutoGenerateRecordKeys(parameters: Map[String, String]): Boolean = 
{
+val recordKeyFromTableConfig = 
parameters.getOrElse(HoodieTableConfig.RECORDKEY_FIELDS.key(), "")
+val recordKeyFromWriterConfig = 
parameters.getOrElse(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key(), "")
+recordKeyFromTableConfig.isEmpty && recordKeyFromWriterConfig.isEmpty
   }
 }
diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala
index 3d043569835..5230c34984f 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala
@@ -17,8 +17,9 @@
 
 package org.apache.hudi
 
+import 
org.apache.hudi.AutoRecordKeyGenerationUtils.shouldAutoGenerateRecordKeys
 import org.apache.hudi.DataSourceOptionsHelper.allAlternatives
-import org.apache.hudi.DataSourceWriteOptions.{RECORD_MERGER_IMPLS, _}
+import org.apache.hudi.DataSourceWriteOptions._
 import org.apache.hudi.common.config.HoodieMetadataConfig.ENABLE
 import org.apache.hudi.common.config.{DFSPropertiesConfiguration, 
HoodieCommonConfig, HoodieConfig, TypedProperties}
 import org.apache.hudi.common.model.{HoodieRecord, WriteOperationType}
@@ -29,11 +30,10 @@ import org.apache.hudi.hive.HiveSyncConfigHolder
 import org.apache.hudi.keygen.{NonpartitionedKeyGenerator, SimpleKeyGenerator}
 import org.apache.hudi.sync.common.HoodieSyncConfig
 import org.apache.hudi.util.SparkKeyGenUtils
-import org.apache.spark.sql.{Dataset, Row, SparkSession}
 import org.apache.spark.sql.hudi.command.{MergeIntoKeyGenerator, 
SqlKeyGenerator}
+import org.apache.spark.sql.{Dataset, Row, SparkSession}
 import org.slf4j.LoggerFactory
 
-import java.util.Properties
 import scala.collection.JavaConversions.mapAsJavaMap
 import scala.collection.JavaConverters._
 
@@ -43,12 +43,10 @@ import scala.collection.JavaConverters._
 object HoodieWriterUtils {
 
   private val log = LoggerFactory.getLogger(getClass)
+

[hudi] 16/37: [HUDI-6819] Fix logic for throwing exception in getRecordIndexUpdates. (#9616)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 46c170425a7ac332e600941f3a06ff18f3c9aca4
Author: Amrish Lal 
AuthorDate: Tue Sep 5 21:31:29 2023 -0700

[HUDI-6819] Fix logic for throwing exception in getRecordIndexUpdates. 
(#9616)

* [HUDI-6819] Fix logic for throwing exception in 
HoodieBackedTableMetadataWriter.
---
 .../org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index e99ec493558..460bfa2c6e2 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -1411,8 +1411,8 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableM
   .flatMapToPair(Stream::iterator)
   .reduceByKey((recordDelegate1, recordDelegate2) -> {
 if 
(recordDelegate1.getRecordKey().equals(recordDelegate2.getRecordKey())) {
-  if (recordDelegate1.getNewLocation().isPresent() && 
recordDelegate2.getNewLocation().isPresent()) {
-throw new HoodieIOException("Both version of records does not 
have location set. Record V1 " + recordDelegate1.toString()
+  if (!recordDelegate1.getNewLocation().isPresent() && 
!recordDelegate2.getNewLocation().isPresent()) {
+throw new HoodieIOException("Both version of records do not 
have location set. Record V1 " + recordDelegate1.toString()
 + ", Record V2 " + recordDelegate2.toString());
   }
   if (recordDelegate1.getNewLocation().isPresent()) {



[hudi] 20/37: [HUDI-6833] Add field for tracking log files from failed commit in rollback metadata (#9653)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit a948fa091584fa8c4fa01bf2cd5cab8f924a3540
Author: Lokesh Jain 
AuthorDate: Fri Sep 8 23:19:12 2023 +0530

[HUDI-6833] Add field for tracking log files from failed commit in rollback 
metadata (#9653)

 [HUDI-6833] Add field for tracking log files from failed commit in 
rollback metadata
---
 .../hudi/table/action/rollback/RollbackUtils.java|  6 --
 .../src/main/avro/HoodieRollbackMetadata.avsc| 13 -
 .../org/apache/hudi/common/HoodieRollbackStat.java   | 20 ++--
 .../common/table/timeline/TimelineMetadataUtils.java |  2 +-
 .../apache/hudi/common/table/TestTimelineUtils.java  |  3 ++-
 .../common/table/view/TestIncrementalFSViewSync.java |  2 +-
 6 files changed, 38 insertions(+), 8 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/RollbackUtils.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/RollbackUtils.java
index f350b71da82..c3ee30ed3f4 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/RollbackUtils.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/RollbackUtils.java
@@ -82,14 +82,16 @@ public class RollbackUtils {
 final List successDeleteFiles = new ArrayList<>();
 final List failedDeleteFiles = new ArrayList<>();
 final Map commandBlocksCount = new HashMap<>();
-final Map writtenLogFileSizeMap = new HashMap<>();
+final Map logFilesFromFailedCommit = new HashMap<>();
 
Option.ofNullable(stat1.getSuccessDeleteFiles()).ifPresent(successDeleteFiles::addAll);
 
Option.ofNullable(stat2.getSuccessDeleteFiles()).ifPresent(successDeleteFiles::addAll);
 
Option.ofNullable(stat1.getFailedDeleteFiles()).ifPresent(failedDeleteFiles::addAll);
 
Option.ofNullable(stat2.getFailedDeleteFiles()).ifPresent(failedDeleteFiles::addAll);
 
Option.ofNullable(stat1.getCommandBlocksCount()).ifPresent(commandBlocksCount::putAll);
 
Option.ofNullable(stat2.getCommandBlocksCount()).ifPresent(commandBlocksCount::putAll);
-return new HoodieRollbackStat(stat1.getPartitionPath(), 
successDeleteFiles, failedDeleteFiles, commandBlocksCount);
+
Option.ofNullable(stat1.getLogFilesFromFailedCommit()).ifPresent(logFilesFromFailedCommit::putAll);
+
Option.ofNullable(stat2.getLogFilesFromFailedCommit()).ifPresent(logFilesFromFailedCommit::putAll);
+return new HoodieRollbackStat(stat1.getPartitionPath(), 
successDeleteFiles, failedDeleteFiles, commandBlocksCount, 
logFilesFromFailedCommit);
   }
 
 }
diff --git a/hudi-common/src/main/avro/HoodieRollbackMetadata.avsc 
b/hudi-common/src/main/avro/HoodieRollbackMetadata.avsc
index 5a300cda9e6..727a1461d99 100644
--- a/hudi-common/src/main/avro/HoodieRollbackMetadata.avsc
+++ b/hudi-common/src/main/avro/HoodieRollbackMetadata.avsc
@@ -38,7 +38,18 @@
 "type": "long",
 "doc": "Size of this file in bytes"
 }
-}], "default":null }
+}], "default":null },
+{"name": "logFilesFromFailedCommit",
+ "type": ["null", {
+   "type": "map",
+   "doc": "Log files from the failed commit(commit to be rolled 
back)",
+   "values": {
+   "type": "long",
+   "doc": "Size of this file in bytes"
+   }
+ }],
+ "default":null
+}
 ]
  }}},
  {
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/HoodieRollbackStat.java 
b/hudi-common/src/main/java/org/apache/hudi/common/HoodieRollbackStat.java
index a3191fa026c..ba546866b54 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/HoodieRollbackStat.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/HoodieRollbackStat.java
@@ -39,12 +39,15 @@ public class HoodieRollbackStat implements Serializable {
   // Count of HoodieLogFile to commandBlocks written for a particular rollback
   private final Map commandBlocksCount;
 
+  private final Map logFilesFromFailedCommit;
+
   public HoodieRollbackStat(String partitionPath, List 
successDeleteFiles, List failedDeleteFiles,
-  Map commandBlocksCount) {
+Map commandBlocksCount, 
Map logFilesFromFailedCommit) {
 this.partitionPath = partitionPath;
 this.successDeleteFiles = successDeleteFiles;
 this.failedDeleteFiles = failedDeleteFiles;
 this.commandBlocksCount = commandBlocksCount;
+thi

[hudi] 27/37: [HUDI-6738] - Apply object filter before checkpoint batching in GcsEventsHoodieIncrSource (#9538)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 5b99ed406caac976d893c3fb0250163808c00cca
Author: lokesh-lingarajan-0310 
<84048984+lokesh-lingarajan-0...@users.noreply.github.com>
AuthorDate: Mon Sep 11 10:26:24 2023 -0700

[HUDI-6738] - Apply object filter before checkpoint batching in 
GcsEventsHoodieIncrSource (#9538)

Apply the object filter before we start checkpoint batching.
This change brings the GCS incremental source in line with the S3 one.

-

Co-authored-by: Lokesh Lingarajan 

Co-authored-by: sivabalan 
---
 .../sources/GcsEventsHoodieIncrSource.java |   3 +-
 .../helpers/gcs/GcsObjectMetadataFetcher.java  |  17 ++-
 .../sources/TestGcsEventsHoodieIncrSource.java | 169 ++---
 3 files changed, 63 insertions(+), 126 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/GcsEventsHoodieIncrSource.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/GcsEventsHoodieIncrSource.java
index 891881095fd..d09bad71916 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/GcsEventsHoodieIncrSource.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/GcsEventsHoodieIncrSource.java
@@ -172,10 +172,11 @@ public class GcsEventsHoodieIncrSource extends 
HoodieIncrSource {
 }
 
 Dataset cloudObjectMetadataDF = queryRunner.run(queryInfo);
+Dataset filteredSourceData = 
gcsObjectMetadataFetcher.applyFilter(cloudObjectMetadataDF);
 LOG.info("Adjusting end checkpoint:" + queryInfo.getEndInstant() + " based 
on sourceLimit :" + sourceLimit);
 Pair>> checkPointAndDataset 
=
 IncrSourceHelper.filterAndGenerateCheckpointBasedOnSourceLimit(
-cloudObjectMetadataDF, sourceLimit, queryInfo, 
cloudObjectIncrCheckpoint);
+filteredSourceData, sourceLimit, queryInfo, 
cloudObjectIncrCheckpoint);
 if (!checkPointAndDataset.getRight().isPresent()) {
   LOG.info("Empty source, returning endpoint:" + 
queryInfo.getEndInstant());
   return Pair.of(Option.empty(), queryInfo.getEndInstant());
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/gcs/GcsObjectMetadataFetcher.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/gcs/GcsObjectMetadataFetcher.java
index 08116ac0fa5..c92901d14cf 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/gcs/GcsObjectMetadataFetcher.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/gcs/GcsObjectMetadataFetcher.java
@@ -78,19 +78,26 @@ public class GcsObjectMetadataFetcher implements 
Serializable {
* @return A {@link List} of {@link CloudObjectMetadata} containing GCS info.
*/
   public List getGcsObjectMetadata(JavaSparkContext jsc, 
Dataset cloudObjectMetadataDF, boolean checkIfExists) {
-String filter = createFilter();
-LOG.info("Adding filter string to Dataset: " + filter);
-
 SerializableConfiguration serializableHadoopConf = new 
SerializableConfiguration(jsc.hadoopConfiguration());
-
 return cloudObjectMetadataDF
-.filter(filter)
 .select("bucket", "name", "size")
 .distinct()
 .mapPartitions(getCloudObjectMetadataPerPartition(GCS_PREFIX, 
serializableHadoopConf, checkIfExists), 
Encoders.kryo(CloudObjectMetadata.class))
 .collectAsList();
   }
 
+  /**
+   * @param cloudObjectMetadataDF a Dataset that contains metadata of GCS 
objects. Assumed to be a persisted form
+   *  of a Cloud Storage Pubsub Notification event.
+   * @return Dataset after apply the filtering.
+   */
+  public Dataset applyFilter(Dataset cloudObjectMetadataDF) {
+String filter = createFilter();
+LOG.info("Adding filter string to Dataset: " + filter);
+
+return cloudObjectMetadataDF.filter(filter);
+  }
+
   /**
* Add optional filters that narrow down the list of GCS objects to fetch.
*/
diff --git 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestGcsEventsHoodieIncrSource.java
 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestGcsEventsHoodieIncrSource.java
index cc80123a19c..5c31f310800 100644
--- 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestGcsEventsHoodieIncrSource.java
+++ 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestGcsEventsHoodieIncrSource.java
@@ -39,7 +39,6 @@ import 
org.apache.hudi.testutils.SparkClientFunctionalTestHarness;
 import org.apache.hudi.utilities.schema.FilebasedSchemaProvider;
 import org.apache.hudi.utilities.schema.SchemaProvider;
 import org.apache.hudi.utilities.sources.helpers.Clou

[hudi] 22/37: [HUDI-6820] Fixing CI stability issues (#9661)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 688d6c07a2110a2dba0286f8277cfa8cb4bdb881
Author: Lokesh Jain 
AuthorDate: Sat Sep 9 08:43:29 2023 +0530

[HUDI-6820] Fixing CI stability issues (#9661)

- We face frequent flakiness in two modules (hudi-hadoop-mr and
hudi-java-client), so they are moved out of Azure CI and into GitHub Actions.
- Added explicit timeouts to a few of the deltastreamer continuous tests so that
they fail instead of timing out; a minimal sketch of the timeout annotation is below.
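
For reference, a minimal sketch of the JUnit 5 annotation used for those timeouts
(the test class and body here are placeholders, not one of the actual
deltastreamer tests):

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.Timeout;

class ContinuousModeTimeoutExample {

  // Fails the test after 600 seconds (the annotation's default unit) instead of
  // letting the CI job hang until the global build timeout.
  @Test
  @Timeout(600)
  void continuousIngestFinishesInTime() throws InterruptedException {
    Thread.sleep(100); // stand-in for the long-running continuous ingestion loop
  }
}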

-

Co-authored-by: sivabalan 
---
 .github/workflows/bot.yml  | 32 ++
 azure-pipelines-20230430.yml   |  2 ++
 .../deltastreamer/TestHoodieDeltaStreamer.java |  5 
 3 files changed, 39 insertions(+)

diff --git a/.github/workflows/bot.yml b/.github/workflows/bot.yml
index 0811c828e49..acd51b8e123 100644
--- a/.github/workflows/bot.yml
+++ b/.github/workflows/bot.yml
@@ -112,6 +112,38 @@ jobs:
 run:
   mvn test -Pfunctional-tests -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-pl "$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS
 
+  test-hudi-hadoop-mr-and-hudi-java-client:
+runs-on: ubuntu-latest
+strategy:
+  matrix:
+include:
+  - scalaProfile: "scala-2.12"
+sparkProfile: "spark3.2"
+flinkProfile: "flink1.17"
+
+steps:
+  - uses: actions/checkout@v3
+  - name: Set up JDK 8
+uses: actions/setup-java@v3
+with:
+  java-version: '8'
+  distribution: 'adopt'
+  architecture: x64
+  - name: Build Project
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+  FLINK_PROFILE: ${{ matrix.flinkProfile }}
+run:
+  mvn clean install -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-D"FLINK_PROFILE" -DskipTests=true -Phudi-platform-service $MVN_ARGS
+  - name: UT - hudi-hadoop-mr and hudi-client/hudi-java-client
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+  FLINK_PROFILE: ${{ matrix.flinkProfile }}
+run:
+  mvn test -Punit-tests -fae -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-D"FLINK_PROFILE" -pl hudi-hadoop-mr,hudi-client/hudi-java-client $MVN_ARGS
+
   test-spark-java17:
 runs-on: ubuntu-latest
 strategy:
diff --git a/azure-pipelines-20230430.yml b/azure-pipelines-20230430.yml
index 2da5ab0d4f9..25a149b5cf4 100644
--- a/azure-pipelines-20230430.yml
+++ b/azure-pipelines-20230430.yml
@@ -53,6 +53,8 @@ parameters:
   - name: job4UTModules
 type: object
 default:
+  - '!hudi-hadoop-mr'
+  - '!hudi-client/hudi-java-client'
   - '!hudi-client/hudi-spark-client'
   - '!hudi-common'
   - '!hudi-examples'
diff --git 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieDeltaStreamer.java
 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieDeltaStreamer.java
index 6324fb83fc9..2a7db25647e 100644
--- 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieDeltaStreamer.java
+++ 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieDeltaStreamer.java
@@ -120,6 +120,7 @@ import org.apache.spark.sql.types.StructField;
 import org.junit.jupiter.api.Assertions;
 import org.junit.jupiter.api.Disabled;
 import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.Timeout;
 import org.junit.jupiter.params.ParameterizedTest;
 import org.junit.jupiter.params.provider.Arguments;
 import org.junit.jupiter.params.provider.CsvSource;
@@ -869,6 +870,7 @@ public class TestHoodieDeltaStreamer extends 
HoodieDeltaStreamerTestBase {
 defaultSchemaProviderClassName = FilebasedSchemaProvider.class.getName();
   }
 
+  @Timeout(600)
   @ParameterizedTest
   @EnumSource(value = HoodieRecordType.class, names = {"AVRO", "SPARK"})
   public void testUpsertsCOWContinuousMode(HoodieRecordType recordType) throws 
Exception {
@@ -892,12 +894,14 @@ public class TestHoodieDeltaStreamer extends 
HoodieDeltaStreamerTestBase {
 UtilitiesTestBase.Helpers.deleteFileFromDfs(fs, tableBasePath);
   }
 
+  @Timeout(600)
   @ParameterizedTest
   @EnumSource(value = HoodieRecordType.class, names = {"AVRO"})
   public void testUpsertsMORContinuousModeShutdownGracefully(HoodieRecordType 
recordType) throws Exception {
 testUpsertsContinuousMode(HoodieTableType.MERGE_ON_READ, "continuous_cow", 
true, recordType);
   }
 
+  @Timeout(600)
   @ParameterizedTest
   @EnumSource(value = HoodieRecordType.class, names = {"AVRO", "SPARK"})
   publi

[hudi] 32/37: [MINOR] Avoid ingesting update records to RLI (#9675)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit a03483f09c0522d7c71b673bed14f24041de7aa2
Author: Sivabalan Narayanan 
AuthorDate: Tue Sep 12 01:59:28 2023 -0400

[MINOR] Avoid ingesting update records to RLI (#9675)
---
 .../apache/hudi/metadata/HoodieBackedTableMetadataWriter.java | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index 8a930ba5972..c548bfcfeae 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -1434,7 +1434,7 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableM
 return recordKeyDelegatePairs
 .map(writeStatusRecordDelegate -> {
   HoodieRecordDelegate recordDelegate = 
writeStatusRecordDelegate.getValue();
-  HoodieRecord hoodieRecord;
+  HoodieRecord hoodieRecord = null;
   Option newLocation = 
recordDelegate.getNewLocation();
   if (newLocation.isPresent()) {
 if (recordDelegate.getCurrentLocation().isPresent()) {
@@ -1448,11 +1448,12 @@ public abstract class 
HoodieBackedTableMetadataWriter implements HoodieTableM
 LOG.error(msg);
 throw new HoodieMetadataException(msg);
   }
+  // for updates, we can skip updating RLI partition in MDT
+} else {
+  hoodieRecord = HoodieMetadataPayload.createRecordIndexUpdate(
+  recordDelegate.getRecordKey(), 
recordDelegate.getPartitionPath(),
+  newLocation.get().getFileId(), 
newLocation.get().getInstantTime(), dataWriteConfig.getWritesFileIdEncoding());
 }
-
-hoodieRecord = HoodieMetadataPayload.createRecordIndexUpdate(
-recordDelegate.getRecordKey(), 
recordDelegate.getPartitionPath(),
-newLocation.get().getFileId(), 
newLocation.get().getInstantTime(), dataWriteConfig.getWritesFileIdEncoding());
   } else {
 // Delete existing index for a deleted record
 hoodieRecord = 
HoodieMetadataPayload.createRecordIndexDelete(recordDelegate.getRecordKey());



[hudi] 37/37: [HUDI-6724] - Defaulting previous Instant time to init time to enable full read of initial commit (#9473)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 63a37211384f320b3e4af00a8f2dd46dd280e9cd
Author: lokesh-lingarajan-0310 
<84048984+lokesh-lingarajan-0...@users.noreply.github.com>
AuthorDate: Tue Sep 12 05:45:44 2023 -0700

[HUDI-6724] - Defaulting previous Instant time to init time to enable full 
read of initial commit (#9473)

This happens on new onboarding because the old code initializes
prev = start = first-commit time;
the incremental read that follows always fetches entries > prev,
in which case part of the first commit is skipped during processing.
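
A minimal, self-contained sketch of the resulting checkpoint resolution (plain
strings instead of the actual Hudi timeline classes; the DEFAULT_BEGIN_TIMESTAMP
value and the instant times below are illustrative assumptions):

import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class PreviousInstantSketch {

  static final String DEFAULT_BEGIN_TIMESTAMP = "000";

  // completedInstants is assumed to be the sorted list of completed commit times.
  static String previousInstant(List<String> completedInstants, String beginInstantTime) {
    if (beginInstantTime.equals(DEFAULT_BEGIN_TIMESTAMP)) {
      return DEFAULT_BEGIN_TIMESTAMP;
    }
    // Completed commit strictly before beginInstantTime, if one exists.
    Optional<String> before = completedInstants.stream()
        .filter(t -> t.compareTo(beginInstantTime) < 0)
        .reduce((first, second) -> second);
    if (before.isPresent()) {
      return before.get();
    }
    // beginInstantTime is the first commit on the timeline: back off by one so that
    // the incremental read (entries > previous) still covers the whole first commit.
    if (!completedInstants.isEmpty() && completedInstants.get(0).equals(beginInstantTime)) {
      return String.valueOf(Long.parseLong(beginInstantTime) - 1);
    }
    return DEFAULT_BEGIN_TIMESTAMP;
  }

  public static void main(String[] args) {
    List<String> timeline = Arrays.asList("20230912054544", "20230912060000");
    System.out.println(previousInstant(timeline, "20230912054544")); // 20230912054543
    System.out.println(previousInstant(timeline, "20230912060000")); // 20230912054544
  }
}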

-

Co-authored-by: Lokesh Lingarajan 

Co-authored-by: sivabalan 
---
 .../sources/helpers/IncrSourceHelper.java  |  11 +-
 .../utilities/sources/helpers/QueryRunner.java |   6 ++
 .../sources/helpers/TestIncrSourceHelper.java  | 120 +
 3 files changed, 136 insertions(+), 1 deletion(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java
index ceec1851ee9..8b40edcf044 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java
@@ -130,11 +130,20 @@ public class IncrSourceHelper {
   }
 });
 
-String previousInstantTime = beginInstantTime;
+// When `beginInstantTime` is present, `previousInstantTime` is set to the 
completed commit before `beginInstantTime` if that exists.
+// If there is no completed commit before `beginInstantTime`, e.g., 
`beginInstantTime` is the first commit in the active timeline,
+// `previousInstantTime` is set to `DEFAULT_BEGIN_TIMESTAMP`.
+String previousInstantTime = DEFAULT_BEGIN_TIMESTAMP;
 if (!beginInstantTime.equals(DEFAULT_BEGIN_TIMESTAMP)) {
   Option previousInstant = 
activeCommitTimeline.findInstantBefore(beginInstantTime);
   if (previousInstant.isPresent()) {
 previousInstantTime = previousInstant.get().getTimestamp();
+  } else {
+// if begin instant time matches first entry in active timeline, we 
can set previous = beginInstantTime - 1
+if 
(activeCommitTimeline.filterCompletedInstants().firstInstant().isPresent()
+&& 
activeCommitTimeline.filterCompletedInstants().firstInstant().get().getTimestamp().equals(beginInstantTime))
 {
+  previousInstantTime = 
String.valueOf(Long.parseLong(beginInstantTime) - 1);
+}
   }
 }
 
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/QueryRunner.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/QueryRunner.java
index f65930d18ff..761e942549c 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/QueryRunner.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/QueryRunner.java
@@ -54,6 +54,12 @@ public class QueryRunner {
 this.sourcePath = getStringWithAltKeys(props, 
HoodieIncrSourceConfig.HOODIE_SRC_BASE_PATH);
   }
 
+  /**
+   * This is used to execute queries for cloud stores incremental pipelines.
+   * Regular Hudi incremental queries do not take this flow.
+   * @param queryInfo all meta info about the query to be executed.
+   * @return the output of the query as Dataset < Row >.
+   */
   public Dataset run(QueryInfo queryInfo) {
 Dataset dataset = null;
 if (queryInfo.isIncremental()) {
diff --git 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestIncrSourceHelper.java
 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestIncrSourceHelper.java
index 78020697c2e..9ce864aceae 100644
--- 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestIncrSourceHelper.java
+++ 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestIncrSourceHelper.java
@@ -18,13 +18,31 @@
 
 package org.apache.hudi.utilities.sources.helpers;
 
+import org.apache.hudi.client.SparkRDDWriteClient;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.model.HoodieAvroPayload;
+import org.apache.hudi.common.model.HoodieAvroRecord;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.TimelineUtils;
+import org.apache.hudi.common.table.timeline.versioning.TimelineLayoutVersion;
+import org.apache.hudi.common.testutils.SchemaTestUtil;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.c

[hudi] 31/37: [MINOR] Add timeout for github check test-hudi-hadoop-mr-and-hudi-java-client (#9682)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit f33265d3bd87212b5ef924bd4fb6665365ecb617
Author: Lokesh Jain 
AuthorDate: Tue Sep 12 05:38:42 2023 +0530

[MINOR] Add timeout for github check 
test-hudi-hadoop-mr-and-hudi-java-client (#9682)
---
 .github/workflows/bot.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/bot.yml b/.github/workflows/bot.yml
index acd51b8e123..7708b2c9536 100644
--- a/.github/workflows/bot.yml
+++ b/.github/workflows/bot.yml
@@ -16,7 +16,6 @@ on:
   - '**.png'
   - '**.svg'
   - '**.yaml'
-  - '**.yml'
   - '.gitignore'
 branches:
   - master
@@ -114,6 +113,7 @@ jobs:
 
   test-hudi-hadoop-mr-and-hudi-java-client:
 runs-on: ubuntu-latest
+timeout-minutes: 40
 strategy:
   matrix:
 include:



[hudi] 29/37: [HUDI-6753] Fix parquet inline reading flaky test (#9618)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 456f6731cc4fb29abbc3c9fbd51a9c798efab310
Author: Lokesh Jain 
AuthorDate: Tue Sep 12 02:04:24 2023 +0530

[HUDI-6753] Fix parquet inline reading flaky test (#9618)
---
 .../deltastreamer/HoodieDeltaStreamerTestBase.java | 269 +++-
 .../deltastreamer/TestHoodieDeltaStreamer.java | 472 +
 .../TestHoodieDeltaStreamerDAGExecution.java   |   4 +-
 .../TestHoodieDeltaStreamerWithMultiWriter.java| 127 +++---
 .../TestHoodieMultiTableDeltaStreamer.java |  12 +-
 .../utilities/deltastreamer/TestTransformer.java   |   4 +-
 .../utilities/testutils/UtilitiesTestBase.java |   3 +-
 7 files changed, 462 insertions(+), 429 deletions(-)

diff --git 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerTestBase.java
 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerTestBase.java
index 3c5b45b35c1..b117b2001fa 100644
--- 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerTestBase.java
+++ 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerTestBase.java
@@ -21,6 +21,7 @@ package org.apache.hudi.utilities.deltastreamer;
 
 import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.model.WriteOperationType;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
@@ -34,6 +35,7 @@ import org.apache.hudi.config.HoodieClusteringConfig;
 import org.apache.hudi.hive.MultiPartKeysValueExtractor;
 import org.apache.hudi.utilities.config.SourceTestConfig;
 import org.apache.hudi.utilities.schema.FilebasedSchemaProvider;
+import org.apache.hudi.utilities.sources.HoodieIncrSource;
 import org.apache.hudi.utilities.sources.TestDataSource;
 import org.apache.hudi.utilities.sources.TestParquetDFSSourceEmptyBatch;
 import org.apache.hudi.utilities.testutils.UtilitiesTestBase;
@@ -41,18 +43,27 @@ import 
org.apache.hudi.utilities.testutils.UtilitiesTestBase;
 import org.apache.avro.Schema;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SQLContext;
 import org.apache.spark.streaming.kafka010.KafkaTestUtils;
 import org.junit.jupiter.api.AfterAll;
 import org.junit.jupiter.api.BeforeAll;
 import org.junit.jupiter.api.BeforeEach;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 import java.io.IOException;
 import java.nio.charset.StandardCharsets;
 import java.util.ArrayList;
 import java.util.Collections;
+import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
 import java.util.Random;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.util.concurrent.TimeUnit;
+import java.util.function.Function;
 
 import static org.apache.hudi.common.util.StringUtils.nonEmpty;
 import static org.apache.hudi.hive.HiveSyncConfigHolder.HIVE_URL;
@@ -62,9 +73,14 @@ import static 
org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_DATABASE_NA
 import static 
org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS;
 import static 
org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_PARTITION_FIELDS;
 import static 
org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_TABLE_NAME;
+import static org.apache.hudi.utilities.streamer.HoodieStreamer.CHECKPOINT_KEY;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertTrue;
 
 public class HoodieDeltaStreamerTestBase extends UtilitiesTestBase {
 
+  private static final Logger LOG = 
LoggerFactory.getLogger(HoodieDeltaStreamerTestBase.class);
+
   static final Random RANDOM = new Random();
   static final String PROPS_FILENAME_TEST_SOURCE = "test-source.properties";
   static final String PROPS_FILENAME_TEST_SOURCE1 = "test-source1.properties";
@@ -111,6 +127,8 @@ public class HoodieDeltaStreamerTestBase extends 
UtilitiesTestBase {
   protected static String defaultSchemaProviderClassName = 
FilebasedSchemaProvider.class.getName();
   protected static int testNum = 1;
 
+  Map hudiOpts = new HashMap<>();
+
   protected static void prepareTestSetup() throws IOException {
 PARQUET_SOURCE_ROOT = basePath + "/parquetFiles";
 ORC_SOURCE_ROOT = basePath + "/orcFiles";
@@ -230,8 +248,9 @@ public class HoodieDeltaStreamerTestBase extends 
UtilitiesTestBase {
   }
 
   @BeforeEach
-  public void resetTestDataSource() {
+  public void setupTest() {
 TestDataSource.returnEmptyBatch = false;
+hudiOpts = ne

[hudi] 24/37: [HUDI-6831] Add back missing project_id to query statement in BigQuerySyncTool (#9650)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 4af3b7eefa67822443d013ac4632089e02b97303
Author: Jinpeng 
AuthorDate: Sun Sep 10 21:12:28 2023 -0400

[HUDI-6831] Add back missing project_id to query statement in 
BigQuerySyncTool (#9650)

Co-authored-by: jp0317 
---
 .../java/org/apache/hudi/gcp/bigquery/HoodieBigQuerySyncClient.java| 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git 
a/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/HoodieBigQuerySyncClient.java
 
b/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/HoodieBigQuerySyncClient.java
index 17990e76929..8c8372a992a 100644
--- 
a/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/HoodieBigQuerySyncClient.java
+++ 
b/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/HoodieBigQuerySyncClient.java
@@ -94,8 +94,9 @@ public class HoodieBigQuerySyncClient extends 
HoodieSyncClient {
   }
   String query =
   String.format(
-  "CREATE EXTERNAL TABLE `%s.%s` %s OPTIONS (%s "
+  "CREATE EXTERNAL TABLE `%s.%s.%s` %s OPTIONS (%s "
   + "uris=[\"%s\"], format=\"PARQUET\", 
file_set_spec_type=\"NEW_LINE_DELIMITED_MANIFEST\")",
+  projectId,
   datasetName,
   tableName,
   withClauses,



[hudi] 18/37: [HUDI-2141] Support flink compaction metrics (#9515)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit bca4828bc08006769547549bf4e540dc35f89eed
Author: StreamingFlames <18889897...@163.com>
AuthorDate: Thu Sep 7 08:24:58 2023 +0800

[HUDI-2141] Support flink compaction metrics (#9515)
---
 .../hudi/metrics/FlinkCompactionMetrics.java   | 106 
 .../org/apache/hudi/metrics/FlinkWriteMetrics.java | 111 +
 .../apache/hudi/metrics/HoodieFlinkMetrics.java|  23 +
 .../apache/hudi/sink/compact/CompactOperator.java  |  16 +++
 .../hudi/sink/compact/CompactionCommitSink.java|  16 +++
 .../hudi/sink/compact/CompactionPlanOperator.java  |  19 +++-
 .../hudi/sink/utils/CompactFunctionWrapper.java|  11 +-
 7 files changed, 298 insertions(+), 4 deletions(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/metrics/FlinkCompactionMetrics.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/metrics/FlinkCompactionMetrics.java
new file mode 100644
index 000..abf7ef05a3f
--- /dev/null
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/metrics/FlinkCompactionMetrics.java
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metrics;
+
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.sink.compact.CompactOperator;
+import org.apache.hudi.sink.compact.CompactionPlanOperator;
+
+import org.apache.flink.metrics.MetricGroup;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.text.ParseException;
+import java.time.Duration;
+import java.time.Instant;
+
+/**
+ * Metrics for flink compaction.
+ */
+public class FlinkCompactionMetrics extends FlinkWriteMetrics {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(FlinkCompactionMetrics.class);
+
+  /**
+   * Key for compaction timer.
+   */
+  private static final String COMPACTION_KEY = "compaction";
+
+  /**
+   * Number of pending compaction instants.
+   *
+   * @see CompactionPlanOperator
+   */
+  private int pendingCompactionCount;
+
+  /**
+   * Duration between the earliest pending compaction instant time and now in 
seconds.
+   *
+   *  @see CompactionPlanOperator
+   */
+  private long compactionDelay;
+
+  /**
+   * Cost for consuming a compaction operation in milliseconds.
+   *
+   * @see CompactOperator
+   */
+  private long compactionCost;
+
+  public FlinkCompactionMetrics(MetricGroup metricGroup) {
+super(metricGroup, HoodieTimeline.COMPACTION_ACTION);
+  }
+
+  @Override
+  public void registerMetrics() {
+super.registerMetrics();
+metricGroup.gauge(getMetricsName(actionType, "pendingCompactionCount"), () 
-> pendingCompactionCount);
+metricGroup.gauge(getMetricsName(actionType, "compactionDelay"), () -> 
compactionDelay);
+metricGroup.gauge(getMetricsName(actionType, "compactionCost"), () -> 
compactionCost);
+  }
+
+  public void setPendingCompactionCount(int pendingCompactionCount) {
+this.pendingCompactionCount = pendingCompactionCount;
+  }
+
+  public void setFirstPendingCompactionInstant(Option 
firstPendingCompactionInstant) {
+try {
+  if (!firstPendingCompactionInstant.isPresent()) {
+this.compactionDelay = 0L;
+  } else {
+Instant start = 
HoodieInstantTimeGenerator.parseDateFromInstantTime(firstPendingCompactionInstant.get().getTimestamp()).toInstant();
+this.compactionDelay = Duration.between(start, 
Instant.now()).getSeconds();
+  }
+} catch (ParseException e) {
+  LOG.warn("Invalid input compaction instant" + 
firstPendingCompactionInstant);
+}
+  }
+
+  public void startCompaction() {
+startTimer(COMPACTION_KEY);
+  }
+
+  public void endCompaction() {
+this.compactionCost = stopTimer(COMPACTION_KEY);
+  }
+
+}

[hudi] 34/37: [MINOR] Avoiding warn log for succeeding in first attempt (#9686)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 88f744da58cc518f0e490d97eafd4e3ba4e993ec
Author: Sivabalan Narayanan 
AuthorDate: Tue Sep 12 02:57:42 2023 -0400

[MINOR] Avoiding warn log for succeeding in first attempt (#9686)


-

Co-authored-by: Danny Chan 
---
 .../src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
index 6d0ce7d16bf..7828cc7ee5a 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
@@ -130,7 +130,9 @@ object HoodieSparkSqlWriter {
 while (counter <= maxRetry && !succeeded) {
   try {
 toReturn = writeInternal(sqlContext, mode, optParams, sourceDf, 
streamingWritesParamsOpt, hoodieWriteClient)
-log.warn(s"Succeeded with attempt no $counter")
+if (counter > 0) {
+  log.warn(s"Succeeded with attempt no $counter")
+}
 succeeded = true
   } catch {
 case e: HoodieWriteConflictException =>



[hudi] 07/37: [HUDI-6813] Support table name for meta sync in bootstrap (#9600)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 033a9f80ff962d77d3f98c92ebee2eacbef06710
Author: Jing Zhang 
AuthorDate: Sat Sep 2 09:38:31 2023 +0800

[HUDI-6813] Support table name for meta sync in bootstrap (#9600)
---
 .../src/main/java/org/apache/hudi/cli/BootstrapExecutorUtils.java   | 2 ++
 1 file changed, 2 insertions(+)

diff --git 
a/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/cli/BootstrapExecutorUtils.java
 
b/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/cli/BootstrapExecutorUtils.java
index 7ea1ccdc745..90ab2f9cbab 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/cli/BootstrapExecutorUtils.java
+++ 
b/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/cli/BootstrapExecutorUtils.java
@@ -73,6 +73,7 @@ import static 
org.apache.hudi.keygen.constant.KeyGeneratorOptions.URL_ENCODE_PAR
 import static 
org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT;
 import static org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_BASE_PATH;
 import static 
org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_DATABASE_NAME;
+import static 
org.apache.hudi.sync.common.HoodieSyncConfig.META_SYNC_TABLE_NAME;
 
 /**
  * Performs bootstrap from a non-hudi source.
@@ -194,6 +195,7 @@ public class BootstrapExecutorUtils implements Serializable 
{
   TypedProperties metaProps = new TypedProperties();
   metaProps.putAll(props);
   metaProps.put(META_SYNC_DATABASE_NAME.key(), cfg.database);
+  metaProps.put(META_SYNC_TABLE_NAME.key(), cfg.tableName);
   metaProps.put(META_SYNC_BASE_PATH.key(), cfg.basePath);
   metaProps.put(META_SYNC_BASE_FILE_FORMAT.key(), cfg.baseFileFormat);
   if (props.getBoolean(HIVE_SYNC_BUCKET_SYNC.key(), 
HIVE_SYNC_BUCKET_SYNC.defaultValue())) {



[hudi] 03/37: [HUDI-6066] HoodieTableSource supports parquet predicate push down (#8437)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit d4de459784940bd7f0443e051a3ff79c5d26c14c
Author: Nicholas Jiang 
AuthorDate: Fri Sep 1 09:36:45 2023 +0800

[HUDI-6066] HoodieTableSource supports parquet predicate push down (#8437)
---
 .../apache/hudi/source/ExpressionPredicates.java   | 654 +
 .../org/apache/hudi/table/HoodieTableSource.java   |  18 +-
 .../apache/hudi/table/format/RecordIterators.java  |  60 +-
 .../hudi/table/format/cdc/CdcInputFormat.java  |  11 +-
 .../table/format/cow/CopyOnWriteInputFormat.java   |   9 +-
 .../table/format/mor/MergeOnReadInputFormat.java   |  17 +-
 .../hudi/source/TestExpressionPredicates.java  | 167 ++
 .../apache/hudi/table/ITTestHoodieDataSource.java  |  14 +
 .../apache/hudi/table/TestHoodieTableSource.java   |  23 +
 .../table/format/cow/ParquetSplitReaderUtil.java   |  10 +-
 .../reader/ParquetColumnarRowSplitReader.java  |  10 +-
 .../table/format/cow/ParquetSplitReaderUtil.java   |  10 +-
 .../reader/ParquetColumnarRowSplitReader.java  |  10 +-
 .../table/format/cow/ParquetSplitReaderUtil.java   |  10 +-
 .../reader/ParquetColumnarRowSplitReader.java  |  10 +-
 .../table/format/cow/ParquetSplitReaderUtil.java   |  10 +-
 .../reader/ParquetColumnarRowSplitReader.java  |  10 +-
 .../table/format/cow/ParquetSplitReaderUtil.java   |  10 +-
 .../reader/ParquetColumnarRowSplitReader.java  |  10 +-
 19 files changed, 1037 insertions(+), 36 deletions(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java
new file mode 100644
index 000..046e4b739ad
--- /dev/null
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java
@@ -0,0 +1,654 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.source;
+
+import org.apache.flink.table.expressions.CallExpression;
+import org.apache.flink.table.expressions.Expression;
+import org.apache.flink.table.expressions.FieldReferenceExpression;
+import org.apache.flink.table.expressions.ResolvedExpression;
+import org.apache.flink.table.expressions.ValueLiteralExpression;
+import org.apache.flink.table.functions.BuiltInFunctionDefinitions;
+import org.apache.flink.table.functions.FunctionDefinition;
+import org.apache.flink.table.types.logical.LogicalType;
+import org.apache.parquet.filter2.predicate.FilterPredicate;
+import org.apache.parquet.filter2.predicate.Operators;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+import static org.apache.hudi.util.ExpressionUtils.getValueFromLiteral;
+import static org.apache.parquet.filter2.predicate.FilterApi.and;
+import static org.apache.parquet.filter2.predicate.FilterApi.binaryColumn;
+import static org.apache.parquet.filter2.predicate.FilterApi.booleanColumn;
+import static org.apache.parquet.filter2.predicate.FilterApi.doubleColumn;
+import static org.apache.parquet.filter2.predicate.FilterApi.eq;
+import static org.apache.parquet.filter2.predicate.FilterApi.floatColumn;
+import static org.apache.parquet.filter2.predicate.FilterApi.gt;
+import static org.apache.parquet.filter2.predicate.FilterApi.gtEq;
+import static org.apache.parquet.filter2.predicate.FilterApi.intColumn;
+import static org.apache.parquet.filter2.predicate.FilterApi.longColumn;
+import static org.apache.parquet.filter2.predicate.FilterApi.lt;
+import static org.apache.parquet.filter2.predicate.FilterApi.ltEq;
+import static org.apache.parquet.filter2.predicate.FilterApi.not;
+import static org.apache.parquet.filter2.predicate.FilterApi.notEq;
+import static org.apache.parquet.filter2.predicate.FilterApi.or;
+import static org.apache.parquet.io.api.Binary.f

[hudi] 19/37: [HUDI-6736] Fixing rollback completion and commit timeline files removal (#9521)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit ae3d886e991458fb145132357f0c0c490982491c
Author: Jon Vexler 
AuthorDate: Thu Sep 7 15:09:54 2023 -0400

[HUDI-6736] Fixing rollback completion and commit timeline files removal 
(#9521)

The purpose of the change in 8849 was to fix the ordering of rollbacks so that
the rollback instant is completed first, followed by removal of the commit files
from the timeline.
For example, if t5.c.inflight has partially failed and t6.rb.requested is
triggered to roll it back, then towards completion t6.rb is moved to the
completed state, and only later are all t5 commit files removed from the timeline.
This could leave dangling commit files (t5) if the process crashes just after
moving the t6 rollback to completion, so 8849 also introduced polling of
completed rollbacks to ensure we don't trigger another rollback for t5.

But we missed that we had already landed 5148, which addressed a similar issue.
As per 5148, we first need to delete the commit files (t5) from the timeline and
then transition the rollback (t6.rb) to completion. So even if there is a crash,
re-attempting t6.rb.requested gets to completion without any issues (even if t5
is no longer in the timeline at all). Hence some of the core changes added as
part of 8849 are reverted here; the tests added there are kept, so the entire
patch is not reverted. The two orderings are sketched below.

-

Co-authored-by: Jonathan Vexler <=>
Co-authored-by: sivabalan 
---
 .../hudi/client/BaseHoodieTableServiceClient.java  | 57 --
 .../rollback/BaseRollbackActionExecutor.java   | 25 +-
 .../java/org/apache/hudi/table/TestCleaner.java| 38 +++
 .../TestCopyOnWriteRollbackActionExecutor.java | 47 --
 .../hudi/testutils/HoodieClientTestBase.java   | 44 -
 .../hudi/common/testutils/HoodieTestTable.java |  8 ---
 .../deltastreamer/TestHoodieDeltaStreamer.java | 14 --
 7 files changed, 62 insertions(+), 171 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
index 0af2ace25f0..5af681d9a8a 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
@@ -42,7 +42,6 @@ import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
 import org.apache.hudi.common.table.timeline.HoodieInstant;
 import org.apache.hudi.common.table.timeline.HoodieTimeline;
-import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
 import org.apache.hudi.common.util.CleanerUtils;
 import org.apache.hudi.common.util.ClusteringUtils;
 import org.apache.hudi.common.util.CollectionUtils;
@@ -61,7 +60,6 @@ import org.apache.hudi.metadata.HoodieTableMetadataWriter;
 import org.apache.hudi.table.HoodieTable;
 import org.apache.hudi.table.action.HoodieWriteMetadata;
 import org.apache.hudi.table.action.compact.CompactHelpers;
-import org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor;
 import org.apache.hudi.table.action.rollback.RollbackUtils;
 import org.apache.hudi.table.marker.WriteMarkersFactory;
 
@@ -913,7 +911,6 @@ public abstract class BaseHoodieTableServiceClient 
extends BaseHoodieCl
   protected Boolean rollbackFailedWrites() {
 HoodieTable table = createTable(config, hadoopConf);
 List instantsToRollback = 
getInstantsToRollback(table.getMetaClient(), 
config.getFailedWritesCleanPolicy(), Option.empty());
-removeInflightFilesAlreadyRolledBack(instantsToRollback, 
table.getMetaClient());
 Map> pendingRollbacks = 
getPendingRollbackInfos(table.getMetaClient());
 instantsToRollback.forEach(entry -> pendingRollbacks.putIfAbsent(entry, 
Option.empty()));
 rollbackFailedWrites(pendingRollbacks);
@@ -978,60 +975,6 @@ public abstract class BaseHoodieTableServiceClient extends BaseHoodieCl
 }
   }
 
-  /**
-   * This method filters out the instants that are already rolled back, but 
their pending commit files are left
-   * because of job failures. In addition to filtering out these instants, it 
will also cleanup the inflight instants
-   * from the timeline.
-   */
-  protected void removeInflightFilesAlreadyRolledBack(List 
instantsToRollback, HoodieTableMetaClient metaClient) {
-if (instantsToRollback.isEmpty()) {
-  return;
-}
-// Find the oldest inflight timestamp.
-String lowestInflightCommitTime = Collections.min(instantsToRollback);
-HoodieActiveTimeline activeTimeline = metaClient.getAc

[hudi] 14/37: [HUDI-6818] Create a database automatically when using the flink catalog dfs mode (#9592)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit ed1d7c97d166edceeac77fdde15f39b2fb0b069f
Author: empcl <1515827...@qq.com>
AuthorDate: Tue Sep 5 10:24:34 2023 +0800

[HUDI-6818] Create a database automatically when using the flink catalog 
dfs mode (#9592)
---
 .../main/java/org/apache/hudi/table/catalog/HoodieCatalog.java | 10 ++
 .../java/org/apache/hudi/table/catalog/TestHoodieCatalog.java  |  5 +++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieCatalog.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieCatalog.java
index 17e3cfa2838..d9e387476cb 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieCatalog.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieCatalog.java
@@ -125,6 +125,16 @@ public class HoodieCatalog extends AbstractCatalog {
 } catch (IOException e) {
   throw new CatalogException(String.format("Checking catalog path %s 
exists exception.", catalogPathStr), e);
 }
+
+if (!databaseExists(getDefaultDatabase())) {
+  LOG.info("Creating database {} automatically because it does not 
exist.", getDefaultDatabase());
+  Path dbPath = new Path(catalogPath, getDefaultDatabase());
+  try {
+fs.mkdirs(dbPath);
+  } catch (IOException e) {
+throw new CatalogException(String.format("Creating database %s 
exception.", getDefaultDatabase()), e);
+  }
+}
   }
 
   @Override
diff --git 
a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/catalog/TestHoodieCatalog.java
 
b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/catalog/TestHoodieCatalog.java
index 5983192fc82..dc4e0db058a 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/catalog/TestHoodieCatalog.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/catalog/TestHoodieCatalog.java
@@ -157,8 +157,9 @@ public class TestHoodieCatalog {
 streamTableEnv = TableEnvironmentImpl.create(settings);
 streamTableEnv.getConfig().getConfiguration()
 
.setInteger(ExecutionConfigOptions.TABLE_EXEC_RESOURCE_DEFAULT_PARALLELISM, 2);
-File testDb = new File(tempFile, TEST_DEFAULT_DATABASE);
-testDb.mkdir();
+
+File catalogPath = new File(tempFile.getPath());
+catalogPath.mkdir();
 
 catalog = new HoodieCatalog("hudi", 
Configuration.fromMap(getDefaultCatalogOption()));
 catalog.open();



[hudi] 13/37: [HUDI-6804] Fix hive read schema evolution MOR table (#9573)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit a136369344f4123fc77d8109afb402ab416f0ce5
Author: Zouxxyy 
AuthorDate: Tue Sep 5 09:40:43 2023 +0800

[HUDI-6804] Fix hive read schema evolution MOR table (#9573)
---
 .../apache/hudi/hadoop/SchemaEvolutionContext.java |  11 +-
 .../functional/TestHiveTableSchemaEvolution.java   | 159 +++--
 2 files changed, 93 insertions(+), 77 deletions(-)

diff --git 
a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/SchemaEvolutionContext.java
 
b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/SchemaEvolutionContext.java
index f9f7faf9e29..746066e1c1c 100644
--- 
a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/SchemaEvolutionContext.java
+++ 
b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/SchemaEvolutionContext.java
@@ -82,7 +82,7 @@ public class SchemaEvolutionContext {
 
   private final InputSplit split;
   private final JobConf job;
-  private HoodieTableMetaClient metaClient;
+  private final HoodieTableMetaClient metaClient;
   public Option internalSchemaOption;
 
   public SchemaEvolutionContext(InputSplit split, JobConf job) throws 
IOException {
@@ -149,6 +149,7 @@ public class SchemaEvolutionContext {
   realtimeRecordReader.setWriterSchema(writerSchema);
   realtimeRecordReader.setReaderSchema(readerSchema);
   realtimeRecordReader.setHiveSchema(hiveSchema);
+  internalSchemaOption = Option.of(prunedInternalSchema);
   RealtimeSplit realtimeSplit = (RealtimeSplit) split;
   LOG.info(String.format("About to read compacted logs %s for base split 
%s, projecting cols %s",
   realtimeSplit.getDeltaLogPaths(), realtimeSplit.getPath(), 
requiredColumns));
@@ -171,7 +172,7 @@ public class SchemaEvolutionContext {
   if (!disableSchemaEvolution) {
 prunedSchema = 
InternalSchemaUtils.pruneInternalSchema(internalSchemaOption.get(), 
requiredColumns);
 InternalSchema querySchema = prunedSchema;
-Long commitTime = 
Long.valueOf(FSUtils.getCommitTime(finalPath.getName()));
+long commitTime = 
Long.parseLong(FSUtils.getCommitTime(finalPath.getName()));
 InternalSchema fileSchema = 
InternalSchemaCache.searchSchemaAndCache(commitTime, metaClient, false);
 InternalSchema mergedInternalSchema = new 
InternalSchemaMerger(fileSchema, querySchema, true,
 true).mergeSchema();
@@ -258,10 +259,10 @@ public class SchemaEvolutionContext {
   case DECIMAL:
 return typeInfo;
   case TIME:
-throw new UnsupportedOperationException(String.format("cannot convert 
%s type to hive", new Object[] { type }));
+throw new UnsupportedOperationException(String.format("cannot convert 
%s type to hive", type));
   default:
-LOG.error(String.format("cannot convert unknown type: %s to Hive", new 
Object[] { type }));
-throw new UnsupportedOperationException(String.format("cannot convert 
unknown type: %s to Hive", new Object[] { type }));
+LOG.error(String.format("cannot convert unknown type: %s to Hive", 
type));
+throw new UnsupportedOperationException(String.format("cannot convert 
unknown type: %s to Hive", type));
 }
   }
 
diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestHiveTableSchemaEvolution.java
 
b/hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestHiveTableSchemaEvolution.java
index 027224dbe60..dff9d2e9ccc 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestHiveTableSchemaEvolution.java
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestHiveTableSchemaEvolution.java
@@ -19,39 +19,46 @@
 package org.apache.hudi.functional;
 
 import org.apache.hudi.HoodieSparkUtils;
-import org.apache.hudi.common.fs.FSUtils;
 import org.apache.hudi.hadoop.HoodieParquetInputFormat;
-import org.apache.hudi.hadoop.SchemaEvolutionContext;
-import org.apache.hudi.hadoop.realtime.HoodieEmptyRecordReader;
-import org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader;
-import org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader;
-import org.apache.hudi.hadoop.realtime.RealtimeSplit;
+import org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat;
 
-import com.uber.hoodie.hadoop.realtime.HoodieRealtimeInputFormat;
 import org.apache.hadoop.fs.Path;
-import org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat;
 import org.apache.hadoop.hive.serde.serdeConstants;
 import org.apache.hadoop.hive.serde2.ColumnProjectionUtils;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.io.DoubleWritable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.

[hudi] 11/37: [HUDI-6812] Fix bootstrap operator null pointer exception while lastInstantTime is null (#9599)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 629ee75fe5f38890d63c479c569596e3a8a3d04c
Author: oliver jude <75296820+zhuzhengju...@users.noreply.github.com>
AuthorDate: Mon Sep 4 09:58:55 2023 +0800

[HUDI-6812] Fix bootstrap operator null pointer exception while
lastInstantTime is null (#9599)

Co-authored-by: zhuzhengjun 
---
 .../main/java/org/apache/hudi/sink/bootstrap/BootstrapOperator.java   | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bootstrap/BootstrapOperator.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bootstrap/BootstrapOperator.java
index 7c9daf4075d..1bdfeb7296b 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bootstrap/BootstrapOperator.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bootstrap/BootstrapOperator.java
@@ -108,7 +108,9 @@ public class BootstrapOperator>
   @Override
   public void snapshotState(StateSnapshotContext context) throws Exception {
 lastInstantTime = this.ckpMetadata.lastPendingInstant();
-instantState.update(Collections.singletonList(lastInstantTime));
+if (null != lastInstantTime) {
+  instantState.update(Collections.singletonList(lastInstantTime));
+}
   }
 
   @Override



[hudi] 01/37: [HUDI-6562] Fixed issue for delete events for AWSDmsAvroPayload when CDC enabled (#9519)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 655904a6f29d1223cdddfb7ff0c3535c1580f3f7
Author: Aditya Goenka <63430370+ad1happy...@users.noreply.github.com>
AuthorDate: Fri Sep 1 04:47:48 2023 +0530

[HUDI-6562] Fixed issue for delete events for AWSDmsAvroPayload when CDC 
enabled (#9519)

Co-authored-by: Y Ethan Guo 
---
 .../hudi/io/HoodieMergeHandleWithChangeLog.java|  2 +-
 .../functional/cdc/TestCDCDataFrameSuite.scala | 56 +-
 2 files changed, 56 insertions(+), 2 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java
index d610891c2ca..f8669416f0c 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java
@@ -103,7 +103,7 @@ public class HoodieMergeHandleWithChangeLog 
extends HoodieMergeHandl
 // TODO Remove these unnecessary newInstance invocations
 HoodieRecord savedRecord = newRecord.newInstance();
 super.writeInsertRecord(newRecord);
-if (!HoodieOperation.isDelete(newRecord.getOperation())) {
+if (!HoodieOperation.isDelete(newRecord.getOperation()) && 
!savedRecord.isDelete(schema, config.getPayloadConfig().getProps())) {
   cdcLogger.put(newRecord, null, savedRecord.toIndexedRecord(schema, 
config.getPayloadConfig().getProps()).map(HoodieAvroIndexedRecord::getData));
 }
   }
diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala
index 36629687106..aac836d8c3a 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala
@@ -26,7 +26,8 @@ import org.apache.hudi.common.table.cdc.{HoodieCDCOperation, 
HoodieCDCSupplement
 import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient, 
TableSchemaResolver}
 import org.apache.hudi.common.testutils.HoodieTestDataGenerator
 import 
org.apache.hudi.common.testutils.RawTripTestPayload.{deleteRecordsToStrings, 
recordsToStrings}
-import org.apache.spark.sql.SaveMode
+import org.apache.spark.sql.{Row, SaveMode}
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
 import org.junit.jupiter.api.Assertions.{assertEquals, assertFalse, assertTrue}
 import org.junit.jupiter.params.ParameterizedTest
 import org.junit.jupiter.params.provider.{CsvSource, EnumSource}
@@ -634,4 +635,57 @@ class TestCDCDataFrameSuite extends HoodieCDCTestBase {
 val cdcDataOnly2 = cdcDataFrame((commitTime2.toLong - 1).toString)
 assertCDCOpCnt(cdcDataOnly2, insertedCnt2, updatedCnt2, 0)
   }
+
+  @ParameterizedTest
+  @EnumSource(classOf[HoodieCDCSupplementalLoggingMode])
+  def testCDCWithAWSDMSPayload(loggingMode: HoodieCDCSupplementalLoggingMode): 
Unit = {
+val options = Map(
+  "hoodie.table.name" -> "test",
+  "hoodie.datasource.write.recordkey.field" -> "id",
+  "hoodie.datasource.write.precombine.field" -> "replicadmstimestamp",
+  "hoodie.datasource.write.keygenerator.class" -> 
"org.apache.hudi.keygen.NonpartitionedKeyGenerator",
+  "hoodie.datasource.write.partitionpath.field" -> "",
+  "hoodie.datasource.write.payload.class" -> 
"org.apache.hudi.common.model.AWSDmsAvroPayload",
+  "hoodie.table.cdc.enabled" -> "true",
+  "hoodie.table.cdc.supplemental.logging.mode" -> "data_before_after"
+)
+
+val data: Seq[(String, String, String, String)] = Seq(
+  ("1", "I", "2023-06-14 15:46:06.953746", "A"),
+  ("2", "I", "2023-06-14 15:46:07.953746", "B"),
+  ("3", "I", "2023-06-14 15:46:08.953746", "C")
+)
+
+val schema: StructType = StructType(Seq(
+  StructField("id", StringType),
+  StructField("Op", StringType),
+  StructField("replicadmstimestamp", StringType),
+  StructField("code", StringType)
+))
+
+val df = spark.createDataFrame(data.map(Row.fromTuple), schema)
+df.write
+  .format("org.apache.hudi")
+  .option("hoodie.datasource.write.operation", "upsert")
+  

[hudi] 12/37: [HUDI-6805] Print detailed error message in clustering (#9577)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 620ee24b02b8e1e31f0d08a6d2a737fc96302d07
Author: Akira Ajisaka 
AuthorDate: Mon Sep 4 15:28:15 2023 +0900

[HUDI-6805] Print detailed error message in clustering (#9577)
---
 .../java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java| 4 
 1 file changed, 4 insertions(+)

diff --git 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java
 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java
index 04362f94da5..05019d2e814 100644
--- 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java
+++ 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java
@@ -29,6 +29,7 @@ import org.apache.hudi.common.model.HoodieWriteStat;
 import org.apache.hudi.common.model.IOType;
 import org.apache.hudi.common.util.HoodieTimer;
 import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieIOException;
@@ -241,6 +242,9 @@ public class HoodieRowCreateHandle implements Serializable {
 stat.setTotalWriteBytes(fileSizeInBytes);
 stat.setFileSizeInBytes(fileSizeInBytes);
 stat.setTotalWriteErrors(writeStatus.getTotalErrorRecords());
+for (Pair pair : 
writeStatus.getFailedRecords()) {
+  LOG.error("Failed to write {}", pair.getLeft(), pair.getRight());
+}
 HoodieWriteStat.RuntimeStats runtimeStats = new 
HoodieWriteStat.RuntimeStats();
 runtimeStats.setTotalCreateTime(currTimer.endTimer());
 stat.setRuntimeStats(runtimeStats);



[hudi] 09/37: [MINOR] Catch EntityNotFoundException correctly (#9595)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 8b273631cfde855478d677a679f4365102e06f6b
Author: Shawn Chang <42792772+c...@users.noreply.github.com>
AuthorDate: Sat Sep 2 04:06:37 2023 -0700

[MINOR] Catch EntityNotFoundException correctly (#9595)

When a table or database is not found while syncing a table to Glue, Glue
returns `EntityNotFoundException`.
After upgrading to AWS SDK V2, Hudi uses `GlueAsyncClient` and gets back a
`CompletableFuture`, which throws `ExecutionException` with
`EntityNotFoundException` nested inside when the table or database doesn't exist.
The existing Hudi code, however, does not handle `ExecutionException` and fails the job.
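
A minimal, JDK-only sketch of the unwrapping pattern (the exception type and
method name here are stand-ins, not the AWS SDK or Hudi classes):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class AsyncCauseUnwrapSketch {

  static class EntityNotFound extends RuntimeException { }

  // Async clients surface failures as ExecutionException; the interesting
  // exception is the nested cause.
  static boolean tableExists(CompletableFuture<Object> response) {
    try {
      return response.get() != null;
    } catch (ExecutionException e) {
      if (e.getCause() instanceof EntityNotFound) {
        return false;                                  // a missing table is an expected outcome
      }
      throw new RuntimeException("Failed to get table", e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    CompletableFuture<Object> failed = new CompletableFuture<>();
    failed.completeExceptionally(new EntityNotFound());
    System.out.println(tableExists(failed));           // false, instead of failing the job
  }
}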

-

Co-authored-by: Shawn Chang 
---
 .../hudi/aws/sync/AWSGlueCatalogSyncClient.java | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git 
a/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java 
b/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java
index d45cc76a6bc..a76ca86894a 100644
--- 
a/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java
+++ 
b/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java
@@ -67,6 +67,7 @@ import java.util.List;
 import java.util.Map;
 import java.util.Objects;
 import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutionException;
 import java.util.stream.Collectors;
 
 import static org.apache.hudi.aws.utils.S3Utils.s3aToS3;
@@ -456,9 +457,13 @@ public class AWSGlueCatalogSyncClient extends 
HoodieSyncClient {
 .build();
 try {
   return Objects.nonNull(awsGlue.getTable(request).get().table());
-} catch (EntityNotFoundException e) {
-  LOG.info("Table not found: " + tableId(databaseName, tableName), e);
-  return false;
+} catch (ExecutionException e) {
+  if (e.getCause() instanceof EntityNotFoundException) {
+LOG.info("Table not found: " + tableId(databaseName, tableName), e);
+return false;
+  } else {
+throw new HoodieGlueSyncException("Fail to get table: " + 
tableId(databaseName, tableName), e);
+  }
 } catch (Exception e) {
   throw new HoodieGlueSyncException("Fail to get table: " + 
tableId(databaseName, tableName), e);
 }
@@ -469,9 +474,13 @@ public class AWSGlueCatalogSyncClient extends 
HoodieSyncClient {
 GetDatabaseRequest request = 
GetDatabaseRequest.builder().name(databaseName).build();
 try {
   return Objects.nonNull(awsGlue.getDatabase(request).get().database());
-} catch (EntityNotFoundException e) {
-  LOG.info("Database not found: " + databaseName, e);
-  return false;
+} catch (ExecutionException e) {
+  if (e.getCause() instanceof EntityNotFoundException) {
+LOG.info("Database not found: " + databaseName, e);
+return false;
+  } else {
+throw new HoodieGlueSyncException("Fail to check if database exists " 
+ databaseName, e);
+  }
 } catch (Exception e) {
   throw new HoodieGlueSyncException("Fail to check if database exists " + 
databaseName, e);
 }



[hudi] 15/37: [HUDI-6766] Fixing mysql debezium data loss (#9475)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 83cdca8bc5d6beabcd60b8f8717a3b0133920d67
Author: Sandeep Parwal <129802178+twlo-sand...@users.noreply.github.com>
AuthorDate: Mon Sep 4 19:36:03 2023 -0700

[HUDI-6766] Fixing mysql debezium data loss  (#9475)
---
 .../model/debezium/MySqlDebeziumAvroPayload.java   | 29 +++---
 .../debezium/TestMySqlDebeziumAvroPayload.java |  6 +
 2 files changed, 32 insertions(+), 3 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/model/debezium/MySqlDebeziumAvroPayload.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/model/debezium/MySqlDebeziumAvroPayload.java
index a0a6304fa40..fceafee554c 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/model/debezium/MySqlDebeziumAvroPayload.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/model/debezium/MySqlDebeziumAvroPayload.java
@@ -66,8 +66,31 @@ public class MySqlDebeziumAvroPayload extends 
AbstractDebeziumAvroPayload {
 new HoodieDebeziumAvroPayloadException(String.format("%s cannot be 
null in insert record: %s",
 DebeziumConstants.ADDED_SEQ_COL_NAME, insertRecord)));
 Option currentSourceSeqOpt = extractSeq(currentRecord);
-// Pick the current value in storage only if its Seq (file+pos) is latest
-// compared to the Seq (file+pos) of the insert value
-return currentSourceSeqOpt.isPresent() && 
insertSourceSeq.compareTo(currentSourceSeqOpt.get()) < 0;
+
+// handle bootstrap case
+if (!currentSourceSeqOpt.isPresent()) {
+  return false;
+}
+
+// Seq is file+pos string like "001.10", getting [001,10] from it
+String[] currentFilePos = currentSourceSeqOpt.get().split("\\.");
+String[] insertFilePos = insertSourceSeq.split("\\.");
+
+long currentFileNum = Long.valueOf(currentFilePos[0]);
+long insertFileNum = Long.valueOf(insertFilePos[0]);
+
+if (insertFileNum < currentFileNum) {
+  // pick the current value
+  return true;
+} else if (insertFileNum > currentFileNum) {
+  // pick the insert value
+  return false;
+}
+
+// file name is the same, compare the position in the file
+Long currentPos = Long.valueOf(currentFilePos[1]);
+Long insertPos = Long.valueOf(insertFilePos[1]);
+
+return insertPos <= currentPos;
   }
 }
diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/model/debezium/TestMySqlDebeziumAvroPayload.java
 
b/hudi-common/src/test/java/org/apache/hudi/common/model/debezium/TestMySqlDebeziumAvroPayload.java
index f5c3563f064..e257e2bee02 100644
--- 
a/hudi-common/src/test/java/org/apache/hudi/common/model/debezium/TestMySqlDebeziumAvroPayload.java
+++ 
b/hudi-common/src/test/java/org/apache/hudi/common/model/debezium/TestMySqlDebeziumAvroPayload.java
@@ -96,6 +96,12 @@ public class TestMySqlDebeziumAvroPayload {
 payload = new MySqlDebeziumAvroPayload(lateRecord, "0.222");
 mergedRecord = payload.combineAndGetUpdateValue(existingRecord, 
avroSchema);
 validateRecord(mergedRecord, 1, Operation.INSERT, "1.111");
+
+GenericRecord originalRecord = createRecord(1, Operation.INSERT, 
"0.23");
+payload = new MySqlDebeziumAvroPayload(originalRecord, "0.23");
+updateRecord = createRecord(1, Operation.UPDATE, "0.123");
+mergedRecord = payload.combineAndGetUpdateValue(updateRecord, avroSchema);
+validateRecord(mergedRecord, 1, Operation.UPDATE, "0.123");
   }
 
   @Test

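For context on the data-loss fix above: the previous implementation compared
the MySQL binlog sequence ("<fileNumber>.<position>") as a plain string, which
can order "001.10" before "001.9". A standalone sketch of the numeric
comparison introduced by this patch (method and class names are illustrative,
not the MySqlDebeziumAvroPayload API):

    public class DebeziumSeqCompareSketch {

      // Returns true when the record currently in storage should be kept, i.e. when the
      // incoming record's binlog sequence is not newer than the current one.
      static boolean keepCurrentRecord(String currentSeq, String insertSeq) {
        String[] current = currentSeq.split("\\.");
        String[] insert = insertSeq.split("\\.");

        long currentFile = Long.parseLong(current[0]);
        long insertFile = Long.parseLong(insert[0]);
        if (insertFile != currentFile) {
          return insertFile < currentFile;
        }
        long currentPos = Long.parseLong(current[1]);
        long insertPos = Long.parseLong(insert[1]);
        return insertPos <= currentPos;
      }

      public static void main(String[] args) {
        System.out.println(keepCurrentRecord("001.10", "001.9"));  // true: position 9 is older than 10
        System.out.println(keepCurrentRecord("001.9", "001.10"));  // false: position 10 is newer
        System.out.println(keepCurrentRecord("002.1", "003.1"));   // false: later binlog file wins
      }
    }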


[hudi] 23/37: [HUDI-6758] Fixing deducing spurious log blocks due to spark retries (#9611)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit bba95305a073b5ffe94fb579b8a525fd92d54294
Author: Sivabalan Narayanan 
AuthorDate: Sun Sep 10 14:11:49 2023 -0400

[HUDI-6758] Fixing deducing spurious log blocks due to spark retries (#9611)

- We previously attempted a fix in #9545 to avoid reading spurious log blocks 
on the reader side.
When testing that patch end to end, some gaps were found. Specifically, the 
attempt id obtained from taskContextSupplier did not refer to the task's attempt 
number, so this patch fixes it. Tested end to end by simulating Spark retries 
and spurious log blocks: the reader detects them and ignores the duplicate 
copies of log blocks.
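
To illustrate the idea, here is a simplified, hypothetical sketch (not the
actual HoodieAppendHandle/AbstractHoodieLogRecordReader code): the writer tags
each log block with an "<attemptNumber>,<blockSequenceNumber>" identifier, and
the reader keeps only the copies written by the latest attempt for each
sequence number.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class BlockIdentifierSketch {

      // Writer side: tag each log block with "<attemptNumber>,<blockSequenceNumber>".
      static String blockIdentifier(long attemptNumber, int blockSequenceNumber) {
        return attemptNumber + "," + blockSequenceNumber;
      }

      // Reader side: when several attempts wrote the same sequence numbers (Spark task
      // retries), keep only the blocks from the highest attempt seen for each sequence.
      static List<String> dedupe(List<String> blockIdentifiers) {
        Map<Integer, Long> latestAttemptPerSequence = new HashMap<>();
        for (String id : blockIdentifiers) {
          String[] parts = id.split(",");
          latestAttemptPerSequence.merge(Integer.parseInt(parts[1]), Long.parseLong(parts[0]), Math::max);
        }
        List<String> kept = new ArrayList<>();
        for (String id : blockIdentifiers) {
          String[] parts = id.split(",");
          if (latestAttemptPerSequence.get(Integer.parseInt(parts[1])) == Long.parseLong(parts[0])) {
            kept.add(id);
          }
        }
        return kept;
      }

      public static void main(String[] args) {
        // Two attempts (0 and 1) each wrote sequence numbers 0 and 1; only attempt 1's copies survive.
        System.out.println(dedupe(List.of("0,0", "0,1", "1,0", "1,1"))); // [1,0, 1,1]
      }
    }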
---
 .../org/apache/hudi/io/HoodieAppendHandle.java | 22 -
 .../org/apache/hudi/DummyTaskContextSupplier.java  |  5 ++
 .../hudi/client/FlinkTaskContextSupplier.java  |  5 ++
 .../java/org/apache/hudi/io/FlinkAppendHandle.java |  4 +
 .../client/common/JavaTaskContextSupplier.java |  6 ++
 .../testutils/HoodieJavaClientTestHarness.java |  5 ++
 .../hudi/client/SparkTaskContextSupplier.java  |  6 ++
 .../common/engine/LocalTaskContextSupplier.java|  6 ++
 .../hudi/common/engine/TaskContextSupplier.java|  5 ++
 .../table/log/AbstractHoodieLogRecordReader.java   | 95 ++
 .../common/table/log/block/HoodieLogBlock.java |  2 +-
 .../common/functional/TestHoodieLogFormat.java |  2 +-
 12 files changed, 123 insertions(+), 40 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
index 65f79c5147e..ca081fce60f 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
@@ -54,6 +54,7 @@ import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.exception.HoodieAppendException;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieUpsertException;
+import org.apache.hudi.metadata.HoodieTableMetadata;
 import org.apache.hudi.table.HoodieTable;
 
 import org.apache.avro.Schema;
@@ -132,6 +133,8 @@ public class HoodieAppendHandle extends 
HoodieWriteHandle extends 
HoodieWriteHandle hoodieTable,
@@ -153,6 +157,7 @@ public class HoodieAppendHandle extends 
HoodieWriteHandle();
 this.recordProperties.putAll(config.getProps());
+this.attemptNumber = taskContextSupplier.getAttemptNumberSupplier().get();
   }
 
   public HoodieAppendHandle(HoodieWriteConfig config, String instantTime, 
HoodieTable hoodieTable,
@@ -461,11 +466,13 @@ public class HoodieAppendHandle extends 
HoodieWriteHandle 0) {
-blocks.add(new HoodieDeleteBlock(recordsToDelete.toArray(new 
DeleteRecord[0]), getUpdatedHeader(header, blockSequenceNumber++, 
taskContextSupplier.getAttemptIdSupplier().get(;
+blocks.add(new HoodieDeleteBlock(recordsToDelete.toArray(new 
DeleteRecord[0]), getUpdatedHeader(header, blockSequenceNumber++, 
attemptNumber, config,
+addBlockIdentifier(;
   }
 
   if (blocks.size() > 0) {
@@ -562,6 +569,10 @@ public class HoodieAppendHandle extends 
HoodieWriteHandle record) {
 if (!partitionPath.equals(record.getPartitionPath())) {
   HoodieUpsertException failureEx = new HoodieUpsertException("mismatched 
partition path, record partition: "
@@ -635,10 +646,13 @@ public class HoodieAppendHandle extends 
HoodieWriteHandle 
getUpdatedHeader(Map header, int 
blockSequenceNumber, long attemptNumber) {
+  private static Map 
getUpdatedHeader(Map header, int 
blockSequenceNumber, long attemptNumber,
+  
HoodieWriteConfig config, boolean addBlockIdentifier) {
 Map updatedHeader = new HashMap<>();
 updatedHeader.putAll(header);
-updatedHeader.put(HeaderMetadataType.BLOCK_SEQUENCE_NUMBER, 
String.valueOf(attemptNumber) + "," + String.valueOf(blockSequenceNumber));
+if (addBlockIdentifier && 
!HoodieTableMetadata.isMetadataTable(config.getBasePath())) { // add block 
sequence numbers only for data table.
+  updatedHeader.put(HeaderMetadataType.BLOCK_IDENTIFIER, 
String.valueOf(attemptNumber) + "," + String.valueOf(blockSequenceNumber));
+}
 return updatedHeader;
   }
 
diff --git 
a/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/DummyTaskContextSupplier.java
 
b/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/DummyTaskContextSupplier.java
index d2c07e35509..d87b6147302 100644
--- 
a/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/DummyTaskContextSupplier.java
+++ 
b/hudi-client/hudi-client-common/src/test/jav

[hudi] 33/37: [HUDI-6834] Fixing time travel queries when overlaps with cleaner and archival time window (#9666)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit c1a497059c42b7116d46b8afae4b826124fce77f
Author: Sivabalan Narayanan 
AuthorDate: Tue Sep 12 02:33:11 2023 -0400

[HUDI-6834] Fixing time travel queries when overlaps with cleaner and 
archival time window (#9666)

When a time travel query overlaps with the cleaner or archival window, we should 
explicitly fail the query.
Otherwise, we might end up serving partial or wrong results, or empty rows.
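
A minimal standalone sketch of the guard (not the exact TimelineUtils code; it
assumes Hudi's yyyyMMddHHmmssSSS instant format, where lexicographic string
comparison matches chronological order):

    public class TimeTravelGuardSketch {

      // Fails the query if the requested "as of" timestamp is older than the earliest
      // commit the cleaner retains.
      static void validateTimestampAsOf(String earliestCommitToRetain, String timestampAsOf) {
        if (earliestCommitToRetain != null && earliestCommitToRetain.compareTo(timestampAsOf) > 0) {
          throw new IllegalArgumentException(
              "Cleaner has already cleaned up data older than " + earliestCommitToRetain
                  + "; retain more commits for timestamp-as-of " + timestampAsOf + " to work.");
        }
      }

      public static void main(String[] args) {
        validateTimestampAsOf("20230901000000000", "20230905000000000"); // ok
        try {
          validateTimestampAsOf("20230910000000000", "20230905000000000"); // fails: data cleaned up
        } catch (IllegalArgumentException e) {
          System.out.println(e.getMessage());
        }
      }
    }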
---
 .../hudi/common/table/timeline/TimelineUtils.java  |  30 ++
 .../hudi/functional/TestTimeTravelQuery.scala  | 104 +++--
 2 files changed, 127 insertions(+), 7 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
index a763f4d9053..a682c9face9 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
@@ -25,9 +25,12 @@ import org.apache.hudi.common.model.HoodieCommitMetadata;
 import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
 import org.apache.hudi.common.model.WriteOperationType;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.CleanerUtils;
 import org.apache.hudi.common.util.ClusteringUtils;
 import org.apache.hudi.common.util.CollectionUtils;
 import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.ValidationUtils;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieIOException;
 import org.apache.hudi.exception.HoodieTimeTravelException;
@@ -50,6 +53,7 @@ import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_
 import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.GREATER_THAN;
 import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.GREATER_THAN_OR_EQUALS;
 import static org.apache.hudi.common.table.timeline.HoodieTimeline.LESSER_THAN;
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.LESSER_THAN_OR_EQUALS;
 import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.REPLACE_COMMIT_ACTION;
 import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.SAVEPOINT_ACTION;
 import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.compareTimestamps;
@@ -339,6 +343,32 @@ public class TimelineUtils {
 timestampAsOf, incompleteCommitTime));
   }
 }
+
+// also timestamp as of cannot query cleaned up data.
+Option latestCleanOpt = 
metaClient.getActiveTimeline().getCleanerTimeline().filterCompletedInstants().lastInstant();
+if (latestCleanOpt.isPresent()) {
+  // Ensure timestamp as of is > than the earliest commit to retain and
+  try {
+HoodieCleanMetadata cleanMetadata = 
CleanerUtils.getCleanerMetadata(metaClient, latestCleanOpt.get());
+String earliestCommitToRetain = 
cleanMetadata.getEarliestCommitToRetain();
+if (!StringUtils.isNullOrEmpty(earliestCommitToRetain)) {
+  
ValidationUtils.checkArgument(HoodieTimeline.compareTimestamps(earliestCommitToRetain,
 LESSER_THAN_OR_EQUALS, timestampAsOf),
+  "Cleaner cleaned up the timestamp of interest. Please ensure 
sufficient commits are retained with cleaner "
+  + "for Timestamp as of query to work");
+} else {
+  // when cleaner is based on file versions, we may not find value for 
earliestCommitToRetain.
+  // so, lets check if timestamp of interest is archived based on 
first entry in active timeline
+  Option firstCompletedInstant = 
metaClient.getActiveTimeline().getWriteTimeline().filterCompletedInstants().firstInstant();
+  if (firstCompletedInstant.isPresent()) {
+
ValidationUtils.checkArgument(HoodieTimeline.compareTimestamps(firstCompletedInstant.get().getTimestamp(),
 LESSER_THAN_OR_EQUALS, timestampAsOf),
+"Please ensure sufficient commits are retained (uncleaned and 
un-archived) for timestamp as of query to work.");
+  }
+}
+  } catch (IOException e) {
+throw new HoodieTimeTravelException("Cleaner cleaned up the timestamp 
of interest. "
++ "Please ensure sufficient commits are retained with cleaner for 
Timestamp as of query to work ");
+  }
+}
   }
 
   /**
diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestTimeTravelQuery.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestTimeTravelQuery.scala
index cdb94907158..7f3d9386fb2 100644
--- 
a/hudi-spark-datasource

[hudi] 05/37: [HUDI-6579] Fix streaming write when meta cols dropped (#9589)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 26cc766ded7f9b898554a346d1a0d4b6dc8837e9
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Thu Aug 31 21:57:11 2023 -0500

[HUDI-6579] Fix streaming write when meta cols dropped (#9589)
---
 .../main/scala/org/apache/hudi/DefaultSource.scala | 36 +++---
 .../org/apache/hudi/HoodieCreateRecordUtils.scala  | 11 +++
 .../org/apache/hudi/HoodieSparkSqlWriter.scala | 14 -
 3 files changed, 29 insertions(+), 32 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala
index 5a0b0a53d33..f982fb1e1c3 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala
@@ -19,17 +19,17 @@ package org.apache.hudi
 
 import org.apache.hadoop.fs.Path
 import org.apache.hudi.DataSourceReadOptions._
-import org.apache.hudi.DataSourceWriteOptions.{BOOTSTRAP_OPERATION_OPT_VAL, 
OPERATION, RECORDKEY_FIELD, SPARK_SQL_WRITES_PREPPED_KEY, 
STREAMING_CHECKPOINT_IDENTIFIER}
+import org.apache.hudi.DataSourceWriteOptions.{BOOTSTRAP_OPERATION_OPT_VAL, 
OPERATION, STREAMING_CHECKPOINT_IDENTIFIER}
 import org.apache.hudi.cdc.CDCRelation
 import org.apache.hudi.common.fs.FSUtils
 import org.apache.hudi.common.model.HoodieTableType.{COPY_ON_WRITE, 
MERGE_ON_READ}
-import org.apache.hudi.common.model.{HoodieRecord, WriteConcurrencyMode}
+import org.apache.hudi.common.model.WriteConcurrencyMode
 import org.apache.hudi.common.table.timeline.HoodieInstant
 import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
 import org.apache.hudi.common.util.ConfigUtils
 import org.apache.hudi.common.util.ValidationUtils.checkState
 import org.apache.hudi.config.HoodieBootstrapConfig.DATA_QUERIES_ONLY
-import 
org.apache.hudi.config.HoodieWriteConfig.{SPARK_SQL_MERGE_INTO_PREPPED_KEY, 
WRITE_CONCURRENCY_MODE}
+import org.apache.hudi.config.HoodieWriteConfig.WRITE_CONCURRENCY_MODE
 import org.apache.hudi.exception.HoodieException
 import org.apache.hudi.util.PathUtils
 import org.apache.spark.sql.execution.streaming.{Sink, Source}
@@ -124,21 +124,21 @@ class DefaultSource extends RelationProvider
   }
 
   /**
-* This DataSource API is used for writing the DataFrame at the 
destination. For now, we are returning a dummy
-* relation here because Spark does not really make use of the relation 
returned, and just returns an empty
-* dataset at 
[[org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run()]]. 
This saves us the cost
-* of creating and returning a parquet relation here.
-*
-* TODO: Revisit to return a concrete relation here when we support CREATE 
TABLE AS for Hudi with DataSource API.
-*   That is the only case where Spark seems to actually need a 
relation to be returned here
-*   
[[org.apache.spark.sql.execution.datasources.DataSource.writeAndRead()]]
-*
-* @param sqlContext Spark SQL Context
-* @param mode Mode for saving the DataFrame at the destination
-* @param optParams Parameters passed as part of the DataFrame write 
operation
-* @param rawDf Spark DataFrame to be written
-* @return Spark Relation
-*/
+   * This DataSource API is used for writing the DataFrame at the destination. 
For now, we are returning a dummy
+   * relation here because Spark does not really make use of the relation 
returned, and just returns an empty
+   * dataset at 
[[org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run()]]. 
This saves us the cost
+   * of creating and returning a parquet relation here.
+   *
+   * TODO: Revisit to return a concrete relation here when we support CREATE 
TABLE AS for Hudi with DataSource API.
+   * That is the only case where Spark seems to actually need a relation to be 
returned here
+   * [[org.apache.spark.sql.execution.datasources.DataSource.writeAndRead()]]
+   *
+   * @param sqlContext Spark SQL Context
+   * @param mode   Mode for saving the DataFrame at the destination
+   * @param optParams  Parameters passed as part of the DataFrame write 
operation
+   * @param df Spark DataFrame to be written
+   * @return Spark Relation
+   */
   override def createRelation(sqlContext: SQLContext,
   mode: SaveMode,
   optParams: Map[String, String],
diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCreateRecordUtils.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCreateRecordUtils.scala
index b7d9429331e..e9201cc66cc 

[hudi] 08/37: [MINOR] Fix ut due to the scala compile ambiguity of Properties#putAll (#9601)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit b7a1f80062b15508cb82dc31681b93dcd8d0bf93
Author: xuzifu666 
AuthorDate: Sat Sep 2 17:50:48 2023 +0800

[MINOR] Fix ut due to the scala compile ambiguity of Properties#putAll 
(#9601)

Co-authored-by: xuyu <11161...@vivo.com>
---
 .../org/apache/hudi/functional/RecordLevelIndexTestBase.scala  | 7 ++-
 .../org/apache/hudi/functional/TestColumnStatsIndexWithSQL.scala   | 6 ++
 .../scala/org/apache/hudi/functional/TestMetadataRecordIndex.scala | 6 ++
 3 files changed, 6 insertions(+), 13 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/RecordLevelIndexTestBase.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/RecordLevelIndexTestBase.scala
index fcaac58e072..8e898deb537 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/RecordLevelIndexTestBase.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/RecordLevelIndexTestBase.scala
@@ -23,7 +23,7 @@ import org.apache.hudi.DataSourceWriteOptions._
 import org.apache.hudi.client.SparkRDDWriteClient
 import org.apache.hudi.client.common.HoodieSparkEngineContext
 import org.apache.hudi.client.utils.MetadataConversionUtils
-import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.config.{HoodieMetadataConfig, TypedProperties}
 import org.apache.hudi.common.model._
 import org.apache.hudi.common.table.timeline.HoodieInstant
 import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
@@ -37,12 +37,10 @@ import org.apache.spark.sql.functions.{col, not}
 import org.junit.jupiter.api.Assertions.{assertEquals, assertTrue}
 import org.junit.jupiter.api._
 
-import java.util.Properties
 import java.util.concurrent.atomic.AtomicInteger
 import java.util.stream.Collectors
 import scala.collection.JavaConverters._
 import scala.collection.{JavaConverters, mutable}
-import scala.util.Using
 
 class RecordLevelIndexTestBase extends HoodieSparkClientTestBase {
   var spark: SparkSession = _
@@ -230,8 +228,7 @@ class RecordLevelIndexTestBase extends 
HoodieSparkClientTestBase {
   }
 
   protected def getWriteConfig(hudiOpts: Map[String, String]): 
HoodieWriteConfig = {
-val props = new Properties()
-props.putAll(JavaConverters.mapAsJavaMapConverter(hudiOpts).asJava)
+val props = 
TypedProperties.fromMap(JavaConverters.mapAsJavaMapConverter(hudiOpts).asJava)
 HoodieWriteConfig.newBuilder()
   .withProps(props)
   .withPath(basePath)
diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColumnStatsIndexWithSQL.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColumnStatsIndexWithSQL.scala
index 1bb35bc150c..bb0c0065a91 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColumnStatsIndexWithSQL.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColumnStatsIndexWithSQL.scala
@@ -22,7 +22,7 @@ import 
org.apache.hudi.DataSourceWriteOptions.{DELETE_OPERATION_OPT_VAL, PRECOMB
 import org.apache.hudi.client.SparkRDDWriteClient
 import org.apache.hudi.client.common.HoodieSparkEngineContext
 import org.apache.hudi.client.utils.MetadataConversionUtils
-import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.config.{HoodieMetadataConfig, TypedProperties}
 import org.apache.hudi.common.fs.FSUtils
 import org.apache.hudi.common.model.{HoodieCommitMetadata, HoodieTableType, 
WriteOperationType}
 import org.apache.hudi.common.table.HoodieTableConfig
@@ -40,7 +40,6 @@ import org.junit.jupiter.api.Assertions.{assertEquals, 
assertFalse, assertTrue}
 import org.junit.jupiter.params.ParameterizedTest
 import org.junit.jupiter.params.provider.MethodSource
 
-import java.util.Properties
 import scala.collection.JavaConverters
 import scala.jdk.CollectionConverters.{asScalaIteratorConverter, 
collectionAsScalaIterableConverter}
 
@@ -299,8 +298,7 @@ class TestColumnStatsIndexWithSQL extends 
ColumnStatIndexTestBase {
   }
 
   protected def getWriteConfig(hudiOpts: Map[String, String]): 
HoodieWriteConfig = {
-val props = new Properties()
-props.putAll(JavaConverters.mapAsJavaMapConverter(hudiOpts).asJava)
+val props = 
TypedProperties.fromMap(JavaConverters.mapAsJavaMapConverter(hudiOpts).asJava)
 HoodieWriteConfig.newBuilder()
   .withProps(props)
   .withPath(basePath)
diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMetadataRecordIndex.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMetadataRecordIndex.scala
index 0f716e18951..e29b2a2b0ed 100644
--- 
a/hudi

[hudi] 35/37: [HUDI-6842] Fixing flaky tests for async clustering test (#9671)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit da81614a0deebd801cb256032deea26869d634de
Author: Sivabalan Narayanan 
AuthorDate: Tue Sep 12 06:20:03 2023 -0400

[HUDI-6842] Fixing flaky tests for async clustering test (#9671)
---
 .../apache/hudi/io/TestHoodieTimelineArchiver.java | 20 +-
 .../deltastreamer/HoodieDeltaStreamerTestBase.java | 14 +
 .../deltastreamer/TestHoodieDeltaStreamer.java | 24 ++
 3 files changed, 44 insertions(+), 14 deletions(-)

diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/io/TestHoodieTimelineArchiver.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/io/TestHoodieTimelineArchiver.java
index f49f3d5920a..c8907fba510 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/io/TestHoodieTimelineArchiver.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/io/TestHoodieTimelineArchiver.java
@@ -684,7 +684,7 @@ public class TestHoodieTimelineArchiver extends 
HoodieSparkClientTestHarness {
 assertThrows(HoodieException.class, () -> 
metaClient.getArchivedTimeline().reload());
   }
 
-  @Test
+  @Disabled("HUDI-6841")
   public void testArchivalWithMultiWritersMDTDisabled() throws Exception {
 testArchivalWithMultiWriters(false);
   }
@@ -750,17 +750,27 @@ public class TestHoodieTimelineArchiver extends 
HoodieSparkClientTestHarness {
 }
   }
 
-  public static CompletableFuture 
allOfTerminateOnFailure(List> futures) {
+  private static CompletableFuture 
allOfTerminateOnFailure(List> futures) {
 CompletableFuture failure = new CompletableFuture();
 AtomicBoolean jobFailed = new AtomicBoolean(false);
-for (CompletableFuture f : futures) {
-  f.exceptionally(ex -> {
+int counter = 0;
+while (counter < futures.size()) {
+  CompletableFuture curFuture = futures.get(counter);
+  int finalCounter = counter;
+  curFuture.exceptionally(ex -> {
 if (!jobFailed.getAndSet(true)) {
   LOG.warn("One of the job failed. Cancelling all other futures. " + 
ex.getCause() + ", " + ex.getMessage());
-  futures.forEach(future -> future.cancel(true));
+  int secondCounter = 0;
+  while (secondCounter < futures.size()) {
+if (secondCounter != finalCounter) {
+  futures.get(secondCounter).cancel(true);
+}
+secondCounter++;
+  }
 }
 return null;
   });
+  counter++;
 }
 return CompletableFuture.anyOf(failure, 
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])));
   }
diff --git 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerTestBase.java
 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerTestBase.java
index b117b2001fa..be5e47faf70 100644
--- 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerTestBase.java
+++ 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerTestBase.java
@@ -697,5 +697,19 @@ public class HoodieDeltaStreamerTestBase extends 
UtilitiesTestBase {
   int numDeltaCommits = timeline.countInstants();
   assertTrue(minExpected <= numDeltaCommits, "Got=" + numDeltaCommits + ", 
exp >=" + minExpected);
 }
+
+static void assertAtLeastNCommitsAfterRollback(int minExpectedRollback, 
int minExpectedCommits, String tablePath, FileSystem fs) {
+  HoodieTableMetaClient meta = 
HoodieTableMetaClient.builder().setConf(fs.getConf()).setBasePath(tablePath).setLoadActiveTimelineOnLoad(true).build();
+  HoodieTimeline timeline = 
meta.getActiveTimeline().getRollbackTimeline().filterCompletedInstants();
+  LOG.info("Rollback Timeline Instants=" + 
meta.getActiveTimeline().getInstants());
+  int numRollbackCommits = timeline.countInstants();
+  assertTrue(minExpectedRollback <= numRollbackCommits, "Got=" + 
numRollbackCommits + ", exp >=" + minExpectedRollback);
+  HoodieInstant firstRollback = timeline.getInstants().get(0);
+  //
+  HoodieTimeline commitsTimeline = 
meta.getActiveTimeline().filterCompletedInstants()
+  .filter(instant -> 
HoodieTimeline.compareTimestamps(instant.getTimestamp(), 
HoodieTimeline.GREATER_THAN, firstRollback.getTimestamp()));
+  int numCommits = commitsTimeline.countInstants();
+  assertTrue(minExpectedCommits <= numCommits, "Got=" + numCommits + ", 
exp >=" + minExpectedCommits);
+}
   }
 }
diff --git 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieDeltaStreamer.java
 
b/hudi-utilities/src/test/java/org/apache/hu

[hudi] 17/37: [HUDI-6397][HUDI-6759] Fixing misc bugs w/ metadata table (#9546)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 135387c31774c41130ec3aaa5e02d0339817
Author: Sivabalan Narayanan 
AuthorDate: Wed Sep 6 13:56:21 2023 -0400

[HUDI-6397][HUDI-6759] Fixing misc bugs w/ metadata table (#9546)

1. This commit allows users to cleanly disable the metadata table through write configs.
2. The handling of valid instants while reading from the metadata table (MDT) is now 
solid: any special instant time (one that carries an additional suffix compared to the 
data table's commit time) is treated as valid.

In particular, during MDT partition initialization the suffix is dynamic, so an exact 
match cannot be found. Instead, validity is decided based on the total instant time 
length, and all such special instant times are treated as valid.

In the LogRecordReader, uncommitted instants are ignored first; if an instant is 
completed in the MDT timeline, it is then checked against the instantRange. So it 
is fine to return true for any special instant time.
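
A simplified, standalone sketch of the validity check described above (it
assumes the standard 17-character yyyyMMddHHmmssSSS instant format; the
constant and method names are illustrative, not the HoodieTableMetadataUtil
API):

    import java.util.Set;

    public class MdtInstantValiditySketch {

      // Data-table instant times use the standard 17-digit format; MDT initialization
      // instants append a dynamic numeric suffix and are therefore longer. Any longer
      // instant is treated as one of these special instants and considered valid. The
      // log record reader still skips uncommitted instants and applies the instant
      // range afterwards.
      private static final int STANDARD_INSTANT_LENGTH = 17;

      static boolean isValidInstant(String instantTime, Set<String> completedDataInstants) {
        if (instantTime.length() > STANDARD_INSTANT_LENGTH) {
          return true; // special MDT instant with a suffix
        }
        return completedDataInstants.contains(instantTime);
      }

      public static void main(String[] args) {
        Set<String> completed = Set.of("20230906135621000");
        System.out.println(isValidInstant("20230906135621000", completed));    // true: completed in DT
        System.out.println(isValidInstant("20230906135621000010", completed)); // true: suffixed MDT instant
        System.out.println(isValidInstant("20230906135622000", completed));    // false: unknown instant
      }
    }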
---
 .../metadata/HoodieBackedTableMetadataWriter.java  |  2 +-
 .../java/org/apache/hudi/table/HoodieTable.java|  6 +
 .../org/apache/hudi/table/HoodieSparkTable.java|  3 ++-
 .../functional/TestHoodieBackedMetadata.java   | 28 ++
 .../hudi/metadata/HoodieBackedTableMetadata.java   |  1 +
 .../hudi/metadata/HoodieTableMetadataUtil.java | 11 +
 .../sink/TestStreamWriteOperatorCoordinator.java   |  9 +++
 7 files changed, 40 insertions(+), 20 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index 460bfa2c6e2..8a930ba5972 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -172,7 +172,7 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableM
 
 this.dataMetaClient = 
HoodieTableMetaClient.builder().setConf(hadoopConf).setBasePath(dataWriteConfig.getBasePath()).build();
 
-if (dataMetaClient.getTableConfig().isMetadataTableAvailable() || 
writeConfig.isMetadataTableEnabled()) {
+if (writeConfig.isMetadataTableEnabled()) {
   this.metadataWriteConfig = 
HoodieMetadataWriteUtils.createMetadataWriteConfig(writeConfig, 
failedWritesCleaningPolicy);
 
   try {
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java
index f1de637edf5..101931f8c76 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java
@@ -1003,12 +1003,8 @@ public abstract class HoodieTable implements 
Serializable {
 // Only execute metadata table deletion when all the following conditions 
are met
 // (1) This is data table
 // (2) Metadata table is disabled in HoodieWriteConfig for the writer
-// (3) Check `HoodieTableConfig.TABLE_METADATA_PARTITIONS`.  Either the 
table config
-// does not exist, or the table config is non-empty indicating that 
metadata table
-// partitions are ready to use
 return !HoodieTableMetadata.isMetadataTable(metaClient.getBasePath())
-&& !config.isMetadataTableEnabled()
-&& !metaClient.getTableConfig().getMetadataPartitions().isEmpty();
+&& !config.isMetadataTableEnabled();
   }
 
   /**
diff --git 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/HoodieSparkTable.java
 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/HoodieSparkTable.java
index a5202fb7bbe..111b254634b 100644
--- 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/HoodieSparkTable.java
+++ 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/HoodieSparkTable.java
@@ -91,7 +91,7 @@ public abstract class HoodieSparkTable
   protected Option getMetadataWriter(
   String triggeringInstantTimestamp,
   HoodieFailedWritesCleaningPolicy failedWritesCleaningPolicy) {
-if (config.isMetadataTableEnabled() || 
metaClient.getTableConfig().isMetadataTableAvailable()) {
+if (config.isMetadataTableEnabled()) {
   // if any partition is deleted, we need to reload the metadata table 
writer so that new table configs are picked up
   // to reflect the delete mdt partitions.
   deleteMetadataIndexIfNecessary();
@@ -112,6 +112,7 @@ public abstract class HoodieSparkTable
 throw new HoodieMetadataException("Checking existence of metadata 
table failed", e);
   }
 

[hudi] 26/37: [HUDI-6728] Update BigQuery manifest sync to support schema evolution (#9482)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit a808f74ce0342f93af131de5edc6cae56b292fd7
Author: Tim Brown 
AuthorDate: Mon Sep 11 06:35:02 2023 -0500

[HUDI-6728] Update BigQuery manifest sync to support schema evolution 
(#9482)

Adds schema evolution support to the BigQuerySyncTool by converting
the Hudi schema into the BigQuery Schema format when creating
and updating the table.
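
A minimal sketch of the kind of conversion involved (hypothetical helper, not
the actual BigQuerySchemaResolver; it only maps a few flat primitive types,
whereas the real resolver also handles nullable unions, logical types, nesting,
and dropping BigQuery partition columns):

    import java.util.ArrayList;
    import java.util.List;

    import com.google.cloud.bigquery.Field;
    import com.google.cloud.bigquery.Schema;
    import com.google.cloud.bigquery.StandardSQLTypeName;

    public class AvroToBigQuerySketch {

      // Maps a flat Avro record schema to a BigQuery schema.
      static Schema toBigQuerySchema(org.apache.avro.Schema avroSchema) {
        List<Field> fields = new ArrayList<>();
        for (org.apache.avro.Schema.Field avroField : avroSchema.getFields()) {
          fields.add(Field.of(avroField.name(), toSqlType(avroField.schema().getType())));
        }
        return Schema.of(fields);
      }

      private static StandardSQLTypeName toSqlType(org.apache.avro.Schema.Type type) {
        switch (type) {
          case INT:
          case LONG:
            return StandardSQLTypeName.INT64;
          case FLOAT:
          case DOUBLE:
            return StandardSQLTypeName.FLOAT64;
          case BOOLEAN:
            return StandardSQLTypeName.BOOL;
          default:
            return StandardSQLTypeName.STRING;
        }
      }
    }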
---
 hudi-gcp/pom.xml   |  13 +
 .../hudi/gcp/bigquery/BigQuerySchemaResolver.java  | 197 ++
 .../hudi/gcp/bigquery/BigQuerySyncConfig.java  |   3 +-
 .../apache/hudi/gcp/bigquery/BigQuerySyncTool.java |  95 ---
 .../gcp/bigquery/HoodieBigQuerySyncClient.java |  49 +++-
 .../gcp/bigquery/TestBigQuerySchemaResolver.java   | 299 +
 .../hudi/gcp/bigquery/TestBigQuerySyncTool.java| 137 ++
 .../gcp/bigquery/TestHoodieBigQuerySyncClient.java | 119 
 .../org/apache/hudi/sync/adb/AdbSyncConfig.java|   2 +-
 .../apache/hudi/sync/common/HoodieSyncClient.java  |   4 +
 .../hudi/sync/common/util/ManifestFileWriter.java  |  28 +-
 .../sync/common/util/TestManifestFileWriter.java   |   8 +-
 12 files changed, 895 insertions(+), 59 deletions(-)

diff --git a/hudi-gcp/pom.xml b/hudi-gcp/pom.xml
index 202cbc2f8d9..c0a401551de 100644
--- a/hudi-gcp/pom.xml
+++ b/hudi-gcp/pom.xml
@@ -84,6 +84,12 @@ See 
https://github.com/GoogleCloudPlatform/cloud-opensource-java/wiki/The-Google
   parquet-avro
 
 
+
+
+  org.apache.avro
+  avro
+
+
 
 
   org.apache.hadoop
@@ -97,6 +103,13 @@ See 
https://github.com/GoogleCloudPlatform/cloud-opensource-java/wiki/The-Google
   test
 
 
+
+  org.apache.hudi
+  hudi-hive-sync
+  ${project.version}
+  test
+
+
   
 
   
diff --git 
a/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySchemaResolver.java
 
b/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySchemaResolver.java
new file mode 100644
index 000..035ce604e2b
--- /dev/null
+++ 
b/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySchemaResolver.java
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.gcp.bigquery;
+
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.TableSchemaResolver;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.exception.HoodieException;
+
+import com.google.cloud.bigquery.Field;
+import com.google.cloud.bigquery.FieldList;
+import com.google.cloud.bigquery.Schema;
+import com.google.cloud.bigquery.StandardSQLTypeName;
+import org.apache.avro.LogicalType;
+import org.apache.avro.LogicalTypes;
+
+import java.util.List;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+
+/**
+ * Extracts the BigQuery schema from a Hudi table.
+ */
+class BigQuerySchemaResolver {
+  private static final BigQuerySchemaResolver INSTANCE = new 
BigQuerySchemaResolver(TableSchemaResolver::new);
+
+  private final Function 
tableSchemaResolverSupplier;
+
+  @VisibleForTesting
+  BigQuerySchemaResolver(Function 
tableSchemaResolverSupplier) {
+this.tableSchemaResolverSupplier = tableSchemaResolverSupplier;
+  }
+
+  static BigQuerySchemaResolver getInstance() {
+return INSTANCE;
+  }
+
+  /**
+   * Get the BigQuery schema for the table. If the BigQuery table is 
configured with partitioning, the caller must pass in the partition fields so 
that they are not returned in the schema.
+   * If the partition fields are in the schema, it will cause an error when 
querying the table since BigQuery will treat it as a duplicate column.
+   * @param metaClient Meta client for the Hudi table
+   * @param partitionFields The fields that are used for partitioning in 
BigQuery
+   * @return The BigQuery schema for the table
+   */
+  Schema getTableSchema(HoodieTableMetaClient metaClient, List 
partitionFields) {
+try {
+  Schema schema = 
convertSchema(tableSchemaRe

[hudi] 25/37: [HUDI-6835] Adjust spark sql core flow test scenarios (#9664)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit f1114af22b52d663ad24f3fa5844464e65981be7
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Sun Sep 10 20:13:56 2023 -0500

[HUDI-6835] Adjust spark sql core flow test scenarios (#9664)
---
 .../hudi/functional/TestSparkSqlCoreFlow.scala | 160 ++---
 1 file changed, 76 insertions(+), 84 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSparkSqlCoreFlow.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSparkSqlCoreFlow.scala
index 7510204bac4..220c6930c4f 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSparkSqlCoreFlow.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSparkSqlCoreFlow.scala
@@ -46,24 +46,22 @@ class TestSparkSqlCoreFlow extends HoodieSparkSqlTestBase {
 
   //params for core flow tests
   val params: List[String] = List(
-
"COPY_ON_WRITE|false|false|org.apache.hudi.keygen.SimpleKeyGenerator|BLOOM",
-"COPY_ON_WRITE|true|false|org.apache.hudi.keygen.SimpleKeyGenerator|BLOOM",
-"COPY_ON_WRITE|true|true|org.apache.hudi.keygen.SimpleKeyGenerator|BLOOM",
-
"COPY_ON_WRITE|false|false|org.apache.hudi.keygen.SimpleKeyGenerator|SIMPLE",
-
"COPY_ON_WRITE|true|false|org.apache.hudi.keygen.SimpleKeyGenerator|SIMPLE",
-"COPY_ON_WRITE|true|true|org.apache.hudi.keygen.SimpleKeyGenerator|SIMPLE",
-
"COPY_ON_WRITE|false|false|org.apache.hudi.keygen.NonpartitionedKeyGenerator|GLOBAL_BLOOM",
-
"COPY_ON_WRITE|true|false|org.apache.hudi.keygen.NonpartitionedKeyGenerator|GLOBAL_BLOOM",
-
"COPY_ON_WRITE|true|true|org.apache.hudi.keygen.NonpartitionedKeyGenerator|GLOBAL_BLOOM",
-
"MERGE_ON_READ|false|false|org.apache.hudi.keygen.SimpleKeyGenerator|BLOOM",
-"MERGE_ON_READ|true|false|org.apache.hudi.keygen.SimpleKeyGenerator|BLOOM",
-"MERGE_ON_READ|true|true|org.apache.hudi.keygen.SimpleKeyGenerator|BLOOM",
-
"MERGE_ON_READ|false|false|org.apache.hudi.keygen.SimpleKeyGenerator|SIMPLE",
-
"MERGE_ON_READ|true|false|org.apache.hudi.keygen.SimpleKeyGenerator|SIMPLE",
-"MERGE_ON_READ|true|true|org.apache.hudi.keygen.SimpleKeyGenerator|SIMPLE",
-
"MERGE_ON_READ|false|false|org.apache.hudi.keygen.NonpartitionedKeyGenerator|GLOBAL_BLOOM",
-
"MERGE_ON_READ|true|false|org.apache.hudi.keygen.NonpartitionedKeyGenerator|GLOBAL_BLOOM",
-
"MERGE_ON_READ|true|true|org.apache.hudi.keygen.NonpartitionedKeyGenerator|GLOBAL_BLOOM"
+
"COPY_ON_WRITE|false|org.apache.hudi.keygen.SimpleKeyGenerator|GLOBAL_BLOOM",
+
"COPY_ON_WRITE|true|org.apache.hudi.keygen.SimpleKeyGenerator|GLOBAL_BLOOM",
+
"COPY_ON_WRITE|false|org.apache.hudi.keygen.SimpleKeyGenerator|GLOBAL_SIMPLE",
+
"COPY_ON_WRITE|true|org.apache.hudi.keygen.SimpleKeyGenerator|GLOBAL_SIMPLE",
+
"COPY_ON_WRITE|false|org.apache.hudi.keygen.NonpartitionedKeyGenerator|BLOOM",
+
"COPY_ON_WRITE|true|org.apache.hudi.keygen.NonpartitionedKeyGenerator|BLOOM",
+
"COPY_ON_WRITE|false|org.apache.hudi.keygen.NonpartitionedKeyGenerator|SIMPLE",
+
"COPY_ON_WRITE|true|org.apache.hudi.keygen.NonpartitionedKeyGenerator|SIMPLE",
+
"MERGE_ON_READ|false|org.apache.hudi.keygen.SimpleKeyGenerator|GLOBAL_BLOOM",
+
"MERGE_ON_READ|true|org.apache.hudi.keygen.SimpleKeyGenerator|GLOBAL_BLOOM",
+
"MERGE_ON_READ|false|org.apache.hudi.keygen.SimpleKeyGenerator|GLOBAL_SIMPLE",
+
"MERGE_ON_READ|true|org.apache.hudi.keygen.SimpleKeyGenerator|GLOBAL_SIMPLE",
+
"MERGE_ON_READ|false|org.apache.hudi.keygen.NonpartitionedKeyGenerator|BLOOM",
+
"MERGE_ON_READ|true|org.apache.hudi.keygen.NonpartitionedKeyGenerator|BLOOM",
+
"MERGE_ON_READ|false|org.apache.hudi.keygen.NonpartitionedKeyGenerator|SIMPLE",
+
"MERGE_ON_READ|true|org.apache.hudi.keygen.NonpartitionedKeyGenerator|SIMPLE"
   )
 
   //extracts the params and runs each core flow test
@@ -73,16 +71,15 @@ class TestSparkSqlCoreFlow extends HoodieSparkSqlTestBase {
   withTempDir { basePath =>
 testCoreFlows(basePath,
   tableType = splits(0),
-  isMetadataEnabledOnWrite = splits(1).toBoolean,
-  isMetadataEnabledOnRead = splits(2).toBoolean,
-  keyGenClass = splits(3),
-  indexType = splits(4))
+  isMetadataEnabled = splits(1).toBoolean,
+  keyGenClass = splits(2),
+  indexType = splits(3))
   }
 }

[hudi] 30/37: [MINOR] Fixing failing tests with BQ sync tests (#9684)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 0081f0ab46f686d3a44c3752221afb8541b06b36
Author: Sivabalan Narayanan 
AuthorDate: Mon Sep 11 17:57:23 2023 -0400

[MINOR] Fixing failing tests with BQ sync tests (#9684)
---
 .../apache/hudi/gcp/bigquery/TestHoodieBigQuerySyncClient.java   | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git 
a/hudi-gcp/src/test/java/org/apache/hudi/gcp/bigquery/TestHoodieBigQuerySyncClient.java
 
b/hudi-gcp/src/test/java/org/apache/hudi/gcp/bigquery/TestHoodieBigQuerySyncClient.java
index df7e6a9f31e..189f3efa222 100644
--- 
a/hudi-gcp/src/test/java/org/apache/hudi/gcp/bigquery/TestHoodieBigQuerySyncClient.java
+++ 
b/hudi-gcp/src/test/java/org/apache/hudi/gcp/bigquery/TestHoodieBigQuerySyncClient.java
@@ -94,8 +94,9 @@ public class TestHoodieBigQuerySyncClient {
 
 QueryJobConfiguration configuration = 
jobInfoCaptor.getValue().getConfiguration();
 assertEquals(configuration.getQuery(),
-String.format("CREATE EXTERNAL TABLE `%s.%s` ( field STRING ) WITH 
PARTITION COLUMNS OPTIONS (enable_list_inference=true, 
hive_partition_uri_prefix=\"%s\", uris=[\"%s\"], format=\"PARQUET\", "
-+ "file_set_spec_type=\"NEW_LINE_DELIMITED_MANIFEST\")", 
TEST_DATASET, TEST_TABLE, SOURCE_PREFIX, MANIFEST_FILE_URI));
+String.format("CREATE EXTERNAL TABLE `%s.%s.%s` ( field STRING ) WITH 
PARTITION COLUMNS OPTIONS (enable_list_inference=true, "
++ "hive_partition_uri_prefix=\"%s\", uris=[\"%s\"], 
format=\"PARQUET\", "
++ "file_set_spec_type=\"NEW_LINE_DELIMITED_MANIFEST\")", 
PROJECT_ID, TEST_DATASET, TEST_TABLE, SOURCE_PREFIX, MANIFEST_FILE_URI));
   }
 
   @Test
@@ -113,7 +114,7 @@ public class TestHoodieBigQuerySyncClient {
 
 QueryJobConfiguration configuration = 
jobInfoCaptor.getValue().getConfiguration();
 assertEquals(configuration.getQuery(),
-String.format("CREATE EXTERNAL TABLE `%s.%s` ( field STRING ) OPTIONS 
(enable_list_inference=true, uris=[\"%s\"], format=\"PARQUET\", "
-+ "file_set_spec_type=\"NEW_LINE_DELIMITED_MANIFEST\")", 
TEST_DATASET, TEST_TABLE, MANIFEST_FILE_URI));
+String.format("CREATE EXTERNAL TABLE `%s.%s.%s` ( field STRING ) 
OPTIONS (enable_list_inference=true, uris=[\"%s\"], format=\"PARQUET\", "
++ "file_set_spec_type=\"NEW_LINE_DELIMITED_MANIFEST\")", 
PROJECT_ID, TEST_DATASET, TEST_TABLE, MANIFEST_FILE_URI));
   }
 }



[hudi] branch release-0.14.0 updated (d995bb8262c -> 63a37211384)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a change to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git


from d995bb8262c [HUDI-6763] Optimize collect calls (#9561)
 new 655904a6f29 [HUDI-6562] Fixed issue for delete events for 
AWSDmsAvroPayload when CDC enabled (#9519)
 new 2e7e1b3a7b7 [MINOR] Fix failing schema evolution tests in Flink 
versions < 1.17 (#9586)
 new d4de4597849 [HUDI-6066] HoodieTableSource supports parquet predicate 
push down (#8437)
 new 15ecee9674e [MINOR] Update operator name for compact test 
class (#9583)
 new 26cc766ded7 [HUDI-6579] Fix streaming write when meta cols dropped 
(#9589)
 new 4bc41844957 [HUDI-6732] Allow wildcards from Spark-SQL entrypoints for 
drop partition DDL (#9491)
 new 033a9f80ff9 [HUDI-6813] Support table name for meta sync in bootstrap 
(#9600)
 new b7a1f80062b [MINOR] Fix ut due to the scala compile ambiguity of 
Properties#putAll (#9601)
 new 8b273631cfd [MINOR] Catch EntityNotFoundException correctly (#9595)
 new 605eb24b226 [HUDI-6808] SkipCompaction Config should not affect the 
stream read of the cow table (#9584)
 new 629ee75fe5f [HUDI-6812]Fix bootstrap operator null point exception 
while lastInstantTime is null (#9599)
 new 620ee24b02b [HUDI-6805] Print detailed error message in clustering 
(#9577)
 new a136369344f [HUDI-6804] Fix hive read schema evolution MOR table 
(#9573)
 new ed1d7c97d16 [HUDI-6818] Create a database automatically when using the 
flink catalog dfs mode (#9592)
 new 83cdca8bc5d [HUDI-6766] Fixing mysql debezium data loss  (#9475)
 new 46c170425a7 [HUDI-6819] Fix logic for throwing exception in 
getRecordIndexUpdates. (#9616)
 new 135387c3177 [HUDI-6397][HUDI-6759] Fixing misc bugs w/ metadata table 
(#9546)
 new bca4828bc08 [HUDI-2141] Support flink compaction metrics (#9515)
 new ae3d886e991 [HUDI-6736] Fixing rollback completion and commit timeline 
files removal (#9521)
 new a948fa09158 [HUDI-6833] Add field for tracking log files from failed 
commit in rollback metadata (#9653)
 new fadde0317fc [HUDI-6820] Close write clients in tests (#9642)
 new 688d6c07a21 [HUDI-6820] Fixing CI stability issues (#9661)
 new bba95305a07 [HUDI-6758] Fixing deducing spurious log blocks due to 
spark retries (#9611)
 new 4af3b7eefa6 [HUDI-6831] Add back missing project_id to query statement 
in BigQuerySyncTool (#9650)
 new f1114af22b5 [HUDI-6835] Adjust spark sql core flow test scenarios 
(#9664)
 new a808f74ce03 [HUDI-6728] Update BigQuery manifest sync to support 
schema evolution (#9482)
 new 5b99ed406ca [HUDI-6738] - Apply object filter before checkpoint 
batching in GcsEventsHoodieIncrSource (#9538)
 new 225c2ab5bd0 [HUDI-6838] Fix file writers to honor bloom filter configs 
(#9669)
 new 456f6731cc4 [HUDI-6753] Fix parquet inline reading flaky test (#9618)
 new 0081f0ab46f [MINOR] Fixing failing tests with BQ sync tests (#9684)
 new f33265d3bd8 [MINOR] Add timeout for github check 
test-hudi-hadoop-mr-and-hudi-java-client (#9682)
 new a03483f09c0 [MINOR] Avoiding to ingest update records to RLI (#9675)
 new c1a497059c4 [HUDI-6834] Fixing time travel queries when overlaps with 
cleaner and archival time window (#9666)
 new 88f744da58c [MINOR] Avoiding warn log for succeeding in first attempt 
(#9686)
 new da81614a0de [HUDI-6842] Fixing flaky tests for async clustering test 
(#9671)
 new 5af6d703994 [HUDI-6478] Deduce op as upsert for INSERT INTO (#9665)
 new 63a37211384 [HUDI-6724] - Defaulting previous Instant time to init 
time to enable full read of initial commit (#9473)

The 37 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .github/workflows/bot.yml  |  34 +-
 azure-pipelines-20230430.yml   |   2 +
 .../hudi/aws/sync/AWSGlueCatalogSyncClient.java|  21 +-
 .../hudi/cli/commands/TestRestoresCommand.java |  24 +-
 .../hudi/cli/integ/ITTestClusteringCommand.java|   8 +-
 .../hudi/cli/integ/ITTestCompactionCommand.java|   9 +-
 .../hudi/client/BaseHoodieTableServiceClient.java  |  57 --
 .../org/apache/hudi/config/HoodieIndexConfig.java  |  63 +-
 .../org/apache/hudi/config/HoodieWriteConfig.java  |   8 +-
 .../org/apache/hudi/io/HoodieAppendHandle.java |  22 +-
 .../hudi/io/HoodieMergeHandleWithChangeLog.java|   2 +-
 .../metadata/HoodieBackedTableMetadataWriter.java  |  17 +-
 .../java/org/apache/hudi/table/HoodieTable.java|   6 +-
 .../rollback/BaseRollbackActionExecutor.java   |  25 +-
 .../hudi/table/action/rollback/RollbackUtils.java  |   6 +-
 .../table/upgrade/SixToFiveDowngradeHandler.java   |   9 

[hudi] 28/37: [HUDI-6838] Fix file writers to honor bloom filter configs (#9669)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 225c2ab5bd09332aeeffb7a72fcdca0758181155
Author: Y Ethan Guo 
AuthorDate: Mon Sep 11 11:11:22 2023 -0700

[HUDI-6838] Fix file writers to honor bloom filter configs (#9669)
---
 .../org/apache/hudi/config/HoodieIndexConfig.java  | 63 ++
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  8 +--
 .../hudi/common/config/HoodieStorageConfig.java| 41 ++
 .../hudi/io/storage/HoodieFileWriterFactory.java   |  9 ++--
 .../org/apache/spark/sql/hudi/SparkHelpers.scala   |  7 +--
 5 files changed, 70 insertions(+), 58 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
index c77b9780548..1ed3b1c3054 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
@@ -18,11 +18,11 @@
 
 package org.apache.hudi.config;
 
-import org.apache.hudi.common.bloom.BloomFilterTypeCode;
 import org.apache.hudi.common.config.ConfigClassProperty;
 import org.apache.hudi.common.config.ConfigGroups;
 import org.apache.hudi.common.config.ConfigProperty;
 import org.apache.hudi.common.config.HoodieConfig;
+import org.apache.hudi.common.config.HoodieStorageConfig;
 import org.apache.hudi.common.engine.EngineType;
 import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.exception.HoodieIndexException;
@@ -42,6 +42,10 @@ import java.util.Arrays;
 import java.util.Properties;
 import java.util.stream.Collectors;
 
+import static 
org.apache.hudi.common.config.HoodieStorageConfig.BLOOM_FILTER_DYNAMIC_MAX_ENTRIES;
+import static 
org.apache.hudi.common.config.HoodieStorageConfig.BLOOM_FILTER_FPP_VALUE;
+import static 
org.apache.hudi.common.config.HoodieStorageConfig.BLOOM_FILTER_NUM_ENTRIES_VALUE;
+import static 
org.apache.hudi.common.config.HoodieStorageConfig.BLOOM_FILTER_TYPE;
 import static org.apache.hudi.config.HoodieHBaseIndexConfig.GET_BATCH_SIZE;
 import static org.apache.hudi.config.HoodieHBaseIndexConfig.PUT_BATCH_SIZE;
 import static org.apache.hudi.config.HoodieHBaseIndexConfig.TABLENAME;
@@ -87,29 +91,6 @@ public class HoodieIndexConfig extends HoodieConfig {
   + "It will take precedence over the hoodie.index.type configuration 
if specified");
 
   // * Bloom Index configs *
-  public static final ConfigProperty BLOOM_FILTER_NUM_ENTRIES_VALUE = 
ConfigProperty
-  .key("hoodie.index.bloom.num_entries")
-  .defaultValue("6")
-  .markAdvanced()
-  .withDocumentation("Only applies if index type is BLOOM. "
-  + "This is the number of entries to be stored in the bloom filter. "
-  + "The rationale for the default: Assume the maxParquetFileSize is 
128MB and averageRecordSize is 1kb and "
-  + "hence we approx a total of 130K records in a file. The default 
(6) is roughly half of this approximation. "
-  + "Warning: Setting this very low, will generate a lot of false 
positives and index lookup "
-  + "will have to scan a lot more files than it has to and setting 
this to a very high number will "
-  + "increase the size every base file linearly (roughly 4KB for every 
5 entries). "
-  + "This config is also used with DYNAMIC bloom filter which 
determines the initial size for the bloom.");
-
-  public static final ConfigProperty BLOOM_FILTER_FPP_VALUE = 
ConfigProperty
-  .key("hoodie.index.bloom.fpp")
-  .defaultValue("0.1")
-  .markAdvanced()
-  .withDocumentation("Only applies if index type is BLOOM. "
-  + "Error rate allowed given the number of entries. This is used to 
calculate how many bits should be "
-  + "assigned for the bloom filter and the number of hash functions. 
This is usually set very low (default: 0.1), "
-  + "we like to tradeoff disk space for lower false positives. "
-  + "If the number of entries added to bloom filter exceeds the 
configured value (hoodie.index.bloom.num_entries), "
-  + "then this fpp may not be honored.");
 
   public static final ConfigProperty BLOOM_INDEX_PARALLELISM = 
ConfigProperty
   .key("hoodie.bloom.index.parallelism")
@@ -166,20 +147,6 @@ public class HoodieIndexConfig extends HoodieConfig {
   + "When true, bucketized bloom filtering is enabled. "
   + "This reduces skew seen in sort based bloom index lookup");
 
-  public stat

[hudi] 10/37: [HUDI-6808] SkipCompaction Config should not affect the stream read of the cow table (#9584)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 605eb24b226fa7131a3f76c70946369564f630cd
Author: zhuanshenbsj1 <34104400+zhuanshenb...@users.noreply.github.com>
AuthorDate: Mon Sep 4 09:56:52 2023 +0800

[HUDI-6808] SkipCompaction Config should not affect the stream read of the 
cow table (#9584)
---
 .../src/main/java/org/apache/hudi/source/IncrementalInputSplits.java| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java
index fd6534d7f76..05d11bf746f 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java
@@ -603,7 +603,7 @@ public class IncrementalInputSplits implements Serializable 
{
   @VisibleForTesting
   public HoodieTimeline filterInstantsAsPerUserConfigs(HoodieTimeline 
timeline) {
 final HoodieTimeline oriTimeline = timeline;
-if (this.skipCompaction) {
+if (OptionsResolver.isMorTable(this.conf) & this.skipCompaction) {
   // the compaction commit uses 'commit' as action which is tricky
   timeline = timeline.filter(instant -> 
!instant.getAction().equals(HoodieTimeline.COMMIT_ACTION));
 }



[hudi] 02/37: [MINOR] Fix failing schema evolution tests in Flink versions < 1.17 (#9586)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 2e7e1b3a7b74091299a883b2a7418e5d16915b21
Author: voonhous 
AuthorDate: Fri Sep 1 09:09:19 2023 +0800

[MINOR] Fix failing schema evolution tests in Flink versions < 1.17 (#9586)

Co-authored-by: voon 
---
 .../apache/hudi/table/ITTestSchemaEvolution.java   | 23 +++---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestSchemaEvolution.java
 
b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestSchemaEvolution.java
index 29d142f10c3..172b63b8a88 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestSchemaEvolution.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestSchemaEvolution.java
@@ -181,6 +181,7 @@ public class ITTestSchemaEvolution {
 + "  `partition` string"
 + ") partitioned by (`partition`) with (" + tableOptions + ")"
 );
+// An explicit cast is performed for map-values to prevent implicit 
map.key strings from being truncated/extended based on the last row's inferred 
schema
 //language=SQL
 tEnv.executeSql(""
 + "insert into t1 select "
@@ -195,14 +196,14 @@ public class ITTestSchemaEvolution {
 + "  cast(`partition` as string) "
 + "from (values "
 + "  ('id0', 'Indica', 'F', 12, '2000-01-01 00:00:00', cast(null as 
row), map['Indica', 1212], 
array[12], 'par0'),"
-+ "  ('id1', 'Danny', 'M', 23, '2000-01-01 00:00:01', row(1, 's1', '', 
1), map['Danny', 2323], array[23, 23], 'par1'),"
-+ "  ('id2', 'Stephen', 'M', 33, '2000-01-01 00:00:02', row(2, 's2', 
'', 2), map['Stephen', ], array[33], 'par1'),"
-+ "  ('id3', 'Julian', 'M', 53, '2000-01-01 00:00:03', row(3, 's3', 
'', 3), map['Julian', 5353], array[53, 53], 'par2'),"
-+ "  ('id4', 'Fabian', 'M', 31, '2000-01-01 00:00:04', row(4, 's4', 
'', 4), map['Fabian', 3131], array[31], 'par2'),"
-+ "  ('id5', 'Sophia', 'F', 18, '2000-01-01 00:00:05', row(5, 's5', 
'', 5), map['Sophia', 1818], array[18, 18], 'par3'),"
-+ "  ('id6', 'Emma', 'F', 20, '2000-01-01 00:00:06', row(6, 's6', '', 
6), map['Emma', 2020], array[20], 'par3'),"
-+ "  ('id7', 'Bob', 'M', 44, '2000-01-01 00:00:07', row(7, 's7', '', 
7), map['Bob', ], array[44, 44], 'par4'),"
-+ "  ('id8', 'Han', 'M', 56, '2000-01-01 00:00:08', row(8, 's8', '', 
8), map['Han', 5656], array[56, 56, 56], 'par4')"
++ "  ('id1', 'Danny', 'M', 23, '2000-01-01 00:00:01', row(1, 's1', '', 
1), cast(map['Danny', 2323] as map), array[23, 23], 'par1'),"
++ "  ('id2', 'Stephen', 'M', 33, '2000-01-01 00:00:02', row(2, 's2', 
'', 2), cast(map['Stephen', ] as map), array[33], 'par1'),"
++ "  ('id3', 'Julian', 'M', 53, '2000-01-01 00:00:03', row(3, 's3', 
'', 3), cast(map['Julian', 5353] as map), array[53, 53], 'par2'),"
++ "  ('id4', 'Fabian', 'M', 31, '2000-01-01 00:00:04', row(4, 's4', 
'', 4), cast(map['Fabian', 3131] as map), array[31], 'par2'),"
++ "  ('id5', 'Sophia', 'F', 18, '2000-01-01 00:00:05', row(5, 's5', 
'', 5), cast(map['Sophia', 1818] as map), array[18, 18], 'par3'),"
++ "  ('id6', 'Emma', 'F', 20, '2000-01-01 00:00:06', row(6, 's6', '', 
6), cast(map['Emma', 2020] as map), array[20], 'par3'),"
++ "  ('id7', 'Bob', 'M', 44, '2000-01-01 00:00:07', row(7, 's7', '', 
7), cast(map['Bob', ] as map), array[44, 44], 'par4'),"
++ "  ('id8', 'Han', 'M', 56, '2000-01-01 00:00:08', row(8, 's8', '', 
8), cast(map['Han', 5656] as map), array[56, 56, 56], 'par4')"
 + ") as A(uuid, name, gender, age, ts, f_struct, f_map, f_array, 
`partition`)"
 ).await();
   }
@@ -294,11 +295,11 @@ public class ITTestSchemaEvolution {
 + "  cast(new_map_col as map),"
 + "  cast(`partition` as string) "
 + "from (values "
-+ "  ('id1', '23', 'Danny', '', 1.1, '2000-01-01 00:00:01', row(1, 
1, 's1', 11, 't1', 'drop_add1'), map['Danny', 2323.23], array[23, 23, 23], "
++ "  ('id1', '23', 'Danny', '', 1.1, '2000-01-01 00:00:01', row(1, 
1, 's1', 11, 't1', 'drop_add1'), cast(map['Danny', 2323.23] as map), array[23, 23, 23], "
 + "  row(1, '1'), array['1'], Map['k1','v1'], 'par1'),"
-+ "  ('id9', 'unknown', 'Alice', '', 9.9, '2000-01-01 00:00:09', 
row(9, 9, 's9', 99, 't9', 'drop_add9'), map['Alice', .99], array[, 
], "
++ "

[hudi] 06/37: [HUDI-6732] Allow wildcards from Spark-SQL entrypoints for drop partition DDL (#9491)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 4bc418449577d8b529216d3405d25f46738ed173
Author: voonhous 
AuthorDate: Fri Sep 1 13:54:27 2023 +0800

[HUDI-6732] Allow wildcards from Spark-SQL entrypoints for drop partition 
DDL (#9491)
---
 .../org/apache/hudi/HoodieSparkSqlWriter.scala |  6 ++--
 .../sql/hudi/TestAlterTableDropPartition.scala | 36 ++
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
index cf78e514dda..6d0ce7d16bf 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
@@ -606,7 +606,8 @@ object HoodieSparkSqlWriter {
*/
   private def resolvePartitionWildcards(partitions: List[String], jsc: 
JavaSparkContext, cfg: HoodieConfig, basePath: String): List[String] = {
 //find out if any of the input partitions have wildcards
-var (wildcardPartitions, fullPartitions) = partitions.partition(partition 
=> partition.contains("*"))
+//note:spark-sql may url-encode special characters (* -> %2A)
+var (wildcardPartitions, fullPartitions) = partitions.partition(partition 
=> partition.matches(".*(\\*|%2A).*"))
 
 if (wildcardPartitions.nonEmpty) {
   //get list of all partitions
@@ -621,7 +622,8 @@ object HoodieSparkSqlWriter {
 //prevent that from happening. Any text inbetween \\Q and \\E is 
considered literal
 //So we start the string with \\Q and end with \\E and then whenever 
we find a * we add \\E before
 //and \\Q after so all other characters besides .* will be enclosed 
between a set of \\Q \\E
-val regexPartition = "^\\Q" + partition.replace("*", "\\E.*\\Q") + 
"\\E$"
+val wildcardToken: String = if (partition.contains("*")) "*" else "%2A"
+val regexPartition = "^\\Q" + partition.replace(wildcardToken, 
"\\E.*\\Q") + "\\E$"
 
 //filter all partitions with the regex and append the result to the 
list of full partitions
 fullPartitions = 
List.concat(fullPartitions,allPartitions.filter(_.matches(regexPartition)))
diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestAlterTableDropPartition.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestAlterTableDropPartition.scala
index 2261e83f7f9..b421732d270 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestAlterTableDropPartition.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestAlterTableDropPartition.scala
@@ -620,4 +620,40 @@ class TestAlterTableDropPartition extends 
HoodieSparkSqlTestBase {
   checkExceptionContain(s"ALTER TABLE $tableName DROP 
PARTITION($partition)")(errMsg)
 }
   }
+
+  test("Test drop partition with wildcards") {
+withRecordType()(withTempDir { tmp =>
+  Seq("cow", "mor").foreach { tableType =>
+val tableName = generateTableName
+spark.sql(
+  s"""
+ |create table $tableName (
+ |  id int,
+ |  name string,
+ |  price double,
+ |  ts long,
+ |  partition_date_col string
+ |) using hudi
+ | location '${tmp.getCanonicalPath}/$tableName'
+ | tblproperties (
+ |  primaryKey ='id',
+ |  type = '$tableType',
+ |  preCombineField = 'ts'
+ | ) partitioned by (partition_date_col)
+ """.stripMargin)
+spark.sql(s"insert into $tableName values " +
+  s"(1, 'a1', 10, 1000, '2023-08-01'), (2, 'a2', 10, 1000, 
'2023-08-02'), (3, 'a3', 10, 1000, '2023-09-01')")
+checkAnswer(s"show partitions $tableName")(
+  Seq("partition_date_col=2023-08-01"),
+  Seq("partition_date_col=2023-08-02"),
+  Seq("partition_date_col=2023-09-01")
+)
+spark.sql(s"alter table $tableName drop 
partition(partition_date_col='2023-08-*')")
+// show partitions will still return all partitions for tests, use 
select distinct as a stop-gap
+checkAnswer(s"select distinct partition_date_col from $tableName")(
+  Seq("2023-09-01")
+)
+  }
+})
+  }
 }
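
The escaping described in the code comments above (wrap everything in \Q...\E so that only the wildcard token expands to '.*') is easy to exercise in isolation. A small self-contained sketch mirroring that logic; the helper name and sample partition values are made up:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PartitionWildcardSketch {
  static List<String> resolve(String spec, List<String> allPartitions) {
    // Spark SQL may URL-encode '*' as '%2A', so either token can appear in the spec.
    String wildcardToken = spec.contains("*") ? "*" : "%2A";
    // Everything between \Q and \E is treated as a literal; only the wildcard becomes '.*'.
    String regex = "^\\Q" + spec.replace(wildcardToken, "\\E.*\\Q") + "\\E$";
    return allPartitions.stream().filter(p -> p.matches(regex)).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> allPartitions = Arrays.asList("2023-08-01", "2023-08-02", "2023-09-01");
    System.out.println(resolve("2023-08-*", allPartitions));   // [2023-08-01, 2023-08-02]
    System.out.println(resolve("2023-08-%2A", allPartitions)); // [2023-08-01, 2023-08-02]
  }
}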



[hudi] 04/37: [MINOR] Update operator name for compact test class (#9583)

2023-09-13 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 15ecee9674ec734cd54bd4ef8198ba3690cef1ee
Author: hehuiyuan <471627...@qq.com>
AuthorDate: Fri Sep 1 09:42:36 2023 +0800

[MINOR] Update operator name for compact test class (#9583)
---
 .../org/apache/hudi/sink/cluster/ITTestHoodieFlinkClustering.java | 4 ++--
 .../org/apache/hudi/sink/compact/ITTestHoodieFlinkCompactor.java  | 8 
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/cluster/ITTestHoodieFlinkClustering.java
 
b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/cluster/ITTestHoodieFlinkClustering.java
index 18a8aebb8fd..4c817a7927a 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/cluster/ITTestHoodieFlinkClustering.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/cluster/ITTestHoodieFlinkClustering.java
@@ -410,8 +410,8 @@ public class ITTestHoodieFlinkClustering {
 // keep pending clustering, not committing clustering
 dataStream
 .addSink(new DiscardingSink<>())
-.name("clustering_commit")
-.uid("uid_clustering_commit")
+.name("discarding-sink")
+.uid("uid_discarding-sink")
 .setParallelism(1);
 
 env.execute("flink_hudi_clustering");
diff --git 
a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/compact/ITTestHoodieFlinkCompactor.java
 
b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/compact/ITTestHoodieFlinkCompactor.java
index b032ad46765..ac2d93a7305 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/compact/ITTestHoodieFlinkCompactor.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/compact/ITTestHoodieFlinkCompactor.java
@@ -175,8 +175,8 @@ public class ITTestHoodieFlinkCompactor {
 new CompactOperator(conf))
 .setParallelism(FlinkMiniCluster.DEFAULT_PARALLELISM)
 .addSink(new CompactionCommitSink(conf))
-.name("clean_commits")
-.uid("uid_clean_commits")
+.name("compaction_commit")
+.uid("uid_compaction_commit")
 .setParallelism(1);
 
 env.execute("flink_hudi_compaction");
@@ -256,8 +256,8 @@ public class ITTestHoodieFlinkCompactor {
 new CompactOperator(conf))
 .setParallelism(FlinkMiniCluster.DEFAULT_PARALLELISM)
 .addSink(new CompactionCommitSink(conf))
-.name("clean_commits")
-.uid("uid_clean_commits")
+.name("compaction_commit")
+.uid("uid_compaction_commit")
 .setParallelism(1);
 
 env.execute("flink_hudi_compaction");



[hudi] 21/30: [MINOR] Add detail exception when instant transition state (#9476)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 5e3bf05b282b80227de167bfcd7dd1126c42c374
Author: hehuiyuan <471627...@qq.com>
AuthorDate: Mon Aug 28 09:38:01 2023 +0800

[MINOR] Add detail exception when instant transition state (#9476)
---
 .../org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
index dbfe484531a..1a36bb15d57 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
@@ -599,7 +599,7 @@ public class HoodieActiveTimeline extends 
HoodieDefaultTimeline {
 
   protected void transitionState(HoodieInstant fromInstant, HoodieInstant 
toInstant, Option data,
boolean allowRedundantTransitions) {
-
ValidationUtils.checkArgument(fromInstant.getTimestamp().equals(toInstant.getTimestamp()));
+
ValidationUtils.checkArgument(fromInstant.getTimestamp().equals(toInstant.getTimestamp()),
 String.format("%s and %s are not consistent when transition state.", 
fromInstant, toInstant));
 try {
   if (metaClient.getTimelineLayoutVersion().isNullVersion()) {
 // Re-create the .inflight file by opening a new file and write the 
commit metadata in



[hudi] 26/30: [HUDI-6758] Detecting and skipping Spurious log blocks with MOR reads (#9545)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit eed034b5c82053f3bb0ca23621883f68bec8
Author: Sivabalan Narayanan 
AuthorDate: Tue Aug 29 21:33:27 2023 -0400

[HUDI-6758] Detecting and skipping Spurious log blocks with MOR reads 
(#9545)

- Detect and skip duplicate log blocks caused by task retries.
- Detection is based on a block sequence number that keeps increasing
monotonically during rollover (a simplified sketch follows).
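
A much-simplified sketch of the idea; the header format "attemptNumber,blockSequenceNumber" is taken from the hunk below, while which attempt's copy the real reader prefers, and how it groups blocks per commit, is glossed over here:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SpuriousBlockSketch {
  // Each header value mimics BLOCK_SEQUENCE_NUMBER: "attemptNumber,blockSequenceNumber".
  static List<String> skipSpurious(List<String> headers) {
    Set<Integer> seenSequenceNumbers = new HashSet<>();
    List<String> kept = new ArrayList<>();
    for (String header : headers) {
      int sequenceNumber = Integer.parseInt(header.split(",")[1]);
      // A block whose sequence number was already read came from a retried task; skip it.
      if (seenSequenceNumbers.add(sequenceNumber)) {
        kept.add(header);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // A retried task (attempt 1) re-writes blocks 0 and 1 before appending block 2.
    List<String> headers = Arrays.asList("0,0", "0,1", "1,0", "1,1", "1,2");
    System.out.println(skipSpurious(headers)); // [0,0, 0,1, 1,2]
  }
}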
---
 .../org/apache/hudi/io/HoodieAppendHandle.java |  14 +-
 .../table/log/AbstractHoodieLogRecordReader.java   | 169 ++---
 .../common/table/log/block/HoodieLogBlock.java |   2 +-
 .../common/functional/TestHoodieLogFormat.java | 143 +++--
 4 files changed, 295 insertions(+), 33 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
index d0819aa8007..65f79c5147e 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
@@ -129,6 +129,9 @@ public class HoodieAppendHandle extends 
HoodieWriteHandle extends 
HoodieWriteHandle 0) {
-blocks.add(new HoodieDeleteBlock(recordsToDelete.toArray(new 
DeleteRecord[0]), header));
+blocks.add(new HoodieDeleteBlock(recordsToDelete.toArray(new 
DeleteRecord[0]), getUpdatedHeader(header, blockSequenceNumber++, 
taskContextSupplier.getAttemptIdSupplier().get(;
   }
 
   if (blocks.size() > 0) {
@@ -632,6 +635,13 @@ public class HoodieAppendHandle extends 
HoodieWriteHandle 
getUpdatedHeader(Map header, int 
blockSequenceNumber, long attemptNumber) {
+Map updatedHeader = new HashMap<>();
+updatedHeader.putAll(header);
+updatedHeader.put(HeaderMetadataType.BLOCK_SEQUENCE_NUMBER, 
String.valueOf(attemptNumber) + "," + String.valueOf(blockSequenceNumber));
+return updatedHeader;
+  }
+
   private static HoodieLogBlock getBlock(HoodieWriteConfig writeConfig,
  HoodieLogBlock.HoodieLogBlockType 
logDataBlockFormat,
  List records,
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java
index 7b1e737610b..94bd68e62c4 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java
@@ -34,6 +34,7 @@ import org.apache.hudi.common.table.log.block.HoodieLogBlock;
 import org.apache.hudi.common.table.timeline.HoodieTimeline;
 import org.apache.hudi.common.util.InternalSchemaCache;
 import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.common.util.collection.ClosableIterator;
 import org.apache.hudi.common.util.collection.CloseableMappingIterator;
 import org.apache.hudi.common.util.collection.Pair;
@@ -65,6 +66,7 @@ import java.util.function.Function;
 import java.util.stream.Collectors;
 
 import static 
org.apache.hudi.common.table.log.block.HoodieCommandBlock.HoodieCommandBlockTypeEnum.ROLLBACK_BLOCK;
+import static 
org.apache.hudi.common.table.log.block.HoodieLogBlock.HeaderMetadataType.BLOCK_SEQUENCE_NUMBER;
 import static 
org.apache.hudi.common.table.log.block.HoodieLogBlock.HeaderMetadataType.COMPACTED_BLOCK_TIMES;
 import static 
org.apache.hudi.common.table.log.block.HoodieLogBlock.HeaderMetadataType.INSTANT_TIME;
 import static 
org.apache.hudi.common.table.log.block.HoodieLogBlock.HeaderMetadataType.TARGET_INSTANT_TIME;
@@ -108,8 +110,6 @@ public abstract class AbstractHoodieLogRecordReader {
   private final TypedProperties payloadProps;
   // Log File Paths
   protected final List logFilePaths;
-  // Read Lazily flag
-  private final boolean readBlocksLazily;
   // Reverse reader - Not implemented yet (NA -> Why do we need ?)
   // but present here for plumbing for future implementation
   private final boolean reverseReader;
@@ -174,7 +174,6 @@ public abstract class AbstractHoodieLogRecordReader {
 this.totalLogFiles.addAndGet(logFilePaths.size());
 this.logFilePaths = logFilePaths;
 this.reverseReader = reverseReader;
-this.readBlocksLazily = readBlocksLazily;
 this.fs = fs;
 this.bufferSize = bufferSize;
 this.instantRange = instantRange;
@@ -224,6 +223,9 @@ public abstract class AbstractHoodieLogRecordReader {
 
   private void scanInternalV1(Option keySpecOpt) {
 currentInstantLogBlocks = new ArrayDeque<>();
+List validLogBlockInstants =

[hudi] 16/30: [HUDI-6735] Adding support for snapshotLoadQuerySplitter for incremental sources. (#9501)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 1c16d60fef94bfd82790d9c1d2ba82e25def9a52
Author: harshal 
AuthorDate: Thu Aug 24 22:23:58 2023 +0530

[HUDI-6735] Adding support for snapshotLoadQuerySplitter for incremental 
sources. (#9501)

A snapshot load scan of a historical table (with the majority of its data in
the archived timeline) results in one very large batch. This adds an interface
for breaking the snapshot load query into smaller batches, each of which can
use a commit id as its checkpoint (a sketch follows below).

-

Co-authored-by: Sagar Sumit 
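
The batching idea can be illustrated without any Hudi classes: a splitter picks the next checkpoint (a commit time) so that each snapshot batch stays bounded, and the source then applies an incremental-style filter on _hoodie_commit_time up to that checkpoint. A toy sketch with made-up commit times and a made-up batch-size knob:

import java.util.Arrays;
import java.util.List;

public class SnapshotBatchingSketch {
  public static void main(String[] args) {
    // Commit times present in the snapshot (normally read from _hoodie_commit_time).
    List<String> commitTimes = Arrays.asList(
        "20230801", "20230802", "20230803", "20230804", "20230805");
    String lastCheckpoint = "20230801";
    int maxCommitsPerBatch = 2; // batch-size knob a splitter implementation might expose

    // The splitter's job: pick the next checkpoint so that the filter
    // (_hoodie_commit_time > lastCheckpoint AND <= nextCheckpoint) bounds the batch.
    String nextCheckpoint = commitTimes.stream()
        .filter(t -> t.compareTo(lastCheckpoint) > 0)
        .limit(maxCommitsPerBatch)
        .reduce((a, b) -> b)
        .orElse(lastCheckpoint);
    System.out.println(nextCheckpoint); // 20230803
  }
}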
---
 .../hudi/utilities/sources/HoodieIncrSource.java   | 17 -
 .../sources/SnapshotLoadQuerySplitter.java | 78 ++
 .../hudi/utilities/sources/helpers/QueryInfo.java  | 12 
 .../utilities/sources/TestHoodieIncrSource.java| 22 +-
 .../helpers/TestSnapshotQuerySplitterImpl.java | 51 ++
 5 files changed, 174 insertions(+), 6 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java
index 0141f5ad458..fa316cf806f 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java
@@ -23,6 +23,7 @@ import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.model.HoodieRecord;
 import 
org.apache.hudi.common.table.timeline.TimelineUtils.HollowCommitHandling;
 import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
 import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.utilities.config.HoodieIncrSourceConfig;
 import org.apache.hudi.utilities.schema.SchemaProvider;
@@ -50,12 +51,14 @@ import static 
org.apache.hudi.common.util.ConfigUtils.getBooleanWithAltKeys;
 import static org.apache.hudi.common.util.ConfigUtils.getIntWithAltKeys;
 import static org.apache.hudi.common.util.ConfigUtils.getStringWithAltKeys;
 import static org.apache.hudi.utilities.UtilHelpers.createRecordMerger;
+import static 
org.apache.hudi.utilities.sources.SnapshotLoadQuerySplitter.Config.SNAPSHOT_LOAD_QUERY_SPLITTER_CLASS_NAME;
 import static 
org.apache.hudi.utilities.sources.helpers.IncrSourceHelper.generateQueryInfo;
 import static 
org.apache.hudi.utilities.sources.helpers.IncrSourceHelper.getHollowCommitHandleMode;
 
 public class HoodieIncrSource extends RowSource {
 
   private static final Logger LOG = 
LoggerFactory.getLogger(HoodieIncrSource.class);
+  private final Option snapshotLoadQuerySplitter;
 
   public static class Config {
 
@@ -128,6 +131,10 @@ public class HoodieIncrSource extends RowSource {
   public HoodieIncrSource(TypedProperties props, JavaSparkContext 
sparkContext, SparkSession sparkSession,
   SchemaProvider schemaProvider) {
 super(props, sparkContext, sparkSession, schemaProvider);
+
+this.snapshotLoadQuerySplitter = 
Option.ofNullable(props.getString(SNAPSHOT_LOAD_QUERY_SPLITTER_CLASS_NAME, 
null))
+.map(className -> (SnapshotLoadQuerySplitter) 
ReflectionUtils.loadClass(className,
+new Class[] {TypedProperties.class}, props));
   }
 
   @Override
@@ -184,9 +191,13 @@ public class HoodieIncrSource extends RowSource {
   .load(srcPath);
 } else {
   // if checkpoint is missing from source table, and if strategy is set to 
READ_UPTO_LATEST_COMMIT, we have to issue snapshot query
-  source = sparkSession.read().format("org.apache.hudi")
-  .option(QUERY_TYPE().key(), 
DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL())
-  .load(srcPath)
+  Dataset snapshot = sparkSession.read().format("org.apache.hudi")
+  .option(DataSourceReadOptions.QUERY_TYPE().key(), 
DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL())
+  .load(srcPath);
+  if (snapshotLoadQuerySplitter.isPresent()) {
+queryInfo = 
snapshotLoadQuerySplitter.get().getNextCheckpoint(snapshot, queryInfo);
+  }
+  source = snapshot
   // add filtering so that only interested records are returned.
   .filter(String.format("%s > '%s'", 
HoodieRecord.COMMIT_TIME_METADATA_FIELD,
   queryInfo.getStartInstant()))
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/SnapshotLoadQuerySplitter.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/SnapshotLoadQuerySplitter.java
new file mode 100644
index 000..6a13607b1d5
--- /dev/null
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/SnapshotLoadQuerySplitter.java
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor lice

[hudi] 20/30: [MINOR] Add write operation in alter schema commit metadata (#9509)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit f4b139a0556a100e55d8e959d7230aad1b382835
Author: Zouxxyy 
AuthorDate: Mon Aug 28 09:25:22 2023 +0800

[MINOR] Add write operation in alter schema commit metadata (#9509)
---
 .../org/apache/spark/sql/hudi/command/Spark30AlterTableCommand.scala | 1 +
 .../org/apache/spark/sql/hudi/command/Spark31AlterTableCommand.scala | 1 +
 .../main/scala/org/apache/spark/sql/hudi/command/AlterTableCommand.scala | 1 +
 3 files changed, 3 insertions(+)

diff --git 
a/hudi-spark-datasource/hudi-spark3.0.x/src/main/scala/org/apache/spark/sql/hudi/command/Spark30AlterTableCommand.scala
 
b/hudi-spark-datasource/hudi-spark3.0.x/src/main/scala/org/apache/spark/sql/hudi/command/Spark30AlterTableCommand.scala
index 22aea4c53e2..13bb66fb74a 100644
--- 
a/hudi-spark-datasource/hudi-spark3.0.x/src/main/scala/org/apache/spark/sql/hudi/command/Spark30AlterTableCommand.scala
+++ 
b/hudi-spark-datasource/hudi-spark3.0.x/src/main/scala/org/apache/spark/sql/hudi/command/Spark30AlterTableCommand.scala
@@ -227,6 +227,7 @@ object Spark30AlterTableCommand extends Logging {
 val commitActionType = 
CommitUtils.getCommitActionType(WriteOperationType.ALTER_SCHEMA, 
metaClient.getTableType)
 val instantTime = HoodieActiveTimeline.createNewInstantTime
 client.startCommitWithTime(instantTime, commitActionType)
+client.setOperationType(WriteOperationType.ALTER_SCHEMA)
 
 val hoodieTable = HoodieSparkTable.create(client.getConfig, 
client.getEngineContext)
 val timeLine = hoodieTable.getActiveTimeline
diff --git 
a/hudi-spark-datasource/hudi-spark3.1.x/src/main/scala/org/apache/spark/sql/hudi/command/Spark31AlterTableCommand.scala
 
b/hudi-spark-datasource/hudi-spark3.1.x/src/main/scala/org/apache/spark/sql/hudi/command/Spark31AlterTableCommand.scala
index a24a5d6b189..52bbe7a5ce7 100644
--- 
a/hudi-spark-datasource/hudi-spark3.1.x/src/main/scala/org/apache/spark/sql/hudi/command/Spark31AlterTableCommand.scala
+++ 
b/hudi-spark-datasource/hudi-spark3.1.x/src/main/scala/org/apache/spark/sql/hudi/command/Spark31AlterTableCommand.scala
@@ -227,6 +227,7 @@ object Spark31AlterTableCommand extends Logging {
 val commitActionType = 
CommitUtils.getCommitActionType(WriteOperationType.ALTER_SCHEMA, 
metaClient.getTableType)
 val instantTime = HoodieActiveTimeline.createNewInstantTime
 client.startCommitWithTime(instantTime, commitActionType)
+client.setOperationType(WriteOperationType.ALTER_SCHEMA)
 
 val hoodieTable = HoodieSparkTable.create(client.getConfig, 
client.getEngineContext)
 val timeLine = hoodieTable.getActiveTimeline
diff --git 
a/hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/hudi/command/AlterTableCommand.scala
 
b/hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/hudi/command/AlterTableCommand.scala
index 78972cf239d..b9cd0a2bdbc 100644
--- 
a/hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/hudi/command/AlterTableCommand.scala
+++ 
b/hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/hudi/command/AlterTableCommand.scala
@@ -262,6 +262,7 @@ object AlterTableCommand extends Logging {
 val commitActionType = 
CommitUtils.getCommitActionType(WriteOperationType.ALTER_SCHEMA, 
metaClient.getTableType)
 val instantTime = HoodieActiveTimeline.createNewInstantTime
 client.startCommitWithTime(instantTime, commitActionType)
+client.setOperationType(WriteOperationType.ALTER_SCHEMA)
 
 val hoodieTable = HoodieSparkTable.create(client.getConfig, 
client.getEngineContext)
 val timeLine = hoodieTable.getActiveTimeline



[hudi] 13/30: [HUDI-6621] Fix downgrade handler for 0.14.0 (#9467)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 0b4c95cdad01a062fc8852a61c05faefb230d3d1
Author: Lokesh Jain 
AuthorDate: Wed Aug 23 18:39:08 2023 +0530

[HUDI-6621] Fix downgrade handler for 0.14.0 (#9467)

- Since the log block version has been upgraded in 0.14.0 (due to the delete
block change), the delete blocks cannot be read in 0.13.0 or earlier.
- Similarly, the addition of the record level index field in the metadata table
leads to a column drop error on downgrade. The Jira fixes the downgrade handler
to trigger a full compaction and delete the metadata table when a user
downgrades from table version six (0.14.0) to version five (0.13.0).
---
 .../table/upgrade/SixToFiveDowngradeHandler.java   |  53 ++--
 .../table/upgrade/SupportsUpgradeDowngrade.java|   3 +
 .../table/upgrade/FlinkUpgradeDowngradeHelper.java |   7 +
 .../table/upgrade/JavaUpgradeDowngradeHelper.java  |   7 +
 .../table/upgrade/SparkUpgradeDowngradeHelper.java |   7 +
 .../hudi/table/upgrade/TestUpgradeDowngrade.java   |  10 +-
 .../functional/TestSixToFiveDowngradeHandler.scala | 142 +
 7 files changed, 211 insertions(+), 18 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SixToFiveDowngradeHandler.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SixToFiveDowngradeHandler.java
index 228c0f710a8..4793f368f81 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SixToFiveDowngradeHandler.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SixToFiveDowngradeHandler.java
@@ -18,19 +18,26 @@
 
 package org.apache.hudi.table.upgrade;
 
+import org.apache.hudi.client.BaseHoodieWriteClient;
 import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
 import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.model.HoodieTableType;
 import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
-import org.apache.hudi.common.table.HoodieTableVersion;
 import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
 import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator;
 import org.apache.hudi.common.table.timeline.HoodieTimeline;
 import org.apache.hudi.common.util.FileIOUtils;
 import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieCompactionConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.metadata.MetadataPartitionType;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.metadata.HoodieTableMetadataUtil;
 import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.table.action.compact.CompactionTriggerStrategy;
+import 
org.apache.hudi.table.action.compact.strategy.UnBoundedCompactionStrategy;
 
 import org.apache.hadoop.fs.Path;
 
@@ -39,12 +46,15 @@ import java.util.Map;
 
 import static 
org.apache.hudi.common.table.HoodieTableConfig.TABLE_METADATA_PARTITIONS;
 import static 
org.apache.hudi.common.table.HoodieTableConfig.TABLE_METADATA_PARTITIONS_INFLIGHT;
-import static 
org.apache.hudi.metadata.HoodieTableMetadataUtil.deleteMetadataTablePartition;
 
 /**
  * Downgrade handle to assist in downgrading hoodie table from version 6 to 5.
  * To ensure compatibility, we need recreate the compaction requested file to
  * .aux folder.
+ * Since version 6 includes a new schema field for metadata table(MDT),
+ * the MDT needs to be deleted during downgrade to avoid column drop error.
+ * Also log block version was upgraded in version 6, therefore full compaction 
needs
+ * to be completed during downgrade to avoid both read and future compaction 
failures.
  */
 public class SixToFiveDowngradeHandler implements DowngradeHandler {
 
@@ -52,11 +62,16 @@ public class SixToFiveDowngradeHandler implements 
DowngradeHandler {
   public Map downgrade(HoodieWriteConfig config, 
HoodieEngineContext context, String instantTime, SupportsUpgradeDowngrade 
upgradeDowngradeHelper) {
 final HoodieTable table = upgradeDowngradeHelper.getTable(config, context);
 
-removeRecordIndexIfNeeded(table, context);
+// Since version 6 includes a new schema field for metadata table(MDT), 
the MDT needs to be deleted during downgrade to avoid column drop error.
+HoodieTableMetadataUtil.deleteMetadataTable(config.getBasePath(), context);
+// The log block version has been upgraded in version six so compaction is 
required for downgrade.
+runCompaction(table, context, config, upgradeDowngradeHelper);
+
 syncCompactionRequestedFileToAuxiliaryFolder(table);
 
+HoodieTableMetaClient metaClient = 
HoodieTableMetaClient.reload

[hudi] 30/30: [HUDI-6763] Optimize collect calls (#9561)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit d995bb8262cafa22253fa961557bbfcde6369dfb
Author: Tim Brown 
AuthorDate: Wed Aug 30 20:37:23 2023 -0500

[HUDI-6763] Optimize collect calls (#9561)
---
 .../table/action/commit/BaseSparkCommitActionExecutor.java | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java
 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java
index 7383f428e0a..040cc798747 100644
--- 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java
+++ 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java
@@ -286,7 +286,9 @@ public abstract class BaseSparkCommitActionExecutor 
extends
 
   @Override
   protected void 
setCommitMetadata(HoodieWriteMetadata> result) {
-
result.setCommitMetadata(Option.of(CommitUtils.buildMetadata(result.getWriteStatuses().map(WriteStatus::getStat).collectAsList(),
+List writeStats = 
result.getWriteStatuses().map(WriteStatus::getStat).collectAsList();
+result.setWriteStats(writeStats);
+result.setCommitMetadata(Option.of(CommitUtils.buildMetadata(writeStats,
 result.getPartitionToReplaceFileIds(),
 extraMetadata, operationType, getSchemaToStoreInCommit(), 
getCommitActionType(;
   }
@@ -294,16 +296,14 @@ public abstract class BaseSparkCommitActionExecutor 
extends
   @Override
   protected void commit(Option> extraMetadata, 
HoodieWriteMetadata> result) {
 context.setJobStatus(this.getClass().getSimpleName(), "Commit write status 
collect: " + config.getTableName());
-commit(extraMetadata, result, 
result.getWriteStatuses().map(WriteStatus::getStat).collectAsList());
-  }
-
-  protected void commit(Option> extraMetadata, 
HoodieWriteMetadata> result, List 
writeStats) {
 String actionType = getCommitActionType();
 LOG.info("Committing " + instantTime + ", action Type " + actionType + ", 
operation Type " + operationType);
 result.setCommitted(true);
-result.setWriteStats(writeStats);
+if (!result.getWriteStats().isPresent()) {
+  
result.setWriteStats(result.getWriteStatuses().map(WriteStatus::getStat).collectAsList());
+}
 // Finalize write
-finalizeWrite(instantTime, writeStats, result);
+finalizeWrite(instantTime, result.getWriteStats().get(), result);
 try {
   HoodieActiveTimeline activeTimeline = table.getActiveTimeline();
   HoodieCommitMetadata metadata = result.getCommitMetadata().get();



[hudi] 19/30: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly (#9422)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 256957a689e088dcb1b54ced68b742e3aa4221ae
Author: Jon Vexler 
AuthorDate: Sat Aug 26 14:01:02 2023 -0400

[HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups 
correctly (#9422)

- Create tests for MOR col stats index to ensure that filegroups are read 
as expected

Co-authored-by: Jonathan Vexler <=>
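
The property these tests pin down is that, with the column stats index, only file groups whose per-file min/max range can satisfy the predicate are ever opened (the corrupted-file trick in the test, described in its Javadoc further down, is just a way to detect an unwanted open). The pruning decision itself reduces to a range check; a self-contained sketch with made-up file names, reusing the test's 'UBERX'/'BLACK' trip types:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ColStatsPruningSketch {
  // Per-file column statistics -- the essence of what the column stats index stores.
  static final class FileStats {
    final String fileName;
    final String min;
    final String max;
    FileStats(String fileName, String min, String max) {
      this.fileName = fileName;
      this.min = min;
      this.max = max;
    }
  }

  public static void main(String[] args) {
    List<FileStats> files = Arrays.asList(
        new FileStats("fg1-base.parquet", "BLACK", "POOL"),
        new FileStats("fg2-base.parquet", "SUV", "UBERX"));
    String value = "UBERX"; // predicate: trip_type = 'UBERX'
    // Keep only files whose [min, max] range can contain the value; the rest are skipped unread.
    List<String> filesToRead = files.stream()
        .filter(f -> f.min.compareTo(value) <= 0 && f.max.compareTo(value) >= 0)
        .map(f -> f.fileName)
        .collect(Collectors.toList());
    System.out.println(filesToRead); // [fg2-base.parquet]
  }
}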
---
 .../TestDataSkippingWithMORColstats.java   | 483 +
 1 file changed, 483 insertions(+)

diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestDataSkippingWithMORColstats.java
 
b/hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestDataSkippingWithMORColstats.java
new file mode 100644
index 000..64d6c31c2fa
--- /dev/null
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestDataSkippingWithMORColstats.java
@@ -0,0 +1,483 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.functional;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.client.SparkRDDWriteClient;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieCompactionConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.testutils.HoodieSparkClientTestBase;
+
+import org.apache.spark.SparkException;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static 
org.apache.hudi.common.testutils.RawTripTestPayload.recordToString;
+import static 
org.apache.hudi.config.HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS;
+import static org.apache.spark.sql.SaveMode.Append;
+import static org.apache.spark.sql.SaveMode.Overwrite;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+
+/**
+ * Test mor with colstats enabled in scenarios to ensure that files
+ * are being appropriately read or not read.
+ * The strategy employed is to corrupt targeted base files. If we want
+ * to prove the file is read, we assert that an exception will be thrown.
+ * If we want to prove the file is not read, we expect the read to
+ * successfully execute.
+ */
+public class TestDataSkippingWithMORColstats extends HoodieSparkClientTestBase 
{
+
+  private static String matchCond = "trip_type = 'UBERX'";
+  private static String nonMatchCond = "trip_type = 'BLACK'";
+  private static String[] dropColumns = {"_hoodie_commit_time", 
"_hoodie_commit_seqno",
+  "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name"};
+
+  private Boolean shouldOverwrite;
+  Map options;
+  @TempDir
+  public java.nio.file.Path basePath;
+
+  @BeforeEach
+  public void setUp() throws Exception {
+initSparkContexts();
+dataGen = new HoodieTestDataGenerator();
+shouldOverwrite = true;
+options = getOptions();
+Properties props = new Properties();
+props.putAll(options);
+try {
+  metaClient = HoodieTableMetaClient.initTableAndGetMetaClient(hadoopConf, 

[hudi] 24/30: [HUDI-6726] Fix connection leaks related to file reader and iterator close (#9539)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 2009b0f44660f1d1753685a3ea64494d591aebf2
Author: Rajesh Mahindra <76502047+rmahindra...@users.noreply.github.com>
AuthorDate: Mon Aug 28 23:56:52 2023 -0700

[HUDI-6726] Fix connection leaks related to file reader and iterator close 
(#9539)



-

Co-authored-by: rmahindra123 
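
The leak addressed here comes from closing only one of two readers on the non-executor path. As a general illustration (using java.io.Closeable stand-ins rather than Hudi's reader types), try-with-resources gives the same "both readers always released" guarantee that the added close() call restores:

import java.io.Closeable;
import java.io.IOException;

public class CloseBothReadersSketch {
  public static void main(String[] args) throws IOException {
    Closeable baseFileReader = () -> System.out.println("base file reader closed");
    Closeable bootstrapFileReader = () -> System.out.println("bootstrap file reader closed");

    // try-with-resources closes both readers in reverse declaration order, even if
    // one close throws, which is the guarantee the manual close calls re-establish.
    try (Closeable base = baseFileReader; Closeable bootstrap = bootstrapFileReader) {
      System.out.println("merge records here");
    }
  }
}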
---
 .../table/action/commit/HoodieMergeHelper.java |   5 +-
 .../io/storage/TestHoodieHFileReaderWriter.java|  10 +-
 .../bootstrap/index/HFileBootstrapIndex.java   |   8 +-
 .../hudi/common/table/TableSchemaResolver.java |   5 +-
 .../table/log/block/HoodieHFileDataBlock.java  |  23 +--
 .../hudi/common/util/queue/SimpleExecutor.java |   6 +-
 .../hudi/io/storage/HoodieAvroHFileReader.java | 173 +++--
 .../apache/hudi/io/storage/HoodieHFileUtils.java   |  24 ++-
 .../hudi/metadata/HoodieBackedTableMetadata.java   |   4 +-
 .../hudi/hadoop/HoodieHFileRecordReader.java   |   8 +-
 10 files changed, 185 insertions(+), 81 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java
index 4df767b5e41..c1523d564e4 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java
@@ -123,7 +123,7 @@ public class HoodieMergeHelper extends BaseMergeHelper {
 // In case writer's schema is simply a projection of the reader's one 
we can read
 // the records in the projected schema directly
 recordSchema = isPureProjection ? writerSchema : readerSchema;
-recordIterator = baseFileReader.getRecordIterator(recordSchema);
+recordIterator = (ClosableIterator) 
baseFileReader.getRecordIterator(recordSchema);
   }
 
   boolean isBufferingRecords = 
ExecutorFactory.isBufferingRecords(writeConfig);
@@ -155,6 +155,9 @@ public class HoodieMergeHelper extends BaseMergeHelper {
 executor.awaitTermination();
   } else {
 baseFileReader.close();
+if (bootstrapFileReader != null) {
+  bootstrapFileReader.close();
+}
 mergeHandle.close();
   }
 }
diff --git 
a/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/io/storage/TestHoodieHFileReaderWriter.java
 
b/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/io/storage/TestHoodieHFileReaderWriter.java
index 90ad0fe1a74..0d2eefa0863 100644
--- 
a/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/io/storage/TestHoodieHFileReaderWriter.java
+++ 
b/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/io/storage/TestHoodieHFileReaderWriter.java
@@ -214,8 +214,9 @@ public class TestHoodieHFileReaderWriter extends 
TestHoodieReaderWriterBase {
 byte[] content = FileIOUtils.readAsByteArray(
 fs.open(getFilePath()), (int) 
fs.getFileStatus(getFilePath()).getLen());
 // Reading byte array in HFile format, without actual file path
+Configuration hadoopConf = fs.getConf();
 HoodieAvroHFileReader hfileReader =
-new HoodieAvroHFileReader(fs, new Path(DUMMY_BASE_PATH), content, 
Option.empty());
+new HoodieAvroHFileReader(hadoopConf, new Path(DUMMY_BASE_PATH), new 
CacheConfig(hadoopConf), fs, content, Option.empty());
 Schema avroSchema = 
getSchemaFromResource(TestHoodieReaderWriterBase.class, "/exampleSchema.avsc");
 assertEquals(NUM_RECORDS, hfileReader.getTotalRecords());
 verifySimpleRecords(hfileReader.getRecordIterator(avroSchema));
@@ -420,8 +421,10 @@ public class TestHoodieHFileReaderWriter extends 
TestHoodieReaderWriterBase {
 verifyHFileReader(
 HoodieHFileUtils.createHFileReader(fs, new Path(DUMMY_BASE_PATH), 
content),
 hfilePrefix, true, HFILE_COMPARATOR.getClass(), NUM_RECORDS_FIXTURE);
+
+Configuration hadoopConf = fs.getConf();
 HoodieAvroHFileReader hfileReader =
-new HoodieAvroHFileReader(fs, new Path(DUMMY_BASE_PATH), content, 
Option.empty());
+new HoodieAvroHFileReader(hadoopConf, new Path(DUMMY_BASE_PATH), new 
CacheConfig(hadoopConf), fs, content, Option.empty());
 Schema avroSchema = 
getSchemaFromResource(TestHoodieReaderWriterBase.class, "/exampleSchema.avsc");
 assertEquals(NUM_RECORDS_FIXTURE, hfileReader.getTotalRecords());
 verifySimpleRecords(hfileReader.getRecordIterator(avroSchema));
@@ -429,7 +432,8 @@ public class TestHoodieHFileReaderWriter extends 
TestHoodieReaderWriterBase {
 content = readHFileFromResources(complexHFile);
 verifyHFileReader(HoodieHFileUtils.createHFileReader(fs, new 
Path

[hudi] 05/30: [HUDI-6156] Prevent leaving tmp file in timeline, delete tmp file when rename throw exception (#9483)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 0ea1f1b68cbc16138637460f1557de2b9cf6c360
Author: Bingeng Huang <304979...@qq.com>
AuthorDate: Mon Aug 21 19:40:11 2023 +0800

[HUDI-6156] Prevent leaving tmp file in timeline, delete tmp file when 
rename throw exception (#9483)

Co-authored-by: hbg 
---
 .../org/apache/hudi/common/fs/HoodieWrapperFileSystem.java | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/fs/HoodieWrapperFileSystem.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/fs/HoodieWrapperFileSystem.java
index ecba8eff8b5..0789ef4e27f 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/fs/HoodieWrapperFileSystem.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/fs/HoodieWrapperFileSystem.java
@@ -1051,16 +1051,22 @@ public class HoodieWrapperFileSystem extends FileSystem 
{
 throw new HoodieIOException(errorMsg, e);
   }
 
+  boolean renameSuccess = false;
   try {
 if (null != tmpPath) {
-  boolean renameSuccess = fileSystem.rename(tmpPath, fullPath);
-  if (!renameSuccess) {
+  renameSuccess = fileSystem.rename(tmpPath, fullPath);
+}
+  } catch (IOException e) {
+throw new HoodieIOException("Failed to rename " + tmpPath + " to the 
target " + fullPath, e);
+  } finally {
+if (!renameSuccess && null != tmpPath) {
+  try {
 fileSystem.delete(tmpPath, false);
 LOG.warn("Fail to rename " + tmpPath + " to " + fullPath + ", 
target file exists: " + fileSystem.exists(fullPath));
+  } catch (IOException e) {
+throw new HoodieIOException("Failed to delete tmp file " + 
tmpPath, e);
   }
 }
-  } catch (IOException e) {
-throw new HoodieIOException("Failed to rename " + tmpPath + " to the 
target " + fullPath, e);
   }
 }
   }
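
The pattern in the hunk above (attempt the rename; if it either returns false or throws, make sure the temporary file is deleted so nothing stray is left in the timeline) can be written generically with java.nio, independent of HoodieWrapperFileSystem:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class RenameWithCleanupSketch {
  static void publish(Path tmpPath, Path fullPath) throws IOException {
    boolean renameSuccess = false;
    try {
      Files.move(tmpPath, fullPath);
      renameSuccess = true;
    } finally {
      // Whether the move threw or simply never happened, do not leave the
      // temporary file lying around in the timeline directory.
      if (!renameSuccess) {
        Files.deleteIfExists(tmpPath);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("instant", ".tmp");
    Path target = tmp.resolveSibling(tmp.getFileName() + ".completed");
    publish(tmp, target);
    System.out.println(Files.exists(target)); // true
  }
}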



[hudi] 22/30: [HUDI-4631] Adding retries to spark datasource writes on conflict failures (#6854)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 3eb6de6d00b7f71faf74d37ce55f79c3b4e25d60
Author: Sivabalan Narayanan 
AuthorDate: Mon Aug 28 07:17:45 2023 -0400

[HUDI-4631] Adding retries to spark datasource writes on conflict failures 
(#6854)

Added retry functionality so that Spark datasource writes are automatically
retried in case of conflict failures.
The multi-writer user experience is improved by these automatic retries
(a usage sketch follows below).

-

Co-authored-by: Sagar Sumit 
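
A hedged usage example of the new knob: the retry-count key is quoted from the hunk below, while the table paths are placeholders and the other multi-writer option names are quoted from memory, so verify them against the release docs:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class ConflictRetryWriteSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("retry-sketch").master("local[1]").getOrCreate();
    Dataset<Row> df = spark.read().format("hudi").load("/tmp/source_table");

    df.write().format("hudi")
        .option("hoodie.table.name", "target_table")
        // Key added by this change: retry the whole batch a few times on write conflicts.
        .option("hoodie.write.num.retries.on.conflict.failures", "3")
        // Typical multi-writer settings (names assumed, double-check against the docs).
        .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
        .option("hoodie.write.lock.provider", "org.apache.hudi.client.transaction.lock.InProcessLockProvider")
        .mode(SaveMode.Append)
        .save("/tmp/target_table");
  }
}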
---
 .../org/apache/hudi/config/HoodieLockConfig.java   | 16 --
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  6 ++
 .../org/apache/hudi/HoodieSparkSqlWriter.scala | 40 +++--
 .../apache/hudi/functional/TestCOWDataSource.scala | 66 +-
 4 files changed, 116 insertions(+), 12 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java
index 1d5b09629e4..b24aecf46c1 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java
@@ -217,16 +217,24 @@ public class HoodieLockConfig extends HoodieConfig {
   .withDocumentation("Lock provider class name, this should be subclass of 
"
   + "org.apache.hudi.client.transaction.ConflictResolutionStrategy");
 
-  /** @deprecated Use {@link #WRITE_CONFLICT_RESOLUTION_STRATEGY_CLASS_NAME} 
and its methods instead */
+  /**
+   * @deprecated Use {@link #WRITE_CONFLICT_RESOLUTION_STRATEGY_CLASS_NAME} 
and its methods instead
+   */
   @Deprecated
   public static final String WRITE_CONFLICT_RESOLUTION_STRATEGY_CLASS_PROP = 
WRITE_CONFLICT_RESOLUTION_STRATEGY_CLASS_NAME.key();
-  /** @deprecated Use {@link #WRITE_CONFLICT_RESOLUTION_STRATEGY_CLASS_NAME} 
and its methods instead */
+  /**
+   * @deprecated Use {@link #WRITE_CONFLICT_RESOLUTION_STRATEGY_CLASS_NAME} 
and its methods instead
+   */
   @Deprecated
   public static final String DEFAULT_WRITE_CONFLICT_RESOLUTION_STRATEGY_CLASS 
= WRITE_CONFLICT_RESOLUTION_STRATEGY_CLASS_NAME.defaultValue();
-  /** @deprecated Use {@link #LOCK_PROVIDER_CLASS_NAME} and its methods 
instead */
+  /**
+   * @deprecated Use {@link #LOCK_PROVIDER_CLASS_NAME} and its methods instead
+   */
   @Deprecated
   public static final String LOCK_PROVIDER_CLASS_PROP = 
LOCK_PROVIDER_CLASS_NAME.key();
-  /** @deprecated Use {@link #LOCK_PROVIDER_CLASS_NAME} and its methods 
instead */
+  /**
+   * @deprecated Use {@link #LOCK_PROVIDER_CLASS_NAME} and its methods instead
+   */
   @Deprecated
   public static final String DEFAULT_LOCK_PROVIDER_CLASS = 
LOCK_PROVIDER_CLASS_NAME.defaultValue();
 
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
index ba94d80d674..01b8fa55948 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
@@ -558,6 +558,12 @@ public class HoodieWriteConfig extends HoodieConfig {
   .defaultValue(WriteConcurrencyMode.SINGLE_WRITER.name())
   .withDocumentation(WriteConcurrencyMode.class);
 
+  public static final ConfigProperty NUM_RETRIES_ON_CONFLICT_FAILURES 
= ConfigProperty
+  .key("hoodie.write.num.retries.on.conflict.failures")
+  .defaultValue(0)
+  .sinceVersion("0.13.0")
+  .withDocumentation("Maximum number of times to retry a batch on conflict 
failure.");
+
   public static final ConfigProperty WRITE_SCHEMA_OVERRIDE = 
ConfigProperty
   .key("hoodie.write.schema")
   .noDefaultValue()
diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
index e98d72d8284..57baba29c92 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
@@ -21,7 +21,7 @@ import org.apache.avro.Schema
 import org.apache.avro.generic.GenericData
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileSystem, Path}
-import org.apache.hudi.AutoRecordKeyGenerationUtils.{isAutoGenerateRecordKeys, 
mayBeValidateParamsForAutoGenerationOfRecordKeys}
+import 
org.apache.hudi.AutoRecordKeyGenerationUtils.mayBeValidateParamsForAutoGenerationOfRecordKeys
 import org.ap

[hudi] 23/30: [MINOR] Modify return type description (#9479)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit a4f542931c18cdfc76c627f426d14d21044adf98
Author: empcl <1515827...@qq.com>
AuthorDate: Tue Aug 29 13:17:56 2023 +0800

[MINOR] Modify return type description (#9479)
---
 .../java/org/apache/hudi/common/table/view/TableFileSystemView.java   | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/view/TableFileSystemView.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/view/TableFileSystemView.java
index db6e12cbda6..6fedb8684c9 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/view/TableFileSystemView.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/view/TableFileSystemView.java
@@ -171,14 +171,14 @@ public interface TableFileSystemView {
   /**
* Return Pending Compaction Operations.
*
-   * @return Pair>
+   * @return Stream>
*/
   Stream> getPendingCompactionOperations();
 
   /**
* Return Pending Compaction Operations.
*
-   * @return Pair>
+   * @return Stream>
*/
   Stream> 
getPendingLogCompactionOperations();
 



[hudi] 07/30: [HUDI-6733] Add flink-metrics-dropwizard to flink bundle (#9499)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 18f043185d9b5acf0e3c73838975cd7248c0
Author: StreamingFlames <18889897...@163.com>
AuthorDate: Tue Aug 22 11:40:18 2023 +0800

[HUDI-6733] Add flink-metrics-dropwizard to flink bundle (#9499)
---
 packaging/hudi-flink-bundle/pom.xml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/packaging/hudi-flink-bundle/pom.xml 
b/packaging/hudi-flink-bundle/pom.xml
index dba7b923aec..19d236fca89 100644
--- a/packaging/hudi-flink-bundle/pom.xml
+++ b/packaging/hudi-flink-bundle/pom.xml
@@ -136,6 +136,7 @@
   
org.apache.flink:${flink.hadoop.compatibility.artifactId}
   org.apache.flink:flink-json
   
org.apache.flink:${flink.parquet.artifactId}
+  org.apache.flink:flink-metrics-dropwizard
 
   org.apache.hive:hive-common
   org.apache.hive:hive-service



[hudi] 27/30: [MINOR] Fixing warn log with auto key gen (#9547)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 2aaf4027110d40b719a62c4bda74d9453f22f22f
Author: Sivabalan Narayanan 
AuthorDate: Wed Aug 30 11:48:48 2023 -0400

[MINOR] Fixing warn log with auto key gen (#9547)
---
 .../main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala  | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala
index 501c563a989..6c1b828f3be 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/AutoRecordKeyGenerationUtils.scala
@@ -48,10 +48,9 @@ object AutoRecordKeyGenerationUtils {
   if (!parameters.getOrElse(HoodieTableConfig.POPULATE_META_FIELDS.key(), 
HoodieTableConfig.POPULATE_META_FIELDS.defaultValue().toString).toBoolean) {
 throw new HoodieKeyGeneratorException("Disabling " + 
HoodieTableConfig.POPULATE_META_FIELDS.key() + " is not supported with auto 
generation of record keys")
   }
-}
-
-if (hoodieConfig.contains(PRECOMBINE_FIELD.key())) {
-  log.warn("Precombine field " + 
hoodieConfig.getString(PRECOMBINE_FIELD.key()) + " will be ignored with auto 
record key generation enabled")
+  if (hoodieConfig.contains(PRECOMBINE_FIELD.key())) {
+log.warn("Precombine field " + 
hoodieConfig.getString(PRECOMBINE_FIELD.key()) + " will be ignored with auto 
record key generation enabled")
+  }
 }
   }
 



[hudi] 17/30: [HUDI-6445] Triage ci flakiness and some test fies (#9534)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit a7690eca670f7c69884fa36770f931663cbb34fc
Author: Sivabalan Narayanan 
AuthorDate: Fri Aug 25 09:54:06 2023 -0400

[HUDI-6445] Triage ci flakiness and some test fies (#9534)

Fixed metrics in tests (disabled metrics).
Fixed Java tests to use the local FS instead of HDFS.
Removed some of the parametrized tests for Java.

-

Co-authored-by: Sagar Sumit 
---
 .../hudi/client/TestJavaHoodieBackedMetadata.java  |  16 +-
 .../TestHoodieJavaClientOnCopyOnWriteStorage.java  | 185 +
 .../testutils/HoodieJavaClientTestHarness.java | 140 
 .../hudi/testutils/TestHoodieMetadataBase.java |   2 +-
 .../functional/TestHoodieBackedMetadata.java   |  18 +-
 .../client/functional/TestHoodieMetadataBase.java  |   2 +-
 .../realtime/TestHoodieRealtimeRecordReader.java   |   7 +-
 .../apache/hudi/functional/TestBootstrapRead.java  |   2 +-
 .../functional/TestNewHoodieParquetFileFormat.java |   4 +-
 9 files changed, 174 insertions(+), 202 deletions(-)

diff --git 
a/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java
 
b/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java
index 7226563feaa..b22fa76788d 100644
--- 
a/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java
+++ 
b/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java
@@ -185,14 +185,10 @@ public class TestJavaHoodieBackedMetadata extends 
TestHoodieMetadataBase {
 
   public static List tableOperationsTestArgs() {
 return asList(
-Arguments.of(COPY_ON_WRITE, true, true),
-Arguments.of(COPY_ON_WRITE, true, false),
-Arguments.of(COPY_ON_WRITE, false, true),
-Arguments.of(COPY_ON_WRITE, false, false),
-Arguments.of(MERGE_ON_READ, true, true),
-Arguments.of(MERGE_ON_READ, true, false),
-Arguments.of(MERGE_ON_READ, false, true),
-Arguments.of(MERGE_ON_READ, false, false)
+Arguments.of(COPY_ON_WRITE, true),
+Arguments.of(COPY_ON_WRITE, false),
+Arguments.of(MERGE_ON_READ, true),
+Arguments.of(MERGE_ON_READ, false)
 );
   }
 
@@ -284,14 +280,14 @@ public class TestJavaHoodieBackedMetadata extends 
TestHoodieMetadataBase {
*/
   @ParameterizedTest
   @MethodSource("tableOperationsTestArgs")
-  public void testTableOperations(HoodieTableType tableType, boolean 
enableFullScan, boolean enableMetrics) throws Exception {
+  public void testTableOperations(HoodieTableType tableType, boolean 
enableFullScan) throws Exception {
 List commitTimeList = new ArrayList<>();
 
commitTimeList.add(Long.parseLong(HoodieActiveTimeline.createNewInstantTime()));
 for (int i = 0; i < 8; i++) {
   long nextCommitTime = 
getNextCommitTime(commitTimeList.get(commitTimeList.size() - 1));
   commitTimeList.add(nextCommitTime);
 }
-init(tableType, true, enableFullScan, enableMetrics, false);
+init(tableType, true, enableFullScan, false, false);
 doWriteInsertAndUpsert(testTable, commitTimeList.get(0).toString(), 
commitTimeList.get(1).toString(), false);
 
 // trigger an upsert
diff --git 
a/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/functional/TestHoodieJavaClientOnCopyOnWriteStorage.java
 
b/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/functional/TestHoodieJavaClientOnCopyOnWriteStorage.java
index a3a0b726619..211dc0129e6 100644
--- 
a/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/functional/TestHoodieJavaClientOnCopyOnWriteStorage.java
+++ 
b/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/functional/TestHoodieJavaClientOnCopyOnWriteStorage.java
@@ -150,16 +150,10 @@ public class TestHoodieJavaClientOnCopyOnWriteStorage 
extends HoodieJavaClientTe
 
   private static final String CLUSTERING_FAILURE = "CLUSTERING FAILURE";
 
-  private static Stream populateMetaFieldsParams() {
-return Arrays.stream(new Boolean[][] {{true}, {false}}).map(Arguments::of);
-  }
-
   private static Stream 
rollbackAfterConsistencyCheckFailureParams() {
 return Stream.of(
-Arguments.of(true, true),
-Arguments.of(true, false),
-Arguments.of(false, true),
-Arguments.of(false, false)
+Arguments.of(true),
+Arguments.of(false)
 );
   }
 
@@ -173,56 +167,50 @@ public class TestHoodieJavaClientOnCopyOnWriteStorage 
extends HoodieJavaClientTe
   /**
* Test Auto Commit behavior for HoodieWriteClient insert API.
*/
-  @ParameterizedTest
-  @MethodSource("populateMetaFieldsParams")
-  public void testAutoCommitOnInsert(boolean populateMetaFields) 

[hudi] 29/30: [HUDI-6445] Fixing metrics to use IN-MEMORY type in tests (#9543)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 9be80c7bc0377c9f88a8a4fb957a69561d236ea6
Author: Sivabalan Narayanan 
AuthorDate: Wed Aug 30 17:39:54 2023 -0400

[HUDI-6445] Fixing metrics to use IN-MEMORY type in tests (#9543)
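
For orientation, the test configuration this change moves to looks roughly like the
minimal sketch below. The builder calls mirror those visible in the diff; the
basePath variable is an assumption standing in for whatever the test harness provides.

import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.hudi.config.metrics.HoodieMetricsConfig;
import org.apache.hudi.metrics.MetricsReporterType;

// Sketch only: enable metrics in tests without any external reporter (no Graphite endpoint in CI).
HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder()
    .withPath(basePath)                                          // assumed test table location
    .withMetricsConfig(HoodieMetricsConfig.newBuilder()
        .on(true)
        .withExecutorMetrics(true)
        .withReporterType(MetricsReporterType.INMEMORY.name())   // keep all metrics in-process
        .build())
    .build();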
---
 .../test/java/org/apache/hudi/testutils/TestHoodieMetadataBase.java | 6 ++
 .../apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java  | 3 ++-
 .../org/apache/hudi/client/functional/TestHoodieMetadataBase.java   | 6 ++
 3 files changed, 6 insertions(+), 9 deletions(-)

diff --git 
a/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/testutils/TestHoodieMetadataBase.java
 
b/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/testutils/TestHoodieMetadataBase.java
index e7f13991add..18f872bd86d 100644
--- 
a/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/testutils/TestHoodieMetadataBase.java
+++ 
b/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/testutils/TestHoodieMetadataBase.java
@@ -35,12 +35,12 @@ import org.apache.hudi.config.HoodieCompactionConfig;
 import org.apache.hudi.config.HoodieIndexConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.config.metrics.HoodieMetricsConfig;
-import org.apache.hudi.config.metrics.HoodieMetricsGraphiteConfig;
 import org.apache.hudi.index.HoodieIndex;
 import org.apache.hudi.metadata.HoodieMetadataWriteUtils;
 import org.apache.hudi.metadata.HoodieTableMetadata;
 import org.apache.hudi.metadata.HoodieTableMetadataWriter;
 import org.apache.hudi.metadata.JavaHoodieBackedTableMetadataWriter;
+import org.apache.hudi.metrics.MetricsReporterType;
 import org.apache.hudi.table.HoodieJavaTable;
 import org.apache.hudi.table.HoodieTable;
 
@@ -303,9 +303,7 @@ public class TestHoodieMetadataBase extends 
HoodieJavaClientTestHarness {
 .ignoreSpuriousDeletes(validateMetadataPayloadConsistency)
 .build())
 .withMetricsConfig(HoodieMetricsConfig.newBuilder().on(enableMetrics)
-.withExecutorMetrics(enableMetrics).build())
-.withMetricsGraphiteConfig(HoodieMetricsGraphiteConfig.newBuilder()
-.usePrefix("unit-test").build())
+
.withExecutorMetrics(enableMetrics).withReporterType(MetricsReporterType.INMEMORY.name()).build())
 .withRollbackUsingMarkers(useRollbackUsingMarkers)
 .withProperties(properties);
   }
diff --git 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java
 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java
index f01547e01a9..15b527a0fe3 100644
--- 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java
+++ 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java
@@ -33,6 +33,7 @@ import org.apache.hudi.common.util.Option;
 import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.data.HoodieJavaRDD;
 import org.apache.hudi.metrics.DistributedRegistry;
+import org.apache.hudi.metrics.MetricsReporterType;
 
 import org.apache.hadoop.conf.Configuration;
 import org.apache.spark.api.java.JavaRDD;
@@ -98,7 +99,7 @@ public class SparkHoodieBackedTableMetadataWriter extends 
HoodieBackedTableMetad
   protected void initRegistry() {
 if (metadataWriteConfig.isMetricsOn()) {
   Registry registry;
-  if (metadataWriteConfig.isExecutorMetricsEnabled()) {
+  if (metadataWriteConfig.isExecutorMetricsEnabled() && 
metadataWriteConfig.getMetricsReporterType() != MetricsReporterType.INMEMORY) {
 registry = Registry.getRegistry("HoodieMetadata", 
DistributedRegistry.class.getName());
 HoodieSparkEngineContext sparkEngineContext = 
(HoodieSparkEngineContext) engineContext;
 ((DistributedRegistry) 
registry).register(sparkEngineContext.getJavaSparkContext());
diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieMetadataBase.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieMetadataBase.java
index e0a00c24e92..f8e3750f6a5 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieMetadataBase.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieMetadataBase.java
@@ -35,12 +35,12 @@ import org.apache.hudi.config.HoodieCompactionConfig;
 import org.apache.hudi.config.HoodieIndexConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.config.metrics.HoodieMetricsConfig;
-import org.apache.hudi.config.metrics.HoodieMetricsGraphiteConfig;
 import org.apache.hudi.index.HoodieIndex;
 import org.apache.hudi.metadata.Hoodi

[hudi] 08/30: [HUDI-6731] BigQuerySyncTool: add flag to allow for read optimized sync for MoR tables (#9488)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 1ff0a7f2eb195bb99ee84513653c18983eabeb68
Author: Tim Brown 
AuthorDate: Tue Aug 22 01:48:59 2023 -0500

[HUDI-6731] BigQuerySyncTool: add flag to allow for read optimized sync for 
MoR tables (#9488)
---
 .../main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java| 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git 
a/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java 
b/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java
index e0f5ace6c3a..47aa342dad0 100644
--- a/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java
+++ b/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java
@@ -72,9 +72,9 @@ public class BigQuerySyncTool extends HoodieSyncTool {
 try (HoodieBigQuerySyncClient bqSyncClient = new 
HoodieBigQuerySyncClient(config)) {
   switch (bqSyncClient.getTableType()) {
 case COPY_ON_WRITE:
-  syncCoWTable(bqSyncClient);
-  break;
 case MERGE_ON_READ:
+  syncTable(bqSyncClient);
+  break;
 default:
   throw new UnsupportedOperationException(bqSyncClient.getTableType() 
+ " table type is not supported yet.");
   }
@@ -91,7 +91,7 @@ public class BigQuerySyncTool extends HoodieSyncTool {
 return false;
   }
 
-  private void syncCoWTable(HoodieBigQuerySyncClient bqSyncClient) {
+  private void syncTable(HoodieBigQuerySyncClient bqSyncClient) {
 ValidationUtils.checkState(bqSyncClient.getTableType() == 
HoodieTableType.COPY_ON_WRITE);
 LOG.info("Sync hoodie table " + snapshotViewName + " at base path " + 
bqSyncClient.getBasePath());
 



[hudi] 28/30: [HUDI-3727] Add metrics for async indexer (#9559)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit db2129ebb625637038ba6dea3834b0c6d5bcf55a
Author: Sagar Sumit 
AuthorDate: Thu Aug 31 03:04:01 2023 +0530

[HUDI-3727] Add metrics for async indexer (#9559)
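
In essence, the guarded-metrics pattern this adds around the indexing steps is the
sketch below. Identifiers are taken from the diff; the timer construction is an
assumption, since the diff only shows the HoodieTimer import.

import org.apache.hudi.common.metrics.Registry;
import org.apache.hudi.common.util.HoodieTimer;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.metadata.HoodieMetadataMetrics;

// Only allocate a metrics holder when metadata metrics are enabled.
Option<HoodieMetadataMetrics> metrics = config.getMetadataConfig().enableMetrics()
    ? Option.of(new HoodieMetadataMetrics(Registry.getRegistry("HoodieIndexer")))
    : Option.empty();

HoodieTimer timer = HoodieTimer.start();   // assumed factory; older code uses new HoodieTimer().startTimer()
// ... initialize the requested metadata partitions ...
metrics.ifPresent(m -> m.updateMetrics(HoodieMetadataMetrics.INITIALIZE_STR, timer.endTimer()));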
---
 .../apache/hudi/metadata/HoodieMetadataWriteUtils.java   |  1 -
 .../hudi/table/action/index/RunIndexActionExecutor.java  | 16 +++-
 .../org/apache/hudi/metadata/HoodieMetadataMetrics.java  |  3 ++-
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java
index 2078896987d..e73f6fb7bc3 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java
@@ -68,7 +68,6 @@ public class HoodieMetadataWriteUtils {
   // eventually depend on the number of file groups selected for each 
partition (See estimateFileGroupCount function)
   private static final long MDT_MAX_HFILE_SIZE_BYTES = 10 * 1024 * 1024 * 
1024L; // 10GB
 
-
   /**
* Create a {@code HoodieWriteConfig} to use for the Metadata Table.  This 
is used by async
* indexer only.
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/index/RunIndexActionExecutor.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/index/RunIndexActionExecutor.java
index 9b91167899c..461c525a1d5 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/index/RunIndexActionExecutor.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/index/RunIndexActionExecutor.java
@@ -27,6 +27,7 @@ import org.apache.hudi.avro.model.HoodieRestoreMetadata;
 import org.apache.hudi.avro.model.HoodieRollbackMetadata;
 import org.apache.hudi.client.transaction.TransactionManager;
 import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.metrics.Registry;
 import org.apache.hudi.common.model.HoodieCommitMetadata;
 import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
@@ -35,11 +36,13 @@ import org.apache.hudi.common.table.timeline.HoodieTimeline;
 import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
 import org.apache.hudi.common.util.CleanerUtils;
 import org.apache.hudi.common.util.CollectionUtils;
+import org.apache.hudi.common.util.HoodieTimer;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.exception.HoodieIndexException;
 import org.apache.hudi.exception.HoodieMetadataException;
+import org.apache.hudi.metadata.HoodieMetadataMetrics;
 import org.apache.hudi.metadata.HoodieTableMetadataWriter;
 import org.apache.hudi.metadata.MetadataPartitionType;
 import org.apache.hudi.table.HoodieTable;
@@ -90,6 +93,8 @@ public class RunIndexActionExecutor extends 
BaseActionExecutor metrics;
+
   // we use this to update the latest instant in data timeline that has been 
indexed in metadata table
   // this needs to be volatile as it can be updated in the IndexingCheckTask 
spawned by this executor
   // assumption is that only one indexer can execute at a time
@@ -100,6 +105,11 @@ public class RunIndexActionExecutor extends 
BaseActionExecutor table, String instantTime) {
 super(context, config, table, instantTime);
 this.txnManager = new TransactionManager(config, 
table.getMetaClient().getFs());
+if (config.getMetadataConfig().enableMetrics()) {
+  this.metrics = Option.of(new 
HoodieMetadataMetrics(Registry.getRegistry("HoodieIndexer")));
+} else {
+  this.metrics = Option.empty();
+}
   }
 
   @Override
@@ -143,7 +153,9 @@ public class RunIndexActionExecutor extends 
BaseActionExecutor 
m.updateMetrics(HoodieMetadataMetrics.INITIALIZE_STR, timer.endTimer()));
 
   // get remaining instants to catchup
   List instantsToCatchup = 
getInstantsToCatchup(indexUptoInstant);
@@ -167,7 +179,7 @@ public class RunIndexActionExecutor extends 
BaseActionExecutor 
entry.getMetadataPartitionPath()).collect(Collectors.toList()).toArray()));
+  .map(entry -> 
entry.getMetadataPartitionPath()).collect(Collectors.toList()).toArray()));
 }
   } else {
 String indexUptoInstant = fileIndexPartitionInfo.getIndexUptoInstant();
@@ -275,7 +287,9 @@ public class RunIndexActionExecutor extends 
BaseActionExecutor 
m.updateMetrics(HoodieMetadataMetrics.ASYNC_INDEXER_CATCHUP_TIME, 
timer.endTimer()));
 } catch (Exception e) {
   indexingCatchupTaskFutur

[hudi] 25/30: [MINOR] Fix AWS refactor bug by adding skipTableArchive arg (#9563)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 89a3443173d26a7f6314894cb2aab28f4615f7bf
Author: Tim Brown 
AuthorDate: Tue Aug 29 14:18:55 2023 -0500

[MINOR] Fix AWS refactor bug by adding skipTableArchive arg (#9563)
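
For readers less familiar with the Glue API: UpdateTable archives the previous table
version on every update unless SkipArchive is set, so this one-line fix matters when
the catalog is synced frequently. A sketch of the call it completes, with builder
names as used in the patch and the surrounding variables assumed in scope:

import software.amazon.awssdk.services.glue.model.UpdateTableRequest;

// Sketch only: thread the sync-level skipTableArchive setting through to Glue.
UpdateTableRequest request = UpdateTableRequest.builder()
    .databaseName(databaseName)
    .tableInput(updatedTableInput)
    .skipArchive(skipTableArchive)   // without this, every sync can leave behind an archived table version
    .build();
awsGlue.updateTable(request);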
---
 .../src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java | 1 +
 1 file changed, 1 insertion(+)

diff --git 
a/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java 
b/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java
index bbf96dc221d..d45cc76a6bc 100644
--- 
a/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java
+++ 
b/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java
@@ -607,6 +607,7 @@ public class AWSGlueCatalogSyncClient extends 
HoodieSyncClient {
 
   UpdateTableRequest request =  
UpdateTableRequest.builder().databaseName(databaseName)
   .tableInput(updatedTableInput)
+  .skipArchive(skipTableArchive)
   .build();
   awsGlue.updateTable(request);
   return true;



[hudi] 06/30: [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null Kafka Key test cases (#9459)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 2127d3d2c4a6898fbbf7acdd91f38769bd059e1e
Author: Prathit malik <53890994+prathi...@users.noreply.github.com>
AuthorDate: Tue Aug 22 06:31:47 2023 +0530

[HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null 
Kafka Key test cases (#9459)
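
The heart of the refactor is replacing consumerRecord.key().toString() with the
null-safe StringUtils.objToString, since Kafka keys are optional and may be null
(e.g. keyless producers). As an illustrative re-statement, not Hudi's actual source:

// Roughly what org.apache.hudi.common.util.StringUtils.objToString does:
static String objToString(Object obj) {
  return obj == null ? null : obj.toString();
}

// Before: consumerRecord.key().toString()   -> NullPointerException for a record with no key.
// After:  objToString(consumerRecord.key()) -> null key column value; the record is still ingested.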
---
 .../hudi/utilities/sources/JsonKafkaSource.java|  2 +-
 .../utilities/sources/helpers/AvroConvertor.java   | 11 
 .../utilities/sources/TestAvroKafkaSource.java | 30 ++
 .../utilities/sources/TestJsonKafkaSource.java | 14 ++
 .../utilities/testutils/UtilitiesTestBase.java |  9 +++
 5 files changed, 60 insertions(+), 6 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java
index f31c9b7e542..eb67abfee3a 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java
@@ -81,7 +81,7 @@ public class JsonKafkaSource extends KafkaSource {
 ObjectMapper om = new ObjectMapper();
 partitionIterator.forEachRemaining(consumerRecord -> {
   String recordValue = consumerRecord.value().toString();
-  String recordKey = consumerRecord.key().toString();
+  String recordKey = StringUtils.objToString(consumerRecord.key());
   try {
 ObjectNode jsonNode = (ObjectNode) om.readTree(recordValue);
 jsonNode.put(KAFKA_SOURCE_OFFSET_COLUMN, consumerRecord.offset());
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/AvroConvertor.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/AvroConvertor.java
index 89191cb465c..f9c35bd3b6e 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/AvroConvertor.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/AvroConvertor.java
@@ -19,6 +19,7 @@
 package org.apache.hudi.utilities.sources.helpers;
 
 import org.apache.hudi.avro.MercifulJsonConverter;
+import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.internal.schema.HoodieSchemaException;
 
 import com.google.protobuf.Message;
@@ -171,16 +172,16 @@ public class AvroConvertor implements Serializable {
*/
   public GenericRecord withKafkaFieldsAppended(ConsumerRecord consumerRecord) {
 initSchema();
-GenericRecord record = (GenericRecord) consumerRecord.value();
+GenericRecord recordValue = (GenericRecord) consumerRecord.value();
 GenericRecordBuilder recordBuilder = new GenericRecordBuilder(this.schema);
-for (Schema.Field field :  record.getSchema().getFields()) {
-  recordBuilder.set(field, record.get(field.name()));
+for (Schema.Field field :  recordValue.getSchema().getFields()) {
+  recordBuilder.set(field, recordValue.get(field.name()));
 }
-
+String recordKey = StringUtils.objToString(consumerRecord.key());
 recordBuilder.set(KAFKA_SOURCE_OFFSET_COLUMN, consumerRecord.offset());
 recordBuilder.set(KAFKA_SOURCE_PARTITION_COLUMN, 
consumerRecord.partition());
 recordBuilder.set(KAFKA_SOURCE_TIMESTAMP_COLUMN, 
consumerRecord.timestamp());
-recordBuilder.set(KAFKA_SOURCE_KEY_COLUMN, 
consumerRecord.key().toString());
+recordBuilder.set(KAFKA_SOURCE_KEY_COLUMN, recordKey);
 return recordBuilder.build();
   }
 
diff --git 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestAvroKafkaSource.java
 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestAvroKafkaSource.java
index 2632f72659b..16ec4545665 100644
--- 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestAvroKafkaSource.java
+++ 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestAvroKafkaSource.java
@@ -62,6 +62,7 @@ import static 
org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SO
 import static 
org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SOURCE_TIMESTAMP_COLUMN;
 import static 
org.apache.hudi.utilities.schema.KafkaOffsetPostProcessor.KAFKA_SOURCE_KEY_COLUMN;
 import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertNull;
 import static org.mockito.Mockito.mock;
 
 public class TestAvroKafkaSource extends SparkClientFunctionalTestHarness {
@@ -113,6 +114,17 @@ public class TestAvroKafkaSource extends 
SparkClientFunctionalTestHarness {
 }
   }
 
+  void sendMessagesToKafkaWithNullKafkaKey(String topic, int count, int 
numPartitions) {
+List genericRecords = dataGen.generateGenericRecords(count);
+Properties config = getProducerProperties();
+

[hudi] 04/30: [MINOR] Close record readers after use during tests (#9457)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 6a6bfd7c1e0a08fdb14324d477cb6f44d834f40f
Author: voonhous 
AuthorDate: Sun Aug 20 09:45:51 2023 +0800

[MINOR] Close record readers after use during tests (#9457)
---
 .../test/java/org/apache/hudi/testutils/HoodieMergeOnReadTestUtils.java  | 1 +
 1 file changed, 1 insertion(+)

diff --git 
a/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/testutils/HoodieMergeOnReadTestUtils.java
 
b/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/testutils/HoodieMergeOnReadTestUtils.java
index 6f787db6069..7185115a4d5 100644
--- 
a/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/testutils/HoodieMergeOnReadTestUtils.java
+++ 
b/hudi-client/hudi-client-common/src/test/java/org/apache/hudi/testutils/HoodieMergeOnReadTestUtils.java
@@ -166,6 +166,7 @@ public class HoodieMergeOnReadTestUtils {
   .forEach(fieldsPair -> newRecord.set(fieldsPair.getKey(), 
values[fieldsPair.getValue().pos()]));
   records.add(newRecord.build());
 }
+recordReader.close();
   }
 } catch (IOException ie) {
   LOG.error("Read records error", ie);



[hudi] 14/30: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario (#9468)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 802d75b285bac354b2b106fd72f79498c1e389cb
Author: Jon Vexler 
AuthorDate: Wed Aug 23 22:30:41 2023 -0400

[HUDI-6718] Check Timeline Before Transitioning Inflight Clean in 
Multiwriter Scenario (#9468)

- If two cleans start at nearly the same time, they will both attempt to 
execute the same clean instants. This does not cause any data corruption, but 
it will cause the second writer to fail when it attempts to create the commit 
in the timeline, because the commit will already have been written by the 
first writer. Now, we check the timeline before transitioning state.

Co-authored-by: Jonathan Vexler <=>
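
Because the hunk below renders a little awkwardly in this archive, here is the
recovery check the patch introduces, restated as a sketch with the identifiers
used in the diff:

// CleanActionExecutor#checkIfOtherWriterCommitted, as added by this patch.
private void checkIfOtherWriterCommitted(HoodieInstant hoodieInstant, HoodieIOException e) {
  table.getMetaClient().reloadActiveTimeline();   // refresh: another writer may have finished the clean
  if (table.getCleanTimeline().filterCompletedInstants()
      .containsInstant(hoodieInstant.getTimestamp())) {
    LOG.warn("Clean operation was completed by another writer for instant: " + hoodieInstant);  // benign race
  } else {
    LOG.error("Failed to perform previous clean operation, instant: " + hoodieInstant, e);
    throw e;   // a genuine failure; surface it
  }
}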
---
 .../hudi/table/action/clean/CleanActionExecutor.java   | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
index 05e1056324a..c931e7bce9d 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
@@ -261,8 +261,10 @@ public class CleanActionExecutor extends 
BaseActionExecutor extends 
BaseActionExecutor 0 ? 
cleanMetadataList.get(cleanMetadataList.size() - 1) : null;
   }
+
+  private void checkIfOtherWriterCommitted(HoodieInstant hoodieInstant, 
HoodieIOException e) {
+table.getMetaClient().reloadActiveTimeline();
+if 
(table.getCleanTimeline().filterCompletedInstants().containsInstant(hoodieInstant.getTimestamp()))
 {
+  LOG.warn("Clean operation was completed by another writer for instant: " 
+ hoodieInstant);
+} else {
+  LOG.error("Failed to perform previous clean operation, instant: " + 
hoodieInstant, e);
+  throw e;
+}
+  }
 }



[hudi] 18/30: [HUDI-6754] Fix record reader tests in hudi-hadoop-mr (#9535)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 0d8c34f24da769cd9b0be5f764f897654f9b2b9c
Author: Sagar Sumit 
AuthorDate: Sat Aug 26 01:53:54 2023 +0530

[HUDI-6754] Fix record reader tests in hudi-hadoop-mr (#9535)
---
 .../realtime/AbstractRealtimeRecordReader.java |  1 -
 .../hive/TestHoodieCombineHiveInputFormat.java | 23 +---
 .../TestHoodieMergeOnReadSnapshotReader.java   |  6 +++
 .../realtime/TestHoodieRealtimeRecordReader.java   | 44 +--
 .../hudi/hadoop/testutils/InputFormatTestUtil.java | 63 +++---
 5 files changed, 81 insertions(+), 56 deletions(-)

diff --git 
a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/AbstractRealtimeRecordReader.java
 
b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/AbstractRealtimeRecordReader.java
index 04a05a1d6f0..3cd2a5d05d9 100644
--- 
a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/AbstractRealtimeRecordReader.java
+++ 
b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/AbstractRealtimeRecordReader.java
@@ -133,7 +133,6 @@ public abstract class AbstractRealtimeRecordReader {
   LOG.warn("fall to init HiveAvroSerializer to support payload merge", e);
   this.supportPayload = false;
 }
-
   }
 
   /**
diff --git 
a/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/hive/TestHoodieCombineHiveInputFormat.java
 
b/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/hive/TestHoodieCombineHiveInputFormat.java
index e8c286d8ab7..22e5389a930 100644
--- 
a/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/hive/TestHoodieCombineHiveInputFormat.java
+++ 
b/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/hive/TestHoodieCombineHiveInputFormat.java
@@ -53,6 +53,7 @@ import org.apache.hadoop.mapred.InputSplit;
 import org.apache.hadoop.mapred.JobConf;
 import org.apache.hadoop.mapred.RecordReader;
 import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.AfterEach;
 import org.junit.jupiter.api.BeforeAll;
 import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.Disabled;
@@ -84,8 +85,11 @@ public class TestHoodieCombineHiveInputFormat extends 
HoodieCommonTestHarness {
   }
 
   @AfterAll
-  public static void tearDownClass() {
+  public static void tearDownClass() throws IOException {
 hdfsTestService.stop();
+if (fs != null) {
+  fs.close();
+}
   }
 
   @BeforeEach
@@ -93,6 +97,13 @@ public class TestHoodieCombineHiveInputFormat extends 
HoodieCommonTestHarness {
 assertTrue(fs.mkdirs(new Path(tempDir.toAbsolutePath().toString(;
   }
 
+  @AfterEach
+  public void tearDown() throws IOException {
+if (fs != null) {
+  fs.delete(new Path(tempDir.toAbsolutePath().toString()), true);
+}
+  }
+
   @Test
   public void multiPartitionReadersRealtimeCombineHoodieInputFormat() throws 
Exception {
 // test for HUDI-1718
@@ -154,8 +165,8 @@ public class TestHoodieCombineHiveInputFormat extends 
HoodieCommonTestHarness {
 ArrayWritable arrayWritable = recordReader.createValue();
 int counter = 0;
 
-HoodieCombineRealtimeHiveSplit hiveSplit = 
(HoodieCombineRealtimeHiveSplit)splits[0];
-HoodieCombineRealtimeFileSplit fileSplit = 
(HoodieCombineRealtimeFileSplit)hiveSplit.getInputSplitShim();
+HoodieCombineRealtimeHiveSplit hiveSplit = 
(HoodieCombineRealtimeHiveSplit) splits[0];
+HoodieCombineRealtimeFileSplit fileSplit = 
(HoodieCombineRealtimeFileSplit) hiveSplit.getInputSplitShim();
 List realtimeFileSplits = fileSplit.getRealtimeFileSplits();
 
 while (recordReader.next(nullWritable, arrayWritable)) {
@@ -268,8 +279,8 @@ public class TestHoodieCombineHiveInputFormat extends 
HoodieCommonTestHarness {
 // insert 1000 update records to log file 2
 // now fileid0, fileid1 has no log files, fileid2 has log file
 HoodieLogFormat.Writer writer =
-InputFormatTestUtil.writeDataBlockToLogFile(partitionDir, fs, 
schema, "fileid2", commitTime, newCommitTime,
-numRecords, numRecords, 0);
+InputFormatTestUtil.writeDataBlockToLogFile(partitionDir, fs, schema, 
"fileid2", commitTime, newCommitTime,
+numRecords, numRecords, 0);
 writer.close();
 
 TableDesc tblDesc = Utilities.defaultTd;
@@ -304,7 +315,7 @@ public class TestHoodieCombineHiveInputFormat extends 
HoodieCommonTestHarness {
 // Since the SPLIT_SIZE is 3, we should create only 1 split with all 3 
file groups
 assertEquals(1, splits.length);
 RecordReader recordReader =
-combineHiveInputFormat.getRecordReader(splits[0], jobConf, null);
+combineHiveInputFormat.getRecordReader(splits[0], jobConf, null);
 NullWritable nullWritable = recordReader.createKey();
 ArrayWritable arrayWritable = recordReader.createValue();
 int counter = 0;
diff --git 

[hudi] 12/30: [HUDI-4115] Adding support for schema while loading spark dataset in S3/GCS source (#9502)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit df90640116c7c6123e2faa883b954732bccba55b
Author: harshal 
AuthorDate: Wed Aug 23 13:20:09 2023 +0530

[HUDI-4115] Adding support for schema while loading spark dataset in S3/GCS 
source (#9502)

`CloudObjectsSelectorCommon` now takes an optional schemaProvider.
The Spark datasource read will use the `schemaProvider` schema
instead of the inferred schema when a `schemaProvider` is supplied.

-

Co-authored-by: Sagar Sumit 
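
A minimal sketch of the resulting read path, assuming a SparkSession and the usual
SchemaProvider contract; the method name and the Avro-to-Spark schema conversion
helper are assumptions for illustration, not the exact implementation:

import org.apache.avro.Schema;
import org.apache.hudi.AvroConversionUtils;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.utilities.schema.SchemaProvider;
import org.apache.spark.sql.DataFrameReader;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

// Sketch only: prefer the provider's schema over Spark's inference when one is configured.
static Dataset<Row> readCloudObjects(SparkSession spark, Option<SchemaProvider> schemaProvider, String... paths) {
  DataFrameReader reader = spark.read().format("json");   // "json" is illustrative; the format comes from config
  if (schemaProvider.isPresent()) {
    Schema avroSchema = schemaProvider.get().getSourceSchema();
    StructType sparkSchema = AvroConversionUtils.convertAvroSchemaToStructType(avroSchema);  // assumed helper
    reader = reader.schema(sparkSchema);   // explicit schema: no inference pass, stable types across batches
  }
  return reader.load(paths);
}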
---
 .../sources/GcsEventsHoodieIncrSource.java |  5 +++-
 .../sources/S3EventsHoodieIncrSource.java  |  5 +++-
 .../sources/helpers/CloudDataFetcher.java  |  6 ++--
 .../helpers/CloudObjectsSelectorCommon.java| 17 ++-
 .../sources/TestGcsEventsHoodieIncrSource.java | 34 +++---
 .../sources/TestS3EventsHoodieIncrSource.java  | 28 +-
 .../helpers/TestCloudObjectsSelectorCommon.java| 17 +++
 .../test/resources/schema/sample_data_schema.avsc  | 27 +
 .../src/test/resources/schema/sample_gcs_data.avsc | 31 
 9 files changed, 147 insertions(+), 23 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/GcsEventsHoodieIncrSource.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/GcsEventsHoodieIncrSource.java
index 6eb9a7fdbf7..891881095fd 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/GcsEventsHoodieIncrSource.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/GcsEventsHoodieIncrSource.java
@@ -113,6 +113,8 @@ public class GcsEventsHoodieIncrSource extends 
HoodieIncrSource {
   private final GcsObjectMetadataFetcher gcsObjectMetadataFetcher;
   private final CloudDataFetcher gcsObjectDataFetcher;
   private final QueryRunner queryRunner;
+  private final Option schemaProvider;
+
 
   public static final String GCS_OBJECT_KEY = "name";
   public static final String GCS_OBJECT_SIZE = "size";
@@ -142,6 +144,7 @@ public class GcsEventsHoodieIncrSource extends 
HoodieIncrSource {
 this.gcsObjectMetadataFetcher = gcsObjectMetadataFetcher;
 this.gcsObjectDataFetcher = gcsObjectDataFetcher;
 this.queryRunner = queryRunner;
+this.schemaProvider = Option.ofNullable(schemaProvider);
 
 LOG.info("srcPath: " + srcPath);
 LOG.info("missingCheckpointStrategy: " + missingCheckpointStrategy);
@@ -186,7 +189,7 @@ public class GcsEventsHoodieIncrSource extends 
HoodieIncrSource {
   private Pair>, String> extractData(QueryInfo queryInfo, 
Dataset cloudObjectMetadataDF) {
 List cloudObjectMetadata = 
gcsObjectMetadataFetcher.getGcsObjectMetadata(sparkContext, 
cloudObjectMetadataDF, checkIfFileExists);
 LOG.info("Total number of files to process :" + 
cloudObjectMetadata.size());
-Option> fileDataRows = 
gcsObjectDataFetcher.getCloudObjectDataDF(sparkSession, cloudObjectMetadata, 
props);
+Option> fileDataRows = 
gcsObjectDataFetcher.getCloudObjectDataDF(sparkSession, cloudObjectMetadata, 
props, schemaProvider);
 return Pair.of(fileDataRows, queryInfo.getEndInstant());
   }
 
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java
index 927a8fc3ebb..4b9be847c75 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java
@@ -78,6 +78,8 @@ public class S3EventsHoodieIncrSource extends 
HoodieIncrSource {
   private final QueryRunner queryRunner;
   private final CloudDataFetcher cloudDataFetcher;
 
+  private final Option schemaProvider;
+
   public static class Config {
 // control whether we do existence check for files before consuming them
 @Deprecated
@@ -135,6 +137,7 @@ public class S3EventsHoodieIncrSource extends 
HoodieIncrSource {
 this.missingCheckpointStrategy = getMissingCheckpointStrategy(props);
 this.queryRunner = queryRunner;
 this.cloudDataFetcher = cloudDataFetcher;
+this.schemaProvider = Option.ofNullable(schemaProvider);
   }
 
   @Override
@@ -181,7 +184,7 @@ public class S3EventsHoodieIncrSource extends 
HoodieIncrSource {
 .collectAsList();
 LOG.info("Total number of files to process :" + 
cloudObjectMetadata.size());
 
-Option> datasetOption = 
cloudDataFetcher.getCloudObjectDataDF(sparkSession, cloudObjectMetadata, props);
+Option> datasetOption = 
cloudDataFetcher.getCloudObjectDataDF(sparkSession, cloudObjectMetadata, props, 
schemaProvider);
 return Pair.of(datasetOption, checkPointAndDatase

[hudi] 15/30: [HUDI-6741] Timeline server bug when multiple tables registered with metadata table enabled (#9511)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 8d0e813967a29077cca52fca74e468db0cb2bc24
Author: Tim Brown 
AuthorDate: Thu Aug 24 10:58:19 2023 -0500

[HUDI-6741] Timeline server bug when multiple tables registered with 
metadata table enabled (#9511)
---
 .../client/embedded/EmbeddedTimelineService.java   |  2 +-
 .../java/org/apache/hudi/table/HoodieTable.java|  4 +-
 .../TestRemoteFileSystemViewWithMetadataTable.java | 63 --
 .../common/table/view/FileSystemViewManager.java   | 27 +-
 4 files changed, 63 insertions(+), 33 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java
index c79942524f1..7d794366ba0 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java
@@ -70,7 +70,7 @@ public class EmbeddedTimelineService {
   // Reset to default if set to Remote
   builder.withStorageType(FileSystemViewStorageType.MEMORY);
 }
-return FileSystemViewManager.createViewManager(context, 
writeConfig.getMetadataConfig(), builder.build(), 
writeConfig.getCommonConfig(), basePath);
+return FileSystemViewManager.createViewManagerWithTableMetadata(context, 
writeConfig.getMetadataConfig(), builder.build(), 
writeConfig.getCommonConfig());
   }
 
   public void startServer() throws IOException {
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java
index 59fa69de2e6..f1de637edf5 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java
@@ -145,7 +145,7 @@ public abstract class HoodieTable implements 
Serializable {
 .build();
 this.metadata = HoodieTableMetadata.create(context, metadataConfig, 
config.getBasePath());
 
-this.viewManager = FileSystemViewManager.createViewManager(context, 
config.getMetadataConfig(), config.getViewStorageConfig(), 
config.getCommonConfig(), () -> metadata);
+this.viewManager = FileSystemViewManager.createViewManager(context, 
config.getMetadataConfig(), config.getViewStorageConfig(), 
config.getCommonConfig(), unused -> metadata);
 this.metaClient = metaClient;
 this.index = getIndex(config, context);
 this.storageLayout = getStorageLayout(config);
@@ -164,7 +164,7 @@ public abstract class HoodieTable implements 
Serializable {
 
   private synchronized FileSystemViewManager getViewManager() {
 if (null == viewManager) {
-  viewManager = FileSystemViewManager.createViewManager(getContext(), 
config.getMetadataConfig(), config.getViewStorageConfig(), 
config.getCommonConfig(), () -> metadata);
+  viewManager = FileSystemViewManager.createViewManager(getContext(), 
config.getMetadataConfig(), config.getViewStorageConfig(), 
config.getCommonConfig(), unused -> metadata);
 }
 return viewManager;
   }
diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestRemoteFileSystemViewWithMetadataTable.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestRemoteFileSystemViewWithMetadataTable.java
index a6e304daaa4..adb47cc0694 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestRemoteFileSystemViewWithMetadataTable.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestRemoteFileSystemViewWithMetadataTable.java
@@ -36,6 +36,7 @@ import 
org.apache.hudi.common.table.view.FileSystemViewStorageConfig;
 import org.apache.hudi.common.table.view.FileSystemViewStorageType;
 import org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView;
 import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.common.testutils.HoodieTestUtils;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.config.HoodieCompactionConfig;
@@ -57,9 +58,11 @@ import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
 import java.io.IOException;
+import java.nio.file.Files;
 import java.util.ArrayList;
 import java.util.Collections;
 import java.util.List;
+import java.util.Properties;
 import java.util.concurrent.Callable;
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;
@@ -83,7 +86,6 @@ public class TestRemoteFileSystemViewWithMetadataTable 
extends Hoodi

[hudi] 09/30: [HUDI-6729] Fix get partition values from path for non-string type partition column (#9484)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit ff6b70f545800b431a52dff23f490f3034ce7484
Author: Wechar Yu 
AuthorDate: Wed Aug 23 08:56:53 2023 +0800

[HUDI-6729] Fix get partition values from path for non-string type 
partition column (#9484)

* reuse HoodieSparkUtils#parsePartitionColumnValues to support multiple Spark versions
* assert the partition values parsed from the path
* throw an exception instead of returning an empty InternalRow when an exception is 
encountered in HoodieBaseRelation#getPartitionColumnsAsInternalRowInternal
---
 .../scala/org/apache/hudi/HoodieBaseRelation.scala | 51 ++---
 .../TestGetPartitionValuesFromPath.scala   | 53 ++
 2 files changed, 76 insertions(+), 28 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala
index 0f7eb27fd04..9ace93ed495 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala
@@ -32,8 +32,8 @@ import org.apache.hudi.common.config.{ConfigProperty, 
HoodieMetadataConfig, Seri
 import org.apache.hudi.common.fs.FSUtils
 import org.apache.hudi.common.fs.FSUtils.getRelativePartitionPath
 import org.apache.hudi.common.model.{FileSlice, HoodieFileFormat, HoodieRecord}
-import org.apache.hudi.common.table.timeline.{HoodieTimeline, TimelineUtils}
-import 
org.apache.hudi.common.table.timeline.TimelineUtils.{HollowCommitHandling, 
validateTimestampAsOf, handleHollowCommitIfNeeded}
+import org.apache.hudi.common.table.timeline.HoodieTimeline
+import 
org.apache.hudi.common.table.timeline.TimelineUtils.validateTimestampAsOf
 import org.apache.hudi.common.table.view.HoodieTableFileSystemView
 import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient, 
TableSchemaResolver}
 import org.apache.hudi.common.util.StringUtils.isNullOrEmpty
@@ -41,6 +41,7 @@ import org.apache.hudi.common.util.ValidationUtils.checkState
 import org.apache.hudi.common.util.{ConfigUtils, StringUtils}
 import org.apache.hudi.config.HoodieBootstrapConfig.DATA_QUERIES_ONLY
 import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.exception.HoodieException
 import org.apache.hudi.hadoop.CachingPath
 import org.apache.hudi.internal.schema.convert.AvroInternalSchemaConverter
 import org.apache.hudi.internal.schema.utils.{InternalSchemaUtils, SerDeHelper}
@@ -54,6 +55,7 @@ import 
org.apache.spark.sql.HoodieCatalystExpressionUtils.{convertToCatalystExpr
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.analysis.Resolver
 import org.apache.spark.sql.catalyst.expressions.{Expression, 
SubqueryExpression}
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
 import org.apache.spark.sql.execution.FileRelation
 import org.apache.spark.sql.execution.datasources._
 import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
@@ -62,7 +64,6 @@ import org.apache.spark.sql.hudi.HoodieSqlCommonUtils
 import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
 import org.apache.spark.sql.types.StructType
 import org.apache.spark.sql.{Row, SQLContext, SparkSession}
-import org.apache.spark.unsafe.types.UTF8String
 
 import java.net.URI
 import scala.collection.JavaConverters._
@@ -482,32 +483,26 @@ abstract class HoodieBaseRelation(val sqlContext: 
SQLContext,
 
   protected def getPartitionColumnsAsInternalRowInternal(file: FileStatus, 
basePath: Path,
  
extractPartitionValuesFromPartitionPath: Boolean): InternalRow = {
-try {
-  val tableConfig = metaClient.getTableConfig
-  if (extractPartitionValuesFromPartitionPath) {
-val tablePathWithoutScheme = 
CachingPath.getPathWithoutSchemeAndAuthority(basePath)
-val partitionPathWithoutScheme = 
CachingPath.getPathWithoutSchemeAndAuthority(file.getPath.getParent)
-val relativePath = new 
URI(tablePathWithoutScheme.toString).relativize(new 
URI(partitionPathWithoutScheme.toString)).toString
-val hiveStylePartitioningEnabled = 
tableConfig.getHiveStylePartitioningEnable.toBoolean
-if (hiveStylePartitioningEnabled) {
-  val partitionSpec = PartitioningUtils.parsePathFragment(relativePath)
-  
InternalRow.fromSeq(partitionColumns.map(partitionSpec(_)).map(UTF8String.fromString))
-} else {
-  if (partitionColumns.length == 1) {
-InternalRow.fromSeq(Seq(UTF8String.fromString(relativePath)))
-  } else {
-val parts = relativePath.split("/")
-assert(parts.size == partitionColu

[hudi] 10/30: [HUDI-6692] Don't default to bulk insert on nonpkless table if recordkey is omitted (#9444)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 5f4bcc8f434bc5646fee007732605beea4f66644
Author: Jon Vexler 
AuthorDate: Tue Aug 22 23:40:08 2023 -0400

[HUDI-6692] Don't default to bulk insert on nonpkless table if recordkey is 
omitted (#9444)

- If a write to a table with a primary key was missing the record key field in 
the options, it could default to bulk insert because the check used the 
pre-merge properties. Now it uses the post-merge properties for the record key 
field.

-

Co-authored-by: Jonathan Vexler <=>
---
 .../scala/org/apache/hudi/HoodieSparkSqlWriter.scala |  2 +-
 .../apache/hudi/functional/TestCOWDataSource.scala   | 20 ++--
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
index 1387b3e2205..e98d72d8284 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
@@ -438,7 +438,7 @@ object HoodieSparkSqlWriter {
   operation
 } else {
   // if no record key, and no meta fields, we should treat it as append 
only workload and make bulk_insert as operation type.
-  if 
(!paramsWithoutDefaults.containsKey(DataSourceWriteOptions.RECORDKEY_FIELD.key())
+  if (!hoodieConfig.contains(DataSourceWriteOptions.RECORDKEY_FIELD.key())
 && !paramsWithoutDefaults.containsKey(OPERATION.key()) && 
!df.schema.fieldNames.contains(HoodieRecord.RECORD_KEY_METADATA_FIELD)) {
 log.warn(s"Choosing BULK_INSERT as the operation type since auto 
record key generation is applicable")
 operation = WriteOperationType.BULK_INSERT
diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
index ad443ff87a1..bb36b9cdd27 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
@@ -26,9 +26,9 @@ import org.apache.hudi.client.common.HoodieSparkEngineContext
 import org.apache.hudi.common.config.{HoodieCommonConfig, HoodieMetadataConfig}
 import 
org.apache.hudi.common.config.TimestampKeyGeneratorConfig.{TIMESTAMP_INPUT_DATE_FORMAT,
 TIMESTAMP_OUTPUT_DATE_FORMAT, TIMESTAMP_TIMEZONE_FORMAT, TIMESTAMP_TYPE_FIELD}
 import org.apache.hudi.common.fs.FSUtils
-import org.apache.hudi.common.model.HoodieRecord
+import org.apache.hudi.common.model.{HoodieRecord, WriteOperationType}
 import org.apache.hudi.common.model.HoodieRecord.HoodieRecordType
-import org.apache.hudi.common.table.timeline.HoodieInstant
+import org.apache.hudi.common.table.timeline.{HoodieInstant, TimelineUtils}
 import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
 import org.apache.hudi.common.testutils.HoodieTestDataGenerator
 import 
org.apache.hudi.common.testutils.RawTripTestPayload.{deleteRecordsToStrings, 
recordsToStrings}
@@ -261,6 +261,7 @@ class TestCOWDataSource extends HoodieSparkClientTestBase 
with ScalaAssertionSup
 // this write should succeed even w/o setting any param for record key, 
partition path since table config will be re-used.
 writeToHudi(optsWithNoRepeatedTableConfig, inputDF)
 
spark.read.format("org.apache.hudi").options(readOpts).load(basePath).count()
+assertLastCommitIsUpsert()
   }
 
   @Test
@@ -298,6 +299,7 @@ class TestCOWDataSource extends HoodieSparkClientTestBase 
with ScalaAssertionSup
 // this write should succeed even w/o though we don't set key gen 
explicitly.
 writeToHudi(optsWithNoRepeatedTableConfig, inputDF)
 
spark.read.format("org.apache.hudi").options(readOpts).load(basePath).count()
+assertLastCommitIsUpsert()
   }
 
   @Test
@@ -334,6 +336,7 @@ class TestCOWDataSource extends HoodieSparkClientTestBase 
with ScalaAssertionSup
 // this write should succeed even w/o though we set key gen explicitly, 
its the default
 writeToHudi(optsWithNoRepeatedTableConfig, inputDF)
 
spark.read.format("org.apache.hudi").options(readOpts).load(basePath).count()
+assertLastCommitIsUpsert()
   }
 
   private def writeToHudi(opts: Map[String, String], df: Dataset[Row]): Unit = 
{
@@ -1648,6 +1651,19 @@ class TestCOWDataSource extends 
HoodieSparkClientTestBase with ScalaAssertionSup
   }
 }
   }
+
+  def assertLastCommitIsUpsert(): Boolean = {
+val metaClient = HoodieTableMeta

[hudi] 11/30: [HUDI-6549] Add support for comma separated path format for spark.read.load (#9503)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 55855cd68887c40f3666b854273722f2e7e8d430
Author: harshal 
AuthorDate: Wed Aug 23 12:16:47 2023 +0530

[HUDI-6549] Add support for comma separated path format for spark.read.load 
(#9503)
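
The new flag only changes how the collected object paths are handed to Spark. By
default each path is passed as a separate load argument; some connectors (certain
XML readers, per the config's javadoc in the diff) instead expect one comma-separated
string. A sketch with an assumed SparkSession and illustrative paths/format:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Default behaviour: one path per argument.
Dataset<Row> perArgument = spark.read().format("json")
    .load("s3a://bucket/in/a.json", "s3a://bucket/in/b.json");

// hoodie.deltastreamer.source.cloud.data.reader.comma.separated.path.format=true:
// the same paths joined into a single comma separated string.
Dataset<Row> commaSeparated = spark.read().format("json")
    .load("s3a://bucket/in/a.json,s3a://bucket/in/b.json");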
---
 .../sources/helpers/CloudObjectsSelectorCommon.java  | 11 ++-
 .../utilities/sources/helpers/CloudStoreIngestionConfig.java | 12 
 .../sources/helpers/TestCloudObjectsSelectorCommon.java  |  1 +
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelectorCommon.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelectorCommon.java
index 4b95cc159cc..6791b47b129 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelectorCommon.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelectorCommon.java
@@ -53,6 +53,7 @@ import static 
org.apache.hudi.common.util.CollectionUtils.isNullOrEmpty;
 import static org.apache.hudi.common.util.ConfigUtils.containsConfigProperty;
 import static org.apache.hudi.common.util.ConfigUtils.getStringWithAltKeys;
 import static 
org.apache.hudi.utilities.config.CloudSourceConfig.PATH_BASED_PARTITION_FIELDS;
+import static 
org.apache.hudi.utilities.sources.helpers.CloudStoreIngestionConfig.SPARK_DATASOURCE_READER_COMMA_SEPARATED_PATH_FORMAT;
 import static org.apache.spark.sql.functions.input_file_name;
 import static org.apache.spark.sql.functions.split;
 
@@ -181,7 +182,15 @@ public class CloudObjectsSelectorCommon {
 totalSize *= 1.1;
 long parquetMaxFileSize = props.getLong(PARQUET_MAX_FILE_SIZE.key(), 
Long.parseLong(PARQUET_MAX_FILE_SIZE.defaultValue()));
 int numPartitions = (int) Math.max(totalSize / parquetMaxFileSize, 1);
-Dataset dataset = reader.load(paths.toArray(new 
String[cloudObjectMetadata.size()])).coalesce(numPartitions);
+boolean isCommaSeparatedPathFormat = 
props.getBoolean(SPARK_DATASOURCE_READER_COMMA_SEPARATED_PATH_FORMAT, false);
+
+Dataset dataset;
+if (isCommaSeparatedPathFormat) {
+  dataset = reader.load(String.join(",", paths));
+} else {
+  dataset = reader.load(paths.toArray(new 
String[cloudObjectMetadata.size()]));
+}
+dataset = dataset.coalesce(numPartitions);
 
 // add partition column from source path if configured
 if (containsConfigProperty(props, PATH_BASED_PARTITION_FIELDS)) {
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudStoreIngestionConfig.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudStoreIngestionConfig.java
index fc8591e0cb9..66b94177b7b 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudStoreIngestionConfig.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudStoreIngestionConfig.java
@@ -102,4 +102,16 @@ public class CloudStoreIngestionConfig {
*/
   @Deprecated
   public static final String DATAFILE_FORMAT = 
CloudSourceConfig.DATAFILE_FORMAT.key();
+
+  /**
+   * A comma delimited list of path-based partition fields in the source file 
structure
+   */
+  public static final String PATH_BASED_PARTITION_FIELDS = 
"hoodie.deltastreamer.source.cloud.data.partition.fields.from.path";
+
+  /**
+   * boolean value for specifying path format in load args of 
spark.read.format("..").load("a.xml,b.xml,c.xml"),
+   * set true if path format needs to be comma separated string value, if 
false it's passed as array of strings like
+   * spark.read.format("..").load(new String[]{a.xml,b.xml,c.xml})
+   */
+  public static final String 
SPARK_DATASOURCE_READER_COMMA_SEPARATED_PATH_FORMAT = 
"hoodie.deltastreamer.source.cloud.data.reader.comma.separated.path.format";
 }
diff --git 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestCloudObjectsSelectorCommon.java
 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestCloudObjectsSelectorCommon.java
index dd467146d51..13818d98c76 100644
--- 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestCloudObjectsSelectorCommon.java
+++ 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestCloudObjectsSelectorCommon.java
@@ -79,6 +79,7 @@ public class TestCloudObjectsSelectorCommon extends 
HoodieSparkClientTestHarness
   public void partitionKeyNotPresentInPath() {
 List input = Collections.singletonList(new 
CloudObjectMetadata("src/test/resources/data/partitioned/country=US/state=CA/data.json",
 1));
 TypedProperties properties = new TypedProperties();
+
properties.put("hoodie.d

[hudi] 03/30: [MINOR] StreamerUtil#getTableConfig should check whether hoodie.properties exists (#9464)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 544e999c005446c3c98c53e78daa73b2abbfd5ea
Author: Nicholas Jiang 
AuthorDate: Fri Aug 18 10:03:12 2023 +0800

[MINOR] StreamerUtil#getTableConfig should check whether hoodie.properties 
exists (#9464)
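
Spelled out: the probe now targets the properties file itself rather than the
metadata folder, so a half-initialized table (an empty .hoodie directory) no longer
yields a table config. A sketch of the resulting logic, using identifiers from the
patch:

// StreamerUtil#getTableConfig, core of the check after this change (error handling elided).
FileSystem fs = FSUtils.getFs(basePath, hadoopConf);
Path metaPath = new Path(basePath, HoodieTableMetaClient.METAFOLDER_NAME);        // ".hoodie"
if (fs.exists(new Path(metaPath, HoodieTableConfig.HOODIE_PROPERTIES_FILE))) {    // ".hoodie/hoodie.properties"
  return Option.of(new HoodieTableConfig(fs, metaPath.toString(), null, null));
}
return Option.empty();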
---
 .../hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java
index 4912c0abf03..842e732abd4 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java
@@ -312,7 +312,7 @@ public class StreamerUtil {
 FileSystem fs = FSUtils.getFs(basePath, hadoopConf);
 Path metaPath = new Path(basePath, HoodieTableMetaClient.METAFOLDER_NAME);
 try {
-  if (fs.exists(metaPath)) {
+  if (fs.exists(new Path(metaPath, 
HoodieTableConfig.HOODIE_PROPERTIES_FILE))) {
 return Option.of(new HoodieTableConfig(fs, metaPath.toString(), null, 
null));
   }
 } catch (IOException e) {



[hudi] branch release-0.14.0 updated (9bc6a28010c -> d995bb8262c)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a change to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 9bc6a28010c [MINOR] Fix build on master (#9452)
 new be3a7004cf8 [HUDI-6587] Check incomplete commit for time travel query 
(#9280)
 new d600e98de63 [HUDI-6476][FOLLOW-UP] Path filter by FileStatus to avoid 
additional fs request (#9366)
 new 544e999c005 [MINOR] StreamerUtil#getTableConfig should check whether 
hoodie.properties exists (#9464)
 new 6a6bfd7c1e0 [MINOR] Close record readers after use during tests (#9457)
 new 0ea1f1b68cb [HUDI-6156] Prevent leaving tmp file in timeline, delete 
tmp file when rename throw exception (#9483)
 new 2127d3d2c4a [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor 
Refactor & Added null Kafka Key test cases (#9459)
 new 18f0431 [HUDI-6733] Add flink-metrics-dropwizard to flink bundle 
(#9499)
 new 1ff0a7f2eb1 [HUDI-6731] BigQuerySyncTool: add flag to allow for read 
optimized sync for MoR tables (#9488)
 new ff6b70f5458 [HUDI-6729] Fix get partition values from path for 
non-string type partition column (#9484)
 new 5f4bcc8f434 [HUDI-6692] Don't default to bulk insert on nonpkless 
table if recordkey is omitted (#9444)
 new 55855cd6888 [HUDI-6549] Add support for comma separated path format 
for spark.read.load (#9503)
 new df90640116c [HUDI-4115] Adding support for schema while loading spark 
dataset in S3/GCS source (#9502)
 new 0b4c95cdad0 [HUDI-6621] Fix downgrade handler for 0.14.0 (#9467)
 new 802d75b285b [HUDI-6718] Check Timeline Before Transitioning Inflight 
Clean in Multiwriter Scenario (#9468)
 new 8d0e813967a [HUDI-6741] Timeline server bug when multiple tables 
registered with metadata table enabled (#9511)
 new 1c16d60fef9 [HUDI-6735] Adding support for snapshotLoadQuerySplitter 
for incremental sources. (#9501)
 new a7690eca670 [HUDI-6445] Triage ci flakiness and some test fixes (#9534)
 new 0d8c34f24da [HUDI-6754] Fix record reader tests in hudi-hadoop-mr 
(#9535)
 new 256957a689e [HUDI-6681] Ensure MOR Column Stats Index skips reading 
filegroups correctly (#9422)
 new f4b139a0556 [MINOR] Add write operation in alter schema commit 
metadata (#9509)
 new 5e3bf05b282 [MINOR] Add detail exception when instant transition state 
(#9476)
 new 3eb6de6d00b [HUDI-4631] Adding retries to spark datasource writes on 
conflict failures (#6854)
 new a4f542931c1 [MINOR] Modify return type description (#9479)
 new 2009b0f4466 [HUDI-6726] Fix connection leaks related to file reader 
and iterator close (#9539)
 new 89a3443173d [MINOR] Fix AWS refactor bug by adding skipTableArchive 
arg (#9563)
 new eed034b5c82 [HUDI-6758] Detecting and skipping Spurious log blocks 
with MOR reads (#9545)
 new 2aaf4027110 [MINOR] Fixing warn log with auto key gen (#9547)
 new db2129ebb62 [HUDI-3727] Add metrics for async indexer (#9559)
 new 9be80c7bc03 [HUDI-6445] Fixing metrics to use IN-MEMORY type in tests 
(#9543)
 new d995bb8262c [HUDI-6763] Optimize collect calls (#9561)

The 30 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../hudi/aws/sync/AWSGlueCatalogSyncClient.java|   1 +
 .../client/embedded/EmbeddedTimelineService.java   |   2 +-
 .../org/apache/hudi/config/HoodieLockConfig.java   |  16 +-
 .../org/apache/hudi/config/HoodieWriteConfig.java  |   6 +
 .../org/apache/hudi/io/HoodieAppendHandle.java |  14 +-
 .../hudi/metadata/HoodieMetadataWriteUtils.java|   1 -
 .../java/org/apache/hudi/table/HoodieTable.java|   4 +-
 .../table/action/clean/CleanActionExecutor.java|  14 +-
 .../table/action/commit/HoodieMergeHelper.java |   5 +-
 .../table/action/index/RunIndexActionExecutor.java |  16 +-
 .../table/upgrade/SixToFiveDowngradeHandler.java   |  53 ++-
 .../table/upgrade/SupportsUpgradeDowngrade.java|   3 +
 .../io/storage/TestHoodieHFileReaderWriter.java|  10 +-
 .../hudi/testutils/HoodieMergeOnReadTestUtils.java |   1 +
 .../table/upgrade/FlinkUpgradeDowngradeHelper.java |   7 +
 .../table/upgrade/JavaUpgradeDowngradeHelper.java  |   7 +
 .../hudi/client/TestJavaHoodieBackedMetadata.java  |  16 +-
 .../TestHoodieJavaClientOnCopyOnWriteStorage.java  | 185 
 .../testutils/HoodieJavaClientTestHarness.java | 140 +++---
 .../hudi/testutils/TestHoodieMetadataBase.java |   6 +-
 .../SparkHoodieBackedTableMetadataWriter.java  |   3 +-
 .../commit/BaseSparkCommitActionExecutor.java  |  14 +-
 .../table/upgrade/SparkUpgradeDowngradeHelper.java |   7 +
 .../functional/TestHoodieBackedMetadata.java   |  18 +-
 .../client/functional/TestHoodieMetadataBase.java  |   6 +-
 .../TestRemoteFileSyst

[hudi] 01/30: [HUDI-6587] Check incomplete commit for time travel query (#9280)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit be3a7004cf8c46595b49291b2b643848eb29424c
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Tue Aug 8 17:13:38 2023 -0500

[HUDI-6587] Check incomplete commit for time travel query (#9280)
---
 .../org/apache/hudi/BaseHoodieTableFileIndex.java  |   5 +
 .../hudi/common/table/timeline/TimelineUtils.java  |  30 +++-
 .../hudi/exception/HoodieTimeTravelException.java  |  29 
 .../hudi/hadoop/HoodieROTablePathFilter.java   |  14 +-
 .../scala/org/apache/hudi/HoodieBaseRelation.scala |   5 +-
 .../hudi/functional/TestTimeTravelQuery.scala  | 182 +++--
 6 files changed, 173 insertions(+), 92 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java 
b/hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java
index 3a24ef4dd2f..7ba20795790 100644
--- a/hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java
+++ b/hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java
@@ -61,6 +61,7 @@ import java.util.stream.Collectors;
 
 import static 
org.apache.hudi.common.config.HoodieMetadataConfig.DEFAULT_METADATA_ENABLE_FOR_READERS;
 import static org.apache.hudi.common.config.HoodieMetadataConfig.ENABLE;
+import static 
org.apache.hudi.common.table.timeline.TimelineUtils.validateTimestampAsOf;
 import static org.apache.hudi.common.util.CollectionUtils.combine;
 import static org.apache.hudi.hadoop.CachingPath.createRelativePathUnsafe;
 
@@ -243,6 +244,10 @@ public abstract class BaseHoodieTableFileIndex implements 
AutoCloseable {
   return Collections.emptyMap();
 }
 
+if (specifiedQueryInstant.isPresent() && !shouldIncludePendingCommits) {
+  validateTimestampAsOf(metaClient, specifiedQueryInstant.get());
+}
+
 FileStatus[] allFiles = listPartitionPathFiles(partitions);
 HoodieTimeline activeTimeline = getActiveTimeline();
 Option latestInstant = activeTimeline.lastInstant();
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
index 14a03ce60ef..a763f4d9053 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
@@ -30,6 +30,7 @@ import org.apache.hudi.common.util.CollectionUtils;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.HoodieTimeTravelException;
 
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -47,9 +48,11 @@ import static 
org.apache.hudi.common.config.HoodieCommonConfig.INCREMENTAL_READ_
 import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION;
 import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION;
 import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.GREATER_THAN;
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.GREATER_THAN_OR_EQUALS;
 import static org.apache.hudi.common.table.timeline.HoodieTimeline.LESSER_THAN;
 import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.REPLACE_COMMIT_ACTION;
 import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.SAVEPOINT_ACTION;
+import static 
org.apache.hudi.common.table.timeline.HoodieTimeline.compareTimestamps;
 
 /**
  * TimelineUtils provides a common way to query incremental meta-data changes 
for a hoodie table.
@@ -244,8 +247,8 @@ public class TimelineUtils {
 if (lastMaxCompletionTime.isPresent()) {
   // Get 'hollow' instants that have less instant time than 
exclusiveStartInstantTime but with greater commit completion time
   HoodieDefaultTimeline hollowInstantsTimeline = (HoodieDefaultTimeline) 
timeline.getCommitsTimeline()
-  .filter(s -> HoodieTimeline.compareTimestamps(s.getTimestamp(), 
LESSER_THAN, exclusiveStartInstantTime))
-  .filter(s -> 
HoodieTimeline.compareTimestamps(s.getStateTransitionTime(), GREATER_THAN, 
lastMaxCompletionTime.get()));
+  .filter(s -> compareTimestamps(s.getTimestamp(), LESSER_THAN, 
exclusiveStartInstantTime))
+  .filter(s -> compareTimestamps(s.getStateTransitionTime(), 
GREATER_THAN, lastMaxCompletionTime.get()));
   if (!hollowInstantsTimeline.empty()) {
 return timelineSinceLastSync.mergeTimeline(hollowInstantsTimeline);
   }
@@ -315,6 +318,29 @@ public class TimelineUtils {
 }
   }
 
+  /**
+   * Validate user-specified timestamp of time travel query against incomplete 
commit's timestamp.
+   *
+   * @throws HoodieException when time travel query
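
For context, a minimal sketch (not part of the patch above) of the query path this validation guards, assuming the standard "as.of.instant" time-travel read option; the instant and table path below are hypothetical:

    // Scala sketch of a time-travel read against a Hudi table
    val asOfDf = spark.read.format("hudi")
      .option("as.of.instant", "20230901123045000")  // hypothetical instant
      .load("/path/to/hudi_table")                   // hypothetical base path
    // With this change, a requested instant at or after an incomplete commit is
    // expected to fail the query with a time-travel exception rather than
    // silently reading a partial snapshot.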

[hudi] 02/30: [HUDI-6476][FOLLOW-UP] Path filter by FileStatus to avoid additional fs request (#9366)

2023-09-01 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit d600e98de63a7a877fd460ee0caca93265fc3bc5
Author: Wechar Yu 
AuthorDate: Fri Aug 18 09:43:48 2023 +0800

[HUDI-6476][FOLLOW-UP] Path filter by FileStatus to avoid additional fs 
request (#9366)
---
 .../metadata/FileSystemBackedTableMetadata.java| 95 ++
 1 file changed, 41 insertions(+), 54 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
 
b/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
index b4a4da01977..8ea9861734a 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
@@ -54,6 +54,7 @@ import java.util.List;
 import java.util.Map;
 import java.util.concurrent.CopyOnWriteArrayList;
 import java.util.stream.Collectors;
+import java.util.stream.Stream;
 
 /**
  * Implementation of {@link HoodieTableMetadata} based file-system-backed 
table metadata.
@@ -167,66 +168,52 @@ public class FileSystemBackedTableMetadata extends 
AbstractHoodieTableMetadata {
   // TODO: Get the parallelism from HoodieWriteConfig
   int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, 
pathsToList.size());
 
-  // List all directories in parallel
+  // List all directories in parallel:
+  // if current dictionary contains PartitionMetadata, add it to result
+  // if current dictionary does not contain PartitionMetadata, add its 
subdirectory to queue to be processed.
   engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all 
partitions with prefix " + relativePathPrefix);
-  List dirToFileListing = engineContext.flatMap(pathsToList, 
path -> {
+  // result below holds a list of pair. first entry in the pair optionally 
holds the deduced list of partitions.
+  // and second entry holds optionally a directory path to be processed 
further.
+  List, Option>> result = 
engineContext.flatMap(pathsToList, path -> {
 FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-return Arrays.stream(fileSystem.listStatus(path));
+if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) {
+  return 
Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(),
 path)), Option.empty()));
+}
+return Arrays.stream(fileSystem.listStatus(path))
+.filter(status -> status.isDirectory() && 
!status.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME))
+.map(status -> Pair.of(Option.empty(), 
Option.of(status.getPath(;
   }, listingParallelism);
   pathsToList.clear();
 
-  // if current dictionary contains PartitionMetadata, add it to result
-  // if current dictionary does not contain PartitionMetadata, add it to 
queue to be processed.
-  int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, 
dirToFileListing.size());
-  if (!dirToFileListing.isEmpty()) {
-// result below holds a list of pair. first entry in the pair 
optionally holds the deduced list of partitions.
-// and second entry holds optionally a directory path to be processed 
further.
-engineContext.setJobStatus(this.getClass().getSimpleName(), 
"Processing listed partitions");
-List, Option>> result = 
engineContext.map(dirToFileListing, fileStatus -> {
-  FileSystem fileSystem = 
fileStatus.getPath().getFileSystem(hadoopConf.get());
-  if (fileStatus.isDirectory()) {
-if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, 
fileStatus.getPath())) {
-  return 
Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), 
fileStatus.getPath())), Option.empty());
-} else if 
(!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) 
{
-  return Pair.of(Option.empty(), Option.of(fileStatus.getPath()));
-}
-  } else if 
(fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX))
 {
-String partitionName = 
FSUtils.getRelativePartitionPath(dataBasePath.get(), 
fileStatus.getPath().getParent());
-return Pair.of(Option.of(partitionName), Option.empty());
-  }
-  return Pair.of(Option.empty(), Option.empty());
-}, fileListingParallelism);
-
-partitionPaths.addAll(result.stream().filter(entry -> 
entry.getKey().isPresent())
-.map(entry -> entry.getKey().get())
-.filter(relativePartitionPath -> fullBoundExpr instanceof 
Predicates.TrueExpression
-|| (Boolean) fullBoundEx

svn commit: r63492 - in /dev/hudi/hudi-0.14.0-rc1: ./ hudi-0.14.0-rc1.src.tgz hudi-0.14.0-rc1.src.tgz.asc hudi-0.14.0-rc1.src.tgz.sha512

2023-08-18 Thread pwason
Author: pwason
Date: Fri Aug 18 09:03:38 2023
New Revision: 63492

Log:
Add Apache Hudi 0.14.0-rc1 source release


Added:
dev/hudi/hudi-0.14.0-rc1/
dev/hudi/hudi-0.14.0-rc1/hudi-0.14.0-rc1.src.tgz   (with props)
dev/hudi/hudi-0.14.0-rc1/hudi-0.14.0-rc1.src.tgz.asc
dev/hudi/hudi-0.14.0-rc1/hudi-0.14.0-rc1.src.tgz.sha512

Added: dev/hudi/hudi-0.14.0-rc1/hudi-0.14.0-rc1.src.tgz
==
Binary file - no diff available.

Propchange: dev/hudi/hudi-0.14.0-rc1/hudi-0.14.0-rc1.src.tgz
--
svn:mime-type = application/octet-stream

Added: dev/hudi/hudi-0.14.0-rc1/hudi-0.14.0-rc1.src.tgz.asc
==
--- dev/hudi/hudi-0.14.0-rc1/hudi-0.14.0-rc1.src.tgz.asc (added)
+++ dev/hudi/hudi-0.14.0-rc1/hudi-0.14.0-rc1.src.tgz.asc Fri Aug 18 09:03:38 
2023
@@ -0,0 +1,16 @@
+-BEGIN PGP SIGNATURE-
+
+iQIzBAABCAAdFiEEdcV0Tp5c1cSOGcCCxNhY1zudsbgFAmTfKo8ACgkQxNhY1zud
+sbi80w/+LhNcb1nn43s4Qpkr2HrXlN4F/yceYmKoMD7oDmPmHshTJ+ebxrAcaAnu
++yS+pSgDVmW8TvG1lHpPphFnhH/dUV1F8AwGbd7n9kEdbtg5dR1ACDeQMHz5J6Bo
+2JO2NjQBJvAD++jA/yc2FIPgz7AUY+hSiV7Dc8Cn3L7IuaIcy4avuEsVNlSujbMs
+kq51l4IaW7l97jj6Iq5++l/Ym6zaVV6EfEQVuJ2PgRgzHmqSrXPjZbs8Z4j8Ju+V
+/hYDSHIuzwZ1Od0zSKSqB8lUfJxj0BDJuIj3CoNv/dNJMfzcjMKRl9XWQdtFMX3i
+d/87kqq3S6X0ZSQS4hSOBedy81x0N8QKDmOsIy8zllrxUkyndYuoIdj8k3AcAP95
+/gT5q41vVTYDPCsXPPTZEwZDpckZj3GmryE381XPioQlJAcVdnXZ5myFqckjcSrv
+QrfkTwK71hCmz4oGhgxQDhy4fAG7P7W1BhhsXXkyUjBJ8j55Oqbq97J60a/p8xZg
+hyQigZ0yAFanqTxhkQxwk4L4nMmuTfqw8iTB+xRBOnNxXMTS56Z0kavu/KwdxZv7
+aYmlVr8aIXdWcxtEZNebBLEqTXCn9fJAF+hO+SdbQaTDwufg365KzDTBrEVRhPsE
+KdZqujgMEfdMx1vhQ3moCP0zEBWeZRDQB53/Uer6mjw3FG17yFM=
+=2a3T
+-END PGP SIGNATURE-

Added: dev/hudi/hudi-0.14.0-rc1/hudi-0.14.0-rc1.src.tgz.sha512
==
--- dev/hudi/hudi-0.14.0-rc1/hudi-0.14.0-rc1.src.tgz.sha512 (added)
+++ dev/hudi/hudi-0.14.0-rc1/hudi-0.14.0-rc1.src.tgz.sha512 Fri Aug 18 09:03:38 
2023
@@ -0,0 +1 @@
+5daf928f4b11306e63a6d4f5b2402d9646f602acee69002e8da97142aec903f8ccd624fb7234c039ec0584dd685a41e46341d20a6fd8a6b2f2df4cdd9ad5ffa3
  hudi-0.14.0-rc1.src.tgz




[hudi] annotated tag release-0.14.0-rc1 updated (9bc6a28010c -> 4d225a6e598)

2023-08-18 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a change to annotated tag release-0.14.0-rc1
in repository https://gitbox.apache.org/repos/asf/hudi.git


*** WARNING: tag release-0.14.0-rc1 was modified! ***

from 9bc6a28010c (commit)
  to 4d225a6e598 (tag)
 tagging 9bc6a28010c3fde4ef27312c3c14580caca703fa (commit)
 replaces hoodie-0.4.7
  by Prashant Wason
  on Fri Aug 18 01:58:05 2023 -0700

- Log -
0.14.0
-BEGIN PGP SIGNATURE-

iQIzBAABCAAdFiEEdcV0Tp5c1cSOGcCCxNhY1zudsbgFAmTfMp0ACgkQxNhY1zud
sbhu3g/9FsJkVyobNDVAbrwGjwPsEB8o24vBFOdqsooK6H2ae6sbAeGfn0rvMQkV
fuJOXfHctfK0g1VO+wyWf8Pf0Dca1wopGGrVMmwxl70NtRj9nHQ0o9sqEfylYWzq
Elfjxfp3OsYFNG+Mj5Sb+nb6gQEvyDFugch1+tSuLMBaDVk8iSDB7/UJfYAqnWtL
7XSdJ4QtzPQKaNCEMlvFRsICJM33jNRcxegDC9IniE6ANE8VYS0AIepNzU8FO7jJ
Fer9rYZJ2U9IA9KQPczm0YnrBH4eS3FoiZRjsHK2xUJm1w5Ib46Koknqv9NRQ1qE
MmkcwMWoNZABzA5gCgcV1BgeBbvBTBk+9GgGQrxmZLsLHsrHbCx+oM/w1n91hCEa
Tj3r5X9Psj7Y7IuB+wZS7zi90C5QFHfSMCOlZG1nNPTreHJT4y92MOVNSxfuw4ex
6If2pZcqu+MOQkT6ig9s7HaObFpOCF8GFPN0S38R6dvTuID4VBFctEmhNePW2JDB
XahQd/BVfXxZONxDw3cPSlcqrP8DsLLQbnEB3sbL3lX2pG9uSeDRwXJXFPGCsi7W
Uz4PLKjRQ7X+JKIcnUOx0KdPELNObAh5+b+UkgpSW2/sWOTfbQb8O1rzYuM89XHz
OSOnPSnfCC/5Tj71g4VAvkmA6UBAFxtVokNhwRQ9GM9J7Ilm6u4=
=SEhD
-END PGP SIGNATURE-
---


No new revisions were added by this update.

Summary of changes:



[hudi] 10/29: [HUDI-6663] New Parquet File Format remove broadcast to fix performance issue for complex file slices (#9409)

2023-08-18 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit b27b1f688aad236598c546c55062b4f69d973ad0
Author: Jon Vexler 
AuthorDate: Fri Aug 11 02:50:10 2023 -0700

[HUDI-6663] New Parquet File Format remove broadcast to fix performance 
issue for complex file slices (#9409)
---
 .../src/main/scala/org/apache/hudi/HoodieFileIndex.scala   | 10 +-
 .../org/apache/hudi/NewHoodieParquetFileFormatUtils.scala  |  2 +-
 .../main/scala/org/apache/hudi/PartitionFileSliceMapping.scala |  7 +++
 .../datasources/parquet/NewHoodieParquetFileFormat.scala   |  8 
 4 files changed, 13 insertions(+), 14 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
index 1193b75bfdf..8a7c06b1d15 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
@@ -104,7 +104,7 @@ case class HoodieFileIndex(spark: SparkSession,
 
   override def rootPaths: Seq[Path] = getQueryPaths.asScala
 
-  var shouldBroadcast: Boolean = false
+  var shouldEmbedFileSlices: Boolean = false
 
   /**
* Returns the FileStatus for all the base files (excluding log files). This 
should be used only for
@@ -148,7 +148,7 @@ case class HoodieFileIndex(spark: SparkSession,
   override def listFiles(partitionFilters: Seq[Expression], dataFilters: 
Seq[Expression]): Seq[PartitionDirectory] = {
 val prunedPartitionsAndFilteredFileSlices = filterFileSlices(dataFilters, 
partitionFilters).map {
   case (partitionOpt, fileSlices) =>
-if (shouldBroadcast) {
+if (shouldEmbedFileSlices) {
   val baseFileStatusesAndLogFileOnly: Seq[FileStatus] = 
fileSlices.map(slice => {
 if (slice.getBaseFile.isPresent) {
   slice.getBaseFile.get().getFileStatus
@@ -162,7 +162,7 @@ case class HoodieFileIndex(spark: SparkSession,
 || (f.getBaseFile.isPresent && 
f.getBaseFile.get().getBootstrapBaseFile.isPresent)).
 foldLeft(Map[String, FileSlice]()) { (m, f) => m + (f.getFileId -> 
f) }
   if (c.nonEmpty) {
-PartitionDirectory(new 
PartitionFileSliceMapping(InternalRow.fromSeq(partitionOpt.get.values), 
spark.sparkContext.broadcast(c)), baseFileStatusesAndLogFileOnly)
+PartitionDirectory(new 
PartitionFileSliceMapping(InternalRow.fromSeq(partitionOpt.get.values), c), 
baseFileStatusesAndLogFileOnly)
   } else {
 PartitionDirectory(InternalRow.fromSeq(partitionOpt.get.values), 
baseFileStatusesAndLogFileOnly)
   }
@@ -187,7 +187,7 @@ case class HoodieFileIndex(spark: SparkSession,
 
 if (shouldReadAsPartitionedTable()) {
   prunedPartitionsAndFilteredFileSlices
-} else if (shouldBroadcast) {
+} else if (shouldEmbedFileSlices) {
   assert(partitionSchema.isEmpty)
   prunedPartitionsAndFilteredFileSlices
 }else {
@@ -274,7 +274,7 @@ case class HoodieFileIndex(spark: SparkSession,
 // Prune the partition path by the partition filters
 // NOTE: Non-partitioned tables are assumed to consist from a single 
partition
 //   encompassing the whole table
-val prunedPartitions = if (shouldBroadcast) {
+val prunedPartitions = if (shouldEmbedFileSlices) {
   
listMatchingPartitionPaths(convertFilterForTimestampKeyGenerator(metaClient, 
partitionFilters))
 } else {
   listMatchingPartitionPaths(partitionFilters)
diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/NewHoodieParquetFileFormatUtils.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/NewHoodieParquetFileFormatUtils.scala
index 5dd85c973b6..34214be1bd2 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/NewHoodieParquetFileFormatUtils.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/NewHoodieParquetFileFormatUtils.scala
@@ -198,7 +198,7 @@ class NewHoodieParquetFileFormatUtils(val sqlContext: 
SQLContext,
 } else {
   Seq.empty
 }
-fileIndex.shouldBroadcast = true
+fileIndex.shouldEmbedFileSlices = true
 HadoopFsRelation(
   location = fileIndex,
   partitionSchema = fileIndex.partitionSchema,
diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/PartitionFileSliceMapping.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/PartitionFileSliceMapping.scala
index c9468e2d601..1e639f0daab 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/PartitionFileSliceMapping.scala
+++ 
b/hudi-spark-datasource/hudi-s

[hudi] 02/29: [MINOR] Infer the preCombine field only if the value is not null (#9447)

2023-08-18 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 3db4745a23d1c9df46881d40852824352089e477
Author: Danny Chan 
AuthorDate: Tue Aug 15 17:03:06 2023 +0800

[MINOR] Infer the preCombine field only if the value is not null (#9447)

A table created by Spark may not have the preCombine field set up.
---
 .../hudi-flink/src/main/java/org/apache/hudi/util/CompactionUtil.java | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/CompactionUtil.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/CompactionUtil.java
index 63a00dd10c3..d14262f02e0 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/CompactionUtil.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/CompactionUtil.java
@@ -128,7 +128,9 @@ public class CompactionUtil {
*/
   public static void setPreCombineField(Configuration conf, 
HoodieTableMetaClient metaClient) {
 String preCombineField = metaClient.getTableConfig().getPreCombineField();
-conf.setString(FlinkOptions.PRECOMBINE_FIELD, preCombineField);
+if (preCombineField != null) {
+  conf.setString(FlinkOptions.PRECOMBINE_FIELD, preCombineField);
+}
   }
 
   /**



[hudi] 03/29: [HUDI-5361] Propagate all hoodie configs from spark sqlconf (#8327)

2023-08-18 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 510ff1753a4dd1c34628d022577ffd33267c95cc
Author: Jon Vexler 
AuthorDate: Thu Aug 10 09:46:33 2023 -0700

[HUDI-5361] Propagate all hoodie configs from spark sqlconf (#8327)
---
 .../src/main/scala/org/apache/hudi/DefaultSource.scala | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala
index 5ecf250eaab..5a0b0a53d33 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala
@@ -102,8 +102,7 @@ class DefaultSource extends RelationProvider
   )
 } else {
   Map()
-}) ++ DataSourceOptionsHelper.parametersWithReadDefaults(optParams +
-  (DATA_QUERIES_ONLY.key() -> sqlContext.getConf(DATA_QUERIES_ONLY.key(), 
optParams.getOrElse(DATA_QUERIES_ONLY.key(), 
DATA_QUERIES_ONLY.defaultValue()
+}) ++ 
DataSourceOptionsHelper.parametersWithReadDefaults(sqlContext.getAllConfs.filter(k
 => k._1.startsWith("hoodie.")) ++ optParams)
 
 // Get the table base path
 val tablePath = if (globPaths.nonEmpty) {
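
Illustrative usage, as a sketch rather than part of the patch: with this change, every "hoodie."-prefixed entry in the Spark session's SQL conf is merged into the read parameters, so a session-level setting reaches the Hudi relation without being repeated on each read (the table path below is hypothetical):

    // Scala sketch
    spark.sql("set hoodie.metadata.enable=false")               // session-level Hudi config
    val df = spark.read.format("hudi").load("/tmp/hudi_trips")  // picks up the conf above
    // Explicit .option(...) values still take precedence, because optParams is
    // applied after the session confs in the merge shown in the diff.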



[hudi] 07/29: [HUDI-6680] Fixing the info log to fetch column value by name instead of index (#9421)

2023-08-18 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 0dca5aaceb3a1992f048232199698c75ff7d7678
Author: lokesh-lingarajan-0310 
<84048984+lokesh-lingarajan-0...@users.noreply.github.com>
AuthorDate: Thu Aug 10 19:55:23 2023 -0700

[HUDI-6680] Fixing the info log to fetch column value by name instead of 
index (#9421)

Co-authored-by: Lokesh Lingarajan 

---
 .../org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java
index 6b10e4cbef0..19383933bd9 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java
@@ -217,7 +217,7 @@ public class IncrSourceHelper {
   row = collectedRows.select(queryInfo.getOrderColumn(), 
queryInfo.getKeyColumn(), CUMULATIVE_COLUMN_NAME).orderBy(
   col(queryInfo.getOrderColumn()).desc(), 
col(queryInfo.getKeyColumn()).desc()).first();
 }
-LOG.info("Processed batch size: " + row.getLong(2) + " bytes");
+LOG.info("Processed batch size: " + 
row.get(row.fieldIndex(CUMULATIVE_COLUMN_NAME)) + " bytes");
 sourceData.unpersist();
 return Pair.of(new CloudObjectIncrCheckpoint(row.getString(0), 
row.getString(1)), collectedRows);
   }



[hudi] branch release-0.14.0 updated (d32bdbd8240 -> 9bc6a28010c)

2023-08-18 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a change to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git


from d32bdbd8240 [MINOR] Fix consistent hashing bucket index FT failure (#9401)
 new 8f07023948e [HUDI-6675] Fix Clean action will delete the whole table (#9413)
 new 3db4745a23d [MINOR] Infer the preCombine field only if the value is not null (#9447)
 new 510ff1753a4 [HUDI-5361] Propagate all hoodie configs from spark sqlconf (#8327)
 new 89b8ae02bf4 [HUDI-6679] Fix initialization of metadata table partitions upon failure (#9419)
 new b8d0424c2c8 [MINOR] asyncService log prompt incomplete (#9407)
 new 81a458aa33c [MINOR] Increase CI timeout for UT FT other modules to 4 hours (#9423)
 new 0dca5aaceb3 [HUDI-6680] Fixing the info log to fetch column value by name instead of index (#9421)
 new e2a78d3fb43 [MINOR] Unify class name of Spark Procedure (#9414)
 new d70c15f4041 [HUDI-6670] Fix timeline check in metadata table validator (#9405)
 new b27b1f688aa [HUDI-6663] New Parquet File Format remove broadcast to fix performance issue for complex file slices (#9409)
 new 612d02b35a0 [HUDI-6553] Speedup column stats and bloom index creation on large datasets. (#9223)
 new b335d00a22b [HUDI-6674] Add rollback info from metadata table in timeline commands (#9411)
 new c7f0e6902fa [HUDI-6690] Generate test jars for hudi-utilities and hudi-hive-sync modules (#9297)
 new 529fc04488b Duplicate switch branch in HoodieInputFormatUtils (#9438)
 new 1726b828578 [HUDI-6214] Enabling compaction by default for batch writes with MOR table (#8718)
 new 6b848f028ec [HUDI-6676] Add command for CreateHoodieTableLike (#9412)
 new 97f21f85e95 [HUDI-6683] Added kafka key as part of hudi metadata columns for Json & Avro KafkaSource (#9403)
 new d6358a9d602 [HUDI-6694] Fix log file CLI around command blocks (#9445)
 new b10f52d85d3 [HUDI-6689] Add record index validation in MDT validator (#9437)
 new b8dc3a58220 Handling empty commits after s3 applyFilter api (#9433)
 new a58ff06f20e [HUDI-6688] Fix partition validation to only consider commits in metadata table validator (#9436)
 new da699fea98d [HUDI-6553][FOLLOW-UP] Introduces Tuple3 for HoodieTableMetadataUtil (#9449)
 new 2c9024e4fad [HUDI-6673] Fix Incremental Query Syntax - Spark SQL Core Flow Test (#9410)
 new 77bf4357ed7 [HUDI-6683][FOLLOW-UP] Rename kafka record value variable in JsonKafkaSource and replace casting to String by calling toString (#9451)
 new 2538f544507 [HUDI-6359] Spark offline compaction/clustering will never rollback when both requested and inflight states exist (#8944)
 new 90e3378207d [HUDI-6704] Fix Flink metadata table update (#9456)
 new 20b4438377b [MINOR] Fix sql core flow test (#9461)
 new 6ffd4d5705a [MINOR] Fix meta client instantiation and some incorrect configs (#9463)
 new 9bc6a28010c [MINOR] Fix build on master (#9452)

The 29 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 azure-pipelines-20230430.yml   |   2 +-
 .../test-suite/multi-writer-local-3.properties |   4 +-
 .../config/test-suite/test-clustering.properties   |   4 +-
 ...essive-clean-archival-inline-compact.properties |   4 +-
 .../apache/hudi/cli/HoodieTableHeaderFields.java   |   1 -
 .../hudi/cli/commands/HoodieLogFileCommand.java|  70 --
 .../apache/hudi/cli/commands/TimelineCommand.java  |  99 
 .../cli/commands/TestHoodieLogFileCommand.java |  33 ++-
 .../org/apache/hudi/async/HoodieAsyncService.java  |   4 +-
 .../hudi/client/BaseHoodieTableServiceClient.java  |  19 +-
 .../apache/hudi/client/BaseHoodieWriteClient.java  |  29 +--
 .../metadata/HoodieBackedTableMetadataWriter.java  |  13 +-
 .../java/org/apache/hudi/table/HoodieTable.java|  22 --
 .../table/action/clean/CleanActionExecutor.java|  10 +-
 .../hudi/client/HoodieFlinkTableServiceClient.java |  13 +-
 .../apache/hudi/client/HoodieFlinkWriteClient.java |   5 -
 .../spark/sql/HoodieCatalystPlansUtils.scala   |   7 +
 .../org/apache/spark/sql/hudi/SparkAdapter.scala   |   8 +-
 .../functional/TestHoodieBackedMetadata.java   | 123 -
 .../java/org/apache/hudi/table/TestCleaner.java|  51 
 .../apache/hudi/common/util/collection/Tuple3.java |  71 ++
 .../hudi/metadata/HoodieMetadataPayload.java   |  19 +-
 .../hudi/metadata/HoodieTableMetadataUtil.java | 228 ++---
 .../hudi/source/stats/ColumnStatsIndices.java  |  17 +-
 .../java/org/apache/hudi/util/CompactionUtil.java  |   4 +-
 .../hudi/hadoop/utils/HoodieInputFormatUtils.java  |   2 -
 .../hudi/integ/testsuite/HoodieTestSuiteJob.

[hudi] 19/29: [HUDI-6689] Add record index validation in MDT validator (#9437)

2023-08-18 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit b10f52d85d3aac562141e92a01749dad7ada5e7e
Author: Y Ethan Guo 
AuthorDate: Tue Aug 15 09:40:43 2023 -0700

[HUDI-6689] Add record index validation in MDT validator (#9437)

This PR adds the validation of record index in MDT validator 
(`HoodieMetadataTableValidator`).  The following validation modes are added:
- Record index count validation (with CLI config 
`--validate-record-index-count`): validate the number of entries in the record 
index, which should be equal to the number of record keys in the latest 
snapshot of the table.
- Record index content validation (with CLI config 
`--validate-record-index-content`): validate the content of the record index so 
that each record key should have the correct location, and there is no 
additional or missing entry.  Two more configs are added for this mode: (1) 
`--num-record-index-error-samples`: number of error samples to show for record 
index validation when there are mismatches, (2) `--record-index-parallelism`: 
parallelism for joining record index entries with data [...]
---
 .../hudi/metadata/HoodieMetadataPayload.java   |  19 +-
 .../hudi/metadata/HoodieTableMetadataUtil.java |  71 +-
 .../utilities/HoodieMetadataTableValidator.java| 272 +++--
 3 files changed, 319 insertions(+), 43 deletions(-)
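
The flags described above are plain CLI options of the validator. A hypothetical invocation, shown only as a sketch (the bundle jar name, the --base-path flag, and the sample values are assumptions, not taken from this commit):

    spark-submit \
      --class org.apache.hudi.utilities.HoodieMetadataTableValidator \
      <path-to-hudi-utilities-bundle.jar> \
      --base-path <table-base-path> \
      --validate-record-index-count \
      --validate-record-index-content \
      --num-record-index-error-samples 20 \
      --record-index-parallelism 100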

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java 
b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java
index 8d5114a76bc..04ffc98e840 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java
@@ -158,7 +158,7 @@ public class HoodieMetadataPayload implements 
HoodieRecordPayload> 
convertMissingPartitionRecords(HoodieEngineContext engineContext,
-
List deletedPartitions, Map> filesAdded,
-
Map> filesDeleted, String instantTime) {
+   
 List deletedPartitions, Map> filesAdded,
+   
 Map> filesDeleted, String 
instantTime) {
 List records = new LinkedList<>();
 int[] fileDeleteCount = {0};
 int[] filesAddedCount = {0};
@@ -1069,8 +1073,8 @@ public class HoodieTableMetadataUtil {
   }
 
   private static Stream 
translateWriteStatToColumnStats(HoodieWriteStat writeStat,
- 
HoodieTableMetaClient datasetMetaClient,
- 
List columnsToIndex) {
+  
HoodieTableMetaClient datasetMetaClient,
+  
List columnsToIndex) {
 if (writeStat instanceof HoodieDeltaWriteStat && ((HoodieDeltaWriteStat) 
writeStat).getColumnStats().isPresent()) {
   Map> columnRangeMap = 
((HoodieDeltaWriteStat) writeStat).getColumnStats().get();
   Collection> 
columnRangeMetadataList = columnRangeMap.values();
@@ -1332,7 +1336,7 @@ public class HoodieTableMetadataUtil {
*/
   public static boolean isIndexingCommit(String instantTime) {
 return instantTime.length() == MILLIS_INSTANT_ID_LENGTH + 
OperationSuffix.METADATA_INDEXER.getSuffix().length()
-&& 
instantTime.endsWith(OperationSuffix.METADATA_INDEXER.getSuffix());
+&& instantTime.endsWith(OperationSuffix.METADATA_INDEXER.getSuffix());
   }
 
   /**
@@ -1457,7 +1461,7 @@ public class HoodieTableMetadataUtil {
 
 if (backup) {
   final Path metadataPartitionBackupPath = new 
Path(metadataTablePartitionPath.getParent().getParent(),
-  String.format(".metadata_%s_%s", 
partitionType.getPartitionPath(), HoodieActiveTimeline.createNewInstantTime()));
+  String.format(".metadata_%s_%s", partitionType.getPartitionPath(), 
HoodieActiveTimeline.createNewInstantTime()));
   LOG.info(String.format("Backing up MDT partition %s to %s before 
deletion", partitionType, metadataPartitionBackupPath));
   try {
 if (fs.rename(metadataTablePartitionPath, 
metadataPartitionBackupPath)) {
@@ -1586,7 +1590,7 @@ public class HoodieTableMetadataUtil {
* @return The estimated number of file groups.
*/
   public static int estimateFileGroupCount(MetadataPartitionType 
partitionType, long recordCount, int averageRecordSize, int minFileGroupC

[hudi] 06/29: [MINOR] Increase CI timeout for UT FT other modules to 4 hours (#9423)

2023-08-18 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 81a458aa33c112be9dd24f9cde2913cb40dd7bac
Author: Sagar Sumit 
AuthorDate: Fri Aug 11 08:12:38 2023 +0530

[MINOR] Increase CI timeout for UT FT other modules to 4 hours (#9423)
---
 azure-pipelines-20230430.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/azure-pipelines-20230430.yml b/azure-pipelines-20230430.yml
index 75c231b74dc..2da5ab0d4f9 100644
--- a/azure-pipelines-20230430.yml
+++ b/azure-pipelines-20230430.yml
@@ -188,7 +188,7 @@ stages:
 displayName: Top 100 long-running testcases
   - job: UT_FT_4
 displayName: UT FT other modules
-timeoutInMinutes: '180'
+timeoutInMinutes: '240'
 steps:
   - task: Maven@4
 displayName: maven install



[hudi] 27/29: [MINOR] Fix sql core flow test (#9461)

2023-08-18 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 20b4438377ba4421d8c161a67ea72874b46daf72
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Thu Aug 17 01:30:29 2023 -0500

[MINOR] Fix sql core flow test (#9461)
---
 .../scala/org/apache/hudi/functional/TestSparkSqlCoreFlow.scala | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSparkSqlCoreFlow.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSparkSqlCoreFlow.scala
index daf10956b69..7510204bac4 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSparkSqlCoreFlow.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSparkSqlCoreFlow.scala
@@ -19,7 +19,7 @@
 
 package org.apache.hudi.functional
 
-import org.apache.hudi.DataSourceReadOptions.{QUERY_TYPE_INCREMENTAL_OPT_VAL, 
QUERY_TYPE_READ_OPTIMIZED_OPT_VAL}
+import 
org.apache.hudi.DataSourceReadOptions.{QUERY_TYPE_READ_OPTIMIZED_OPT_VAL, 
QUERY_TYPE_SNAPSHOT_OPT_VAL}
 import org.apache.hudi.HoodieDataSourceHelpers.{hasNewCommits, latestCommit, 
listCommitsSince}
 import org.apache.hudi.common.config.HoodieMetadataConfig
 import org.apache.hudi.common.fs.FSUtils
@@ -185,8 +185,8 @@ class TestSparkSqlCoreFlow extends HoodieSparkSqlTestBase {
 
   def doSnapshotRead(tableName: String, isMetadataEnabledOnRead: Boolean): 
sql.DataFrame = {
 try {
-  spark.sql("set hoodie.datasource.query.type=\"snapshot\"")
-  spark.sql(s"set 
hoodie.metadata.enable=${String.valueOf(isMetadataEnabledOnRead)}")
+  spark.sql(s"set 
hoodie.datasource.query.type=$QUERY_TYPE_SNAPSHOT_OPT_VAL")
+  spark.sql(s"set hoodie.metadata.enable=$isMetadataEnabledOnRead")
   spark.sql(s"select * from $tableName")
 } finally {
   spark.conf.unset("hoodie.datasource.query.type")



[hudi] 23/29: [HUDI-6673] Fix Incremental Query Syntax - Spark SQL Core Flow Test (#9410)

2023-08-18 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 2c9024e4fad3254424874889aaffb9523d310423
Author: Jon Vexler 
AuthorDate: Tue Aug 15 12:15:07 2023 -0700

[HUDI-6673] Fix Incremental Query Syntax - Spark SQL Core Flow Test (#9410)

Co-authored-by: Jonathan Vexler <=>
---
 .../test/scala/org/apache/hudi/functional/TestSparkSqlCoreFlow.scala  | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSparkSqlCoreFlow.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSparkSqlCoreFlow.scala
index fa883cd3eb2..daf10956b69 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSparkSqlCoreFlow.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSparkSqlCoreFlow.scala
@@ -125,7 +125,7 @@ class TestSparkSqlCoreFlow extends HoodieSparkSqlTestBase {
 // we have 2 commits, try pulling the first commit (which is not the 
latest)
 //HUDI-5266
 val firstCommit = listCommitsSince(fs, tableBasePath, "000").get(0)
-val hoodieIncViewDf1 = spark.sql(s"select * from 
hudi_table_changes('$tableName', 'earliest', '$firstCommit')")
+val hoodieIncViewDf1 = spark.sql(s"select * from 
hudi_table_changes('$tableName', 'latest_state', 'earliest', '$firstCommit')")
 
 assertEquals(100, hoodieIncViewDf1.count()) // 100 initial inserts must be 
pulled
 var countsPerCommit = 
hoodieIncViewDf1.groupBy("_hoodie_commit_time").count().collect()
@@ -137,7 +137,7 @@ class TestSparkSqlCoreFlow extends HoodieSparkSqlTestBase {
 
 //another incremental query with commit2 and commit3
 //HUDI-5266
-val hoodieIncViewDf2 = spark.sql(s"select * from 
hudi_table_changes('$tableName', '$commitInstantTime2', '$commitInstantTime3')")
+val hoodieIncViewDf2 = spark.sql(s"select * from 
hudi_table_changes('$tableName', 'latest_state', '$commitInstantTime2', 
'$commitInstantTime3')")
 
 assertEquals(uniqueKeyCnt2, hoodieIncViewDf2.count()) // 60 records must 
be pulled
 countsPerCommit = 
hoodieIncViewDf2.groupBy("_hoodie_commit_time").count().collect()



[hudi] 18/29: [HUDI-6694] Fix log file CLI around command blocks (#9445)

2023-08-18 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit d6358a9d602d4e62caf81a08b9f644f8e606088b
Author: Y Ethan Guo 
AuthorDate: Tue Aug 15 09:38:59 2023 -0700

[HUDI-6694] Fix log file CLI around command blocks (#9445)

This commit fixes the log file CLI commands when the log file contains 
command blocks like rollback commands. The commit also adds the "File Path" 
column to the output for show logfile metadata CLI so it's easier to see the 
corresponding file path.
---
 .../hudi/cli/commands/HoodieLogFileCommand.java| 70 +++---
 .../cli/commands/TestHoodieLogFileCommand.java | 33 +++---
 2 files changed, 75 insertions(+), 28 deletions(-)

diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java
index cf36a704c7d..9a510bd466a 100644
--- 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java
+++ 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java
@@ -51,6 +51,7 @@ import org.apache.hadoop.fs.FileStatus;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.parquet.avro.AvroSchemaConverter;
+import org.apache.parquet.schema.MessageType;
 import org.springframework.shell.standard.ShellComponent;
 import org.springframework.shell.standard.ShellMethod;
 import org.springframework.shell.standard.ShellOption;
@@ -91,15 +92,27 @@ public class HoodieLogFileCommand {
 FileSystem fs = HoodieCLI.getTableMetaClient().getFs();
 List logFilePaths = FSUtils.getGlobStatusExcludingMetaFolder(fs, 
new Path(logFilePathPattern)).stream()
 .map(status -> 
status.getPath().toString()).collect(Collectors.toList());
-Map, Map>, Integer>>> commitCountAndMetadata =
+Map, 
Tuple2,
+Map>, Integer>>> commitCountAndMetadata =
 new HashMap<>();
 int numCorruptBlocks = 0;
 int dummyInstantTimeCount = 0;
+String basePath = 
HoodieCLI.getTableMetaClient().getBasePathV2().toString();
 
 for (String logFilePath : logFilePaths) {
-  FileStatus[] fsStatus = fs.listStatus(new Path(logFilePath));
-  Schema writerSchema = new AvroSchemaConverter()
-  
.convert(Objects.requireNonNull(TableSchemaResolver.readSchemaFromLogFile(fs, 
new Path(logFilePath;
+  Path path = new Path(logFilePath);
+  String pathString = path.toString();
+  String fileName;
+  if (pathString.contains(basePath)) {
+String[] split = pathString.split(basePath);
+fileName = split[split.length - 1];
+  } else {
+fileName = path.getName();
+  }
+  FileStatus[] fsStatus = fs.listStatus(path);
+  MessageType schema = TableSchemaResolver.readSchemaFromLogFile(fs, path);
+  Schema writerSchema = schema != null
+  ? new AvroSchemaConverter().convert(Objects.requireNonNull(schema)) 
: null;
   Reader reader = HoodieLogFormat.newReader(fs, new 
HoodieLogFile(fsStatus[0].getPath()), writerSchema);
 
   // read the avro blocks
@@ -133,12 +146,15 @@ public class HoodieLogFileCommand {
 }
 if (commitCountAndMetadata.containsKey(instantTime)) {
   commitCountAndMetadata.get(instantTime).add(
-  new Tuple3<>(n.getBlockType(), new 
Tuple2<>(n.getLogBlockHeader(), n.getLogBlockFooter()), recordCount.get()));
+  new Tuple3<>(new Tuple2<>(fileName, n.getBlockType()),
+  new Tuple2<>(n.getLogBlockHeader(), n.getLogBlockFooter()), 
recordCount.get()));
 } else {
-  List, Map>, Integer>> list =
+  List, 
Tuple2,
+  Map>, Integer>> list =
   new ArrayList<>();
   list.add(
-  new Tuple3<>(n.getBlockType(), new 
Tuple2<>(n.getLogBlockHeader(), n.getLogBlockFooter()), recordCount.get()));
+  new Tuple3<>(new Tuple2<>(fileName, n.getBlockType()),
+  new Tuple2<>(n.getLogBlockHeader(), n.getLogBlockFooter()), 
recordCount.get()));
   commitCountAndMetadata.put(instantTime, list);
 }
   }
@@ -146,22 +162,27 @@ public class HoodieLogFileCommand {
 }
 List rows = new ArrayList<>();
 ObjectMapper objectMapper = new ObjectMapper();
-for (Map.Entry, Map>, 
Integer>>> entry : commitCountAndMetadata
+for (Map.Entry, 
Tuple2,
+Map>, Integer>>> entry : 
commitCountAndMetadata
 .entrySet()) {
   String instantTime = entry.getKey();
-  for (Tuple3, 
Map>, Integer> tuple3 : entry
+  for (Tuple3, 
Tuple2,
+  Map>, Integer> tuple3 : entry
   .getValue()) {
-Co

[hudi] 29/29: [MINOR] Fix build on master (#9452)

2023-08-18 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 9bc6a28010c3fde4ef27312c3c14580caca703fa
Author: Y Ethan Guo 
AuthorDate: Tue Aug 15 13:05:16 2023 -0700

[MINOR] Fix build on master (#9452)
---
 .../src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java  | 1 -
 1 file changed, 1 deletion(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
 
b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
index a957ee8f8a8..861f8fc8ddd 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
@@ -71,7 +71,6 @@ import org.apache.avro.generic.IndexedRecord;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
-import org.apache.hadoop.conf.Configuration;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 



[hudi] 28/29: [MINOR] Fix meta client instantiation and some incorrect configs (#9463)

2023-08-18 Thread pwason
This is an automated email from the ASF dual-hosted git repository.

pwason pushed a commit to branch release-0.14.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 6ffd4d5705a6b6dc3251050dc3c7f652e0ce7a20
Author: Jon Vexler 
AuthorDate: Thu Aug 17 04:30:08 2023 -0400

[MINOR] Fix meta client instantiation and some incorrect configs (#9463)

Co-authored-by: Jonathan Vexler <=>
---
 docker/demo/config/test-suite/multi-writer-local-3.properties | 4 ++--
 docker/demo/config/test-suite/test-clustering.properties  | 4 ++--
 .../test-metadata-aggressive-clean-archival-inline-compact.properties | 4 ++--
 .../main/java/org/apache/hudi/integ/testsuite/HoodieTestSuiteJob.java | 2 ++
 4 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/docker/demo/config/test-suite/multi-writer-local-3.properties 
b/docker/demo/config/test-suite/multi-writer-local-3.properties
index 2da3880803a..c937bf76a7f 100644
--- a/docker/demo/config/test-suite/multi-writer-local-3.properties
+++ b/docker/demo/config/test-suite/multi-writer-local-3.properties
@@ -36,8 +36,8 @@ 
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLock
 hoodie.streamer.source.dfs.root=/tmp/hudi/input3
 hoodie.streamer.schemaprovider.target.schema.file=file:/tmp/source.avsc
 hoodie.streamer.schemaprovider.source.schema.file=file:/tmp/source.avsc
-hoodie.streamer.keygen.timebased.timestamp.type=UNIX_TIMESTAMP
-hoodie.streamer.keygen.timebased.output.dateformat=/MM/dd
+hoodie.keygen.timebased.timestamp.type=UNIX_TIMESTAMP
+hoodie.keygen.timebased.output.dateformat=/MM/dd
 hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hiveserver:1/
 hoodie.datasource.hive_sync.database=testdb
 hoodie.datasource.hive_sync.table=table1
diff --git a/docker/demo/config/test-suite/test-clustering.properties 
b/docker/demo/config/test-suite/test-clustering.properties
index a266cc13fa8..68c347edc20 100644
--- a/docker/demo/config/test-suite/test-clustering.properties
+++ b/docker/demo/config/test-suite/test-clustering.properties
@@ -38,8 +38,8 @@ 
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run
 
hoodie.streamer.source.dfs.root=/user/hive/warehouse/hudi-integ-test-suite/input
 
hoodie.streamer.schemaprovider.target.schema.file=file:/var/hoodie/ws/docker/demo/config/test-suite/source.avsc
 
hoodie.streamer.schemaprovider.source.schema.file=file:/var/hoodie/ws/docker/demo/config/test-suite/source.avsc
-hoodie.streamer.keygen.timebased.timestamp.type=UNIX_TIMESTAMP
-hoodie.streamer.keygen.timebased.output.dateformat=/MM/dd
+hoodie.keygen.timebased.timestamp.type=UNIX_TIMESTAMP
+hoodie.keygen.timebased.output.dateformat=/MM/dd
 hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hiveserver:1/
 hoodie.datasource.hive_sync.database=testdb
 hoodie.datasource.hive_sync.table=table1
diff --git 
a/docker/demo/config/test-suite/test-metadata-aggressive-clean-archival-inline-compact.properties
 
b/docker/demo/config/test-suite/test-metadata-aggressive-clean-archival-inline-compact.properties
index 7001ac484ab..ea509a69fc7 100644
--- 
a/docker/demo/config/test-suite/test-metadata-aggressive-clean-archival-inline-compact.properties
+++ 
b/docker/demo/config/test-suite/test-metadata-aggressive-clean-archival-inline-compact.properties
@@ -38,8 +38,8 @@ hoodie.datasource.write.partitionpath.field=timestamp
 
hoodie.streamer.source.dfs.root=/user/hive/warehouse/hudi-integ-test-suite/input
 
hoodie.streamer.schemaprovider.target.schema.file=file:/var/hoodie/ws/docker/demo/config/test-suite/source.avsc
 
hoodie.streamer.schemaprovider.source.schema.file=file:/var/hoodie/ws/docker/demo/config/test-suite/source.avsc
-hoodie.streamer.keygen.timebased.timestamp.type=UNIX_TIMESTAMP
-hoodie.streamer.keygen.timebased.output.dateformat=/MM/dd
+hoodie.keygen.timebased.timestamp.type=UNIX_TIMESTAMP
+hoodie.keygen.timebased.output.dateformat=/MM/dd
 hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hiveserver:1/
 hoodie.datasource.hive_sync.database=testdb
 hoodie.datasource.hive_sync.table=table1
diff --git 
a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieTestSuiteJob.java
 
b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieTestSuiteJob.java
index 8ef2232bdc0..d50915d26e2 100644
--- 
a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieTestSuiteJob.java
+++ 
b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieTestSuiteJob.java
@@ -18,6 +18,7 @@
 
 package org.apache.hudi.integ.testsuite;
 
+import org.apache.hudi.DataSourceWriteOptions;
 import org.apache.hudi.common.config.HoodieMetadataConfig;
 import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.fs.FSUtils;
@@ -120,6 +121,7 @@ public class HoodieTestSuiteJob {
   metaClient = HoodieTableMetaClient.withPropertyBuilder()
   .setTableType(cfg.tableType)
   .setTableName(cfg.targetTab
