Re: [PR] [HUDI-7704] Unify test client storage classes with duplicate code [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #11152:
URL: https://github.com/apache/hudi/pull/11152#issuecomment-2111712497

   
   ## CI report:
   
   * c468c73abe81ebc9a50506d404f366b4eec6c091 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23941)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7526] Fix constructors for bulkinsert sort partitioners to ensure we could use it as user defined partitioners [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #10942:
URL: https://github.com/apache/hudi/pull/10942#issuecomment-2111711975

   
   ## CI report:
   
   * 55fb13f601452d13f9ef2984e8ecef3ea95d7de6 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23943)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6563] Supports flink lookup join [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #9228:
URL: https://github.com/apache/hudi/pull/9228#issuecomment-2111710044

   
   ## CI report:
   
   * d1a5bab5d4b004463f9f985951ba313bd3ec5e3c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23942)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7526] Fix constructors for bulkinsert sort partitioners to ensure we could use it as user defined partitioners [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #10942:
URL: https://github.com/apache/hudi/pull/10942#issuecomment-2111635895

   
   ## CI report:
   
   * ea11f68c1778f9ec23eab6a887076e51f60caa0b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23060)
 
   * 55fb13f601452d13f9ef2984e8ecef3ea95d7de6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23943)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6563] Supports flink lookup join [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #9228:
URL: https://github.com/apache/hudi/pull/9228#issuecomment-2111634643

   
   ## CI report:
   
   * 5a71eed0818abf0296db312277ac11e8c38e3f76 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23940)
 
   * d1a5bab5d4b004463f9f985951ba313bd3ec5e3c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23942)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-7762) Optimizing Hudi Table Check with Delta Lake by Refining Class Name Checks In Spark3.5

2024-05-14 Thread Ma Jian (Jira)
Ma Jian created HUDI-7762:
-

 Summary: Optimizing Hudi Table Check with Delta Lake by Refining 
Class Name Checks In Spark3.5
 Key: HUDI-7762
 URL: https://issues.apache.org/jira/browse/HUDI-7762
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Ma Jian


In Hudi, the Spark3_5Adapter calls v2.v1Table, which in turn invokes logic inside 
Delta Lake. When executed against a Delta table, this may result in an error. The 
check that determines whether the operation targets a Hudi table is therefore 
changed to a class-name comparison, so Delta Lake executions no longer hit the error.
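
A minimal sketch of what a class-name-based check could look like (the helper name 
and the specific class names compared against are illustrative assumptions, not the 
actual adapter code):

```java
// Hedged sketch: decide whether a Spark V2 table is a Hudi table by inspecting the
// class name of the table object instead of calling v2.v1Table(), which may execute
// Delta Lake logic and fail on Delta tables. Names below are for illustration only.
public final class HoodieTableTypeCheck {

  private HoodieTableTypeCheck() {
  }

  /** Returns true if the given catalog table object looks like a Hudi table. */
  public static boolean isHoodieTable(Object v2Table) {
    // Compare by class name only; never call into the table implementation,
    // so a Delta table never triggers Delta-specific code paths here.
    String className = v2Table.getClass().getName();
    return className.startsWith("org.apache.hudi")
        || className.contains("HoodieInternalV2Table");
  }
}
```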



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7704] Unify test client storage classes with duplicate code [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #11152:
URL: https://github.com/apache/hudi/pull/11152#issuecomment-2111594080

   
   ## CI report:
   
   * 0b6a222addb8158cd6981f22bfd131f3fb176939 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23855)
 
   * c468c73abe81ebc9a50506d404f366b4eec6c091 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23941)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7526] Fix constructors for bulkinsert sort partitioners to ensure we could use it as user defined partitioners [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #10942:
URL: https://github.com/apache/hudi/pull/10942#issuecomment-2111593773

   
   ## CI report:
   
   * ea11f68c1778f9ec23eab6a887076e51f60caa0b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23060)
 
   * 55fb13f601452d13f9ef2984e8ecef3ea95d7de6 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6563] Supports flink lookup join [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #9228:
URL: https://github.com/apache/hudi/pull/9228#issuecomment-2111592447

   
   ## CI report:
   
   * 5a71eed0818abf0296db312277ac11e8c38e3f76 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23940)
 
   * d1a5bab5d4b004463f9f985951ba313bd3ec5e3c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7704] Unify test client storage classes with duplicate code [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #11152:
URL: https://github.com/apache/hudi/pull/11152#issuecomment-2111585341

   
   ## CI report:
   
   * 0b6a222addb8158cd6981f22bfd131f3fb176939 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23855)
 
   * c468c73abe81ebc9a50506d404f366b4eec6c091 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6563] Supports flink lookup join [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #9228:
URL: https://github.com/apache/hudi/pull/9228#issuecomment-2111527768

   
   ## CI report:
   
   * 5a71eed0818abf0296db312277ac11e8c38e3f76 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23940)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Partition TTL [hudi]

2024-05-14 Thread via GitHub


xicm commented on issue #11223:
URL: https://github.com/apache/hudi/issues/11223#issuecomment-2111505302

   FileNotFoundException
   
   PartitionTTLStrategy lists all partitions from the filesystem, while a clean 
action may be running in the TaskManager at the same time.
   
   The race is: a partition exists when PartitionTTLStrategy lists the partitions 
under the base path, but by the time it lists the file statuses inside that 
partition, the partition may already have been removed by the clean action running 
in the TaskManager, which results in a FileNotFoundException.
   
   Can we add a try/catch here?
   
   
https://github.com/apache/hudi/blob/5b0d67bc79852b16eb8de12e74c8087abba13bb3/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java#L174-L177
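
A minimal sketch of the kind of guard being proposed, assuming the Hadoop 
`FileSystem` listing API used in the 0.x code path (the class and method names here 
are simplified placeholders, not the actual FileSystemBackedTableMetadata code):

```java
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch: if a partition is deleted by a concurrent clean between listing
// the base path and listing the partition's contents, treat the partition as empty
// instead of failing the whole TTL run with a FileNotFoundException.
final class PartitionListingGuard {

  static FileStatus[] listPartitionSafely(FileSystem fs, Path partitionPath) throws IOException {
    try {
      return fs.listStatus(partitionPath);
    } catch (FileNotFoundException e) {
      // The partition existed when partitions were listed, but was cleaned meanwhile.
      return new FileStatus[0];
    }
  }
}
```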


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] Partition TTL [hudi]

2024-05-14 Thread via GitHub


xicm opened a new issue, #11223:
URL: https://github.com/apache/hudi/issues/11223

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   I cherry-picked the partition TTL feature to 0.14.1 and ran into two problems.
   
   1. FileNotFoundException
   ```
   org.apache.flink.util.FlinkException: Global failure triggered by 
OperatorCoordinator for 'bucket_write: default_database.flink_ttlb' (operator 
7ae18ef4aed1795ee7024c7b7a66a1f1).
at 
org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder$LazyInitializedCoordinatorContext.failJob(OperatorCoordinatorHolder.java:600)
at 
org.apache.hudi.sink.StreamWriteOperatorCoordinator.lambda$start$0(StreamWriteOperatorCoordinator.java:203)
at 
org.apache.hudi.sink.utils.NonThrownExecutor.handleException(NonThrownExecutor.java:142)
at 
org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:133)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.hudi.exception.HoodieException: Executor executes 
action [commits the instant 20240515082359670] error
... 6 more
   Caused by: org.apache.hudi.exception.HoodieException: Error fetching 
partition paths from metadata table
at 
org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:303)
at 
org.apache.hudi.table.action.ttl.strategy.PartitionTTLStrategy.getPartitionPathsForTTL(PartitionTTLStrategy.java:69)
at 
org.apache.hudi.table.action.ttl.strategy.KeepByTimeStrategy.getExpiredPartitionPaths(KeepByTimeStrategy.java:61)
at 
org.apache.hudi.table.action.commit.FlinkPartitionTTLActionExecutor.execute(FlinkPartitionTTLActionExecutor.java:60)
at 
org.apache.hudi.table.HoodieFlinkCopyOnWriteTable.managePartitionTTL(HoodieFlinkCopyOnWriteTable.java:403)
at 
org.apache.hudi.client.BaseHoodieTableServiceClient.runTableServicesInline(BaseHoodieTableServiceClient.java:588)
at 
org.apache.hudi.client.BaseHoodieWriteClient.runTableServicesInline(BaseHoodieWriteClient.java:585)
at 
org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:252)
at 
org.apache.hudi.client.HoodieFlinkWriteClient.commit(HoodieFlinkWriteClient.java:112)
at 
org.apache.hudi.client.HoodieFlinkWriteClient.commit(HoodieFlinkWriteClient.java:75)
at 
org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:201)
at 
org.apache.hudi.sink.StreamWriteOperatorCoordinator.doCommit(StreamWriteOperatorCoordinator.java:610)
at 
org.apache.hudi.sink.StreamWriteOperatorCoordinator.commitInstant(StreamWriteOperatorCoordinator.java:586)
at 
org.apache.hudi.sink.StreamWriteOperatorCoordinator.lambda$notifyCheckpointComplete$4(StreamWriteOperatorCoordinator.java:268)
at 
org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:130)
... 3 more
   Caused by: org.apache.hudi.exception.HoodieException: 
org.apache.hudi.exception.HoodieException: Error occurs when executing flatMap
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:593)
at 
java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:677)
at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:735)
at 
java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
at 
java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at 
org.apache.hudi.client.common.HoodieFlinkEngineContext.flatMap(HoodieFlinkEngineContext.java:140)
at 
org.apache.hudi.metadata.FileSystemBackedTableMetadata.getPartitionPathWithPathPrefixUsingFilterExpression(FileSystemBackedTableMetadata.java:178)
at 
org.apache.hudi.metadata.FileSystemBackedTableMetadata.getPartitionPathWithPathPrefix(FileSystemBackedTableMetadata.java:142)
at 
org.apache.hudi.metadata.FileSystemBackedTableMetadata.lambda$getPartitionPathWithPath

Re: [PR] [HUDI-1517] create marker file for every log file [hudi]

2024-05-14 Thread via GitHub


danny0405 commented on PR #11187:
URL: https://github.com/apache/hudi/pull/11187#issuecomment-2111481019

   yeah, you are right.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-1517] create marker file for every log file [hudi]

2024-05-14 Thread via GitHub


KnightChess commented on PR #11187:
URL: https://github.com/apache/hudi/pull/11187#issuecomment-2111477580

   @danny0405 if a file group only has a base file, the normal task and the 
speculation task will each create a new file with a different sequence number, as 
described in https://github.com/apache/hudi/issues/10803#issuecomment-1977980685


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]xxx.parquet is not a Parquet file [hudi]

2024-05-14 Thread via GitHub


MrAladdin commented on issue #11178:
URL: https://github.com/apache/hudi/issues/11178#issuecomment-2111453760

   @xushiyan I need your help answering the question I posted in my reply to you 
above, thank you.
   
   2. I have a question: when using Spark Structured Streaming to write data, the 
number of HFiles under .hoodie/metadata/record_index is twice the value set by 
.option("hoodie.metadata.record.index.min.filegroup.count", "720"). But when 
writing batches with an offline Spark DataFrame, each commit generates another 
corresponding set of HFiles, leading to an excessively large number of HFiles under 
record_index. What is the reason for this, how can we better control the number of 
HFiles under .hoodie/metadata/record_index, and what is a reasonable size for each 
HFile? Also, which configuration parameters are involved?
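
A hedged sketch of the record-index sizing knobs that appear relevant here (the 
exact keys, defaults, and behavior should be verified against your Hudi version; 
the values below are placeholders, not recommendations):

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: write options that bound the number and size of the file groups
// (and hence HFiles) backing .hoodie/metadata/record_index.
final class RecordIndexSizingOptions {

  static Map<String, String> recordIndexSizingOptions() {
    Map<String, String> opts = new HashMap<>();
    opts.put("hoodie.metadata.record.index.enable", "true");
    // Lower and upper bounds on the number of record-index file groups.
    opts.put("hoodie.metadata.record.index.min.filegroup.count", "720");
    opts.put("hoodie.metadata.record.index.max.filegroup.count", "1440");
    // Target size per record-index file group, which bounds individual HFile size.
    opts.put("hoodie.metadata.record.index.max.filegroup.size", String.valueOf(1024L * 1024 * 1024));
    opts.put("hoodie.metadata.record.index.growth.factor", "2.0");
    return opts;
  }
}
```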


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6563] Supports flink lookup join [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #9228:
URL: https://github.com/apache/hudi/pull/9228#issuecomment-2111446460

   
   ## CI report:
   
   * ccc385bdd3c239d7490ba4ccc99f0621d5c283d6 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23917)
 
   * 5a71eed0818abf0296db312277ac11e8c38e3f76 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23940)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6563] Supports flink lookup join [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #9228:
URL: https://github.com/apache/hudi/pull/9228#issuecomment-2111440931

   
   ## CI report:
   
   * ccc385bdd3c239d7490ba4ccc99f0621d5c283d6 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23917)
 
   * 5a71eed0818abf0296db312277ac11e8c38e3f76 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7758] Only consider files in Hudi partitions when initializing MDT [hudi]

2024-05-14 Thread via GitHub


danny0405 commented on code in PR #11219:
URL: https://github.com/apache/hudi/pull/11219#discussion_r1600832686


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -2000,16 +2000,15 @@ public DirectoryInfo(String relativePath, 
List pathInfos, Strin
   // Pre-allocate with the maximum length possible
   filenameToSizeMap = new HashMap<>(pathInfos.size());
 
+  // Presence of partition meta file implies this is a HUDI partition
+  isHoodiePartition = pathInfos.stream().anyMatch(status -> 
status.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX));

Review Comment:
   And can you clarify what kind of unexpected parquets would cause issue here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7758] Only consider files in Hudi partitions when initializing MDT [hudi]

2024-05-14 Thread via GitHub


danny0405 commented on code in PR #11219:
URL: https://github.com/apache/hudi/pull/11219#discussion_r1600832292


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -2000,16 +2000,15 @@ public DirectoryInfo(String relativePath, 
List pathInfos, Strin
   // Pre-allocate with the maximum length possible
   filenameToSizeMap = new HashMap<>(pathInfos.size());
 
+  // Presence of partition meta file implies this is a HUDI partition
+  isHoodiePartition = pathInfos.stream().anyMatch(status -> 
status.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX));

Review Comment:
   Can we fix `FSUtils.isDataFile` instead? The check for log files uses a regex 
pattern match; we should fix the base file check to be in line with the log file 
check.
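
A minimal sketch of a pattern-based base-file check in the spirit of this comment 
(the regex, class, and method names are illustrative assumptions, not Hudi's actual 
`FSUtils.isDataFile` implementation):

```java
import java.util.regex.Pattern;

// Hedged sketch: recognize Hudi base files by their naming convention
// <fileId>_<writeToken>_<instantTime>.<ext>, e.g.
// a1b2c3d4-e5f6-7890-abcd-ef1234567890-0_1-2-3_20240514093000000.parquet,
// so stray parquet files written by other engines are not treated as data files.
final class DataFileNameCheck {

  private static final Pattern BASE_FILE_PATTERN =
      Pattern.compile("^[A-Za-z0-9-]+_[0-9-]+_[0-9]+\\.(parquet|orc|hfile)$");

  static boolean looksLikeHoodieBaseFile(String fileName) {
    return BASE_FILE_PATTERN.matcher(fileName).matches();
  }
}
```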



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-1234] DO NOT MERGE testing removing timestamp from clustering [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #11222:
URL: https://github.com/apache/hudi/pull/11222#issuecomment-2111390424

   
   ## CI report:
   
   * 873159bf2e25e0c1a36480159b19399b8aa4fa5d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23936)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6613] implement inmemory file index to allow for glob paths [hudi]

2024-05-14 Thread via GitHub


hudi-bot commented on PR #10062:
URL: https://github.com/apache/hudi/pull/10062#issuecomment-2111389366

   
   ## CI report:
   
   * 2cf4e2b968d44e10937dc2a46ee4865a8d8cee67 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23937)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) 26/28: [HUDI-7636] Make StoragePath Serializable (#11049)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 667f93b1fff97fac8803878ad451ffc5ca59401a
Author: Y Ethan Guo 
AuthorDate: Wed Apr 17 21:39:28 2024 -0700

[HUDI-7636] Make StoragePath Serializable (#11049)
---
 .../java/org/apache/hudi/storage/StoragePath.java  | 14 +--
 .../apache/hudi/io/storage/TestStoragePath.java| 28 +-
 2 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/hudi-io/src/main/java/org/apache/hudi/storage/StoragePath.java 
b/hudi-io/src/main/java/org/apache/hudi/storage/StoragePath.java
index f3a88f7c89b..24bf77e76ad 100644
--- a/hudi-io/src/main/java/org/apache/hudi/storage/StoragePath.java
+++ b/hudi-io/src/main/java/org/apache/hudi/storage/StoragePath.java
@@ -23,6 +23,9 @@ import org.apache.hudi.ApiMaturityLevel;
 import org.apache.hudi.PublicAPIClass;
 import org.apache.hudi.PublicAPIMethod;
 
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
 import java.io.Serializable;
 import java.net.URI;
 import java.net.URISyntaxException;
@@ -33,12 +36,11 @@ import java.net.URISyntaxException;
  * The APIs are mainly based on {@code org.apache.hadoop.fs.Path} class.
  */
 @PublicAPIClass(maturity = ApiMaturityLevel.EVOLVING)
-// StoragePath
 public class StoragePath implements Comparable, Serializable {
   public static final char SEPARATOR_CHAR = '/';
   public static final char COLON_CHAR = ':';
   public static final String SEPARATOR = "" + SEPARATOR_CHAR;
-  private final URI uri;
+  private URI uri;
   private transient volatile StoragePath cachedParent;
   private transient volatile String cachedName;
   private transient volatile String uriString;
@@ -306,4 +308,12 @@ public class StoragePath implements 
Comparable, Serializable {
 }
 return path.substring(0, indexOfLastSlash);
   }
+
+  private void writeObject(ObjectOutputStream out) throws IOException {
+out.writeObject(uri);
+  }
+
+  private void readObject(ObjectInputStream in) throws IOException, 
ClassNotFoundException {
+uri = (URI) in.readObject();
+  }
 }
diff --git 
a/hudi-io/src/test/java/org/apache/hudi/io/storage/TestStoragePath.java 
b/hudi-io/src/test/java/org/apache/hudi/io/storage/TestStoragePath.java
index 9195ebec9fd..e7ce6ecc838 100644
--- a/hudi-io/src/test/java/org/apache/hudi/io/storage/TestStoragePath.java
+++ b/hudi-io/src/test/java/org/apache/hudi/io/storage/TestStoragePath.java
@@ -22,7 +22,14 @@ package org.apache.hudi.io.storage;
 import org.apache.hudi.storage.StoragePath;
 
 import org.junit.jupiter.api.Test;
-
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.ValueSource;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
 import java.net.URI;
 import java.net.URISyntaxException;
 import java.util.Arrays;
@@ -197,6 +204,25 @@ public class TestStoragePath {
 () -> new StoragePath("a").makeQualified(defaultUri));
   }
 
+  @ParameterizedTest
+  @ValueSource(strings = {
+  "/x/y/1.file#bar",
+  "s3://foo/bar/1%2F2%2F3",
+  "hdfs://host1/a/b/c"
+  })
+  public void testSerializability(String pathStr) throws IOException, 
ClassNotFoundException {
+StoragePath path = new StoragePath(pathStr);
+try (ByteArrayOutputStream baos = new ByteArrayOutputStream();
+ ObjectOutputStream oos = new ObjectOutputStream(baos)) {
+  oos.writeObject(path);
+  try (ByteArrayInputStream bais = new 
ByteArrayInputStream(baos.toByteArray());
+   ObjectInputStream ois = new ObjectInputStream(bais)) {
+StoragePath deserialized = (StoragePath) ois.readObject();
+assertEquals(path.toUri(), deserialized.toUri());
+  }
+}
+  }
+
   @Test
   public void testEquals() {
 assertEquals(new StoragePath("/foo"), new StoragePath("/foo"));



(hudi) 21/28: [HUDI-7625] Avoid unnecessary rewrite for metadata table (#11038)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit a233bbecb3ac9fe6f4fd5732857f87df435b8e1e
Author: Danny Chan 
AuthorDate: Wed Apr 17 14:37:28 2024 +0800

[HUDI-7625] Avoid unnecessary rewrite for metadata table (#11038)
---
 .../src/main/java/org/apache/hudi/io/HoodieMergeHandle.java | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
index 749b08c3e7e..3f9aa2981c1 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
@@ -332,7 +332,11 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
* Go through an old record. Here if we detect a newer version shows up, we 
write the new one to the file.
*/
   public void write(HoodieRecord oldRecord) {
-Schema oldSchema = config.populateMetaFields() ? writeSchemaWithMetaFields 
: writeSchema;
+// Use schema with metadata files no matter whether 
'hoodie.populate.meta.fields' is enabled
+// to avoid unnecessary rewrite. Even with metadata table(whereas the 
option 'hoodie.populate.meta.fields' is configured as false),
+// the record is deserialized with schema including metadata fields,
+// see HoodieMergeHelper#runMerge for more details.
+Schema oldSchema = writeSchemaWithMetaFields;
 Schema newSchema = preserveMetadata ? writeSchemaWithMetaFields : 
writeSchema;
 boolean copyOldRecord = true;
 String key = oldRecord.getRecordKey(oldSchema, keyGeneratorOpt);



(hudi) 11/28: [HUDI-7378] Fix Spark SQL DML with custom key generator (#10615)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 13beb355176c93574908751b0cb0b22bf6899fbb
Author: Y Ethan Guo 
AuthorDate: Fri Apr 12 22:51:03 2024 -0700

[HUDI-7378] Fix Spark SQL DML with custom key generator (#10615)
---
 .../factory/HoodieSparkKeyGeneratorFactory.java|   4 +
 .../org/apache/hudi/util/SparkKeyGenUtils.scala|  16 +-
 .../scala/org/apache/hudi/HoodieWriterUtils.scala  |  20 +-
 .../spark/sql/hudi/ProvidesHoodieConfig.scala  |  60 ++-
 .../spark/sql/hudi/TestProvidesHoodieConfig.scala  |  79 +++
 .../hudi/command/MergeIntoHoodieTableCommand.scala |   5 +-
 .../TestSparkSqlWithCustomKeyGenerator.scala   | 571 +
 7 files changed, 742 insertions(+), 13 deletions(-)

diff --git 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/factory/HoodieSparkKeyGeneratorFactory.java
 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/factory/HoodieSparkKeyGeneratorFactory.java
index 1ea5adcd6b4..dcc2eaec9eb 100644
--- 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/factory/HoodieSparkKeyGeneratorFactory.java
+++ 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/factory/HoodieSparkKeyGeneratorFactory.java
@@ -79,6 +79,10 @@ public class HoodieSparkKeyGeneratorFactory {
 
   public static KeyGenerator createKeyGenerator(TypedProperties props) throws 
IOException {
 String keyGeneratorClass = getKeyGeneratorClassName(props);
+return createKeyGenerator(keyGeneratorClass, props);
+  }
+
+  public static KeyGenerator createKeyGenerator(String keyGeneratorClass, 
TypedProperties props) throws IOException {
 boolean autoRecordKeyGen = 
KeyGenUtils.isAutoGeneratedRecordKeysEnabled(props)
 //Need to prevent overwriting the keygen for spark sql merge into 
because we need to extract
 //the recordkey from the meta cols if it exists. Sql keygen will use 
pkless keygen if needed.
diff --git 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/util/SparkKeyGenUtils.scala
 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/util/SparkKeyGenUtils.scala
index 7b91ae5a728..bd094464096 100644
--- 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/util/SparkKeyGenUtils.scala
+++ 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/util/SparkKeyGenUtils.scala
@@ -21,8 +21,8 @@ import org.apache.hudi.common.config.TypedProperties
 import org.apache.hudi.common.util.StringUtils
 import org.apache.hudi.common.util.ValidationUtils.checkArgument
 import org.apache.hudi.keygen.constant.KeyGeneratorOptions
-import org.apache.hudi.keygen.{AutoRecordKeyGeneratorWrapper, 
AutoRecordGenWrapperKeyGenerator, CustomAvroKeyGenerator, CustomKeyGenerator, 
GlobalAvroDeleteKeyGenerator, GlobalDeleteKeyGenerator, KeyGenerator, 
NonpartitionedAvroKeyGenerator, NonpartitionedKeyGenerator}
 import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory
+import org.apache.hudi.keygen.{AutoRecordKeyGeneratorWrapper, 
CustomAvroKeyGenerator, CustomKeyGenerator, GlobalAvroDeleteKeyGenerator, 
GlobalDeleteKeyGenerator, KeyGenerator, NonpartitionedAvroKeyGenerator, 
NonpartitionedKeyGenerator}
 
 object SparkKeyGenUtils {
 
@@ -35,6 +35,20 @@ object SparkKeyGenUtils {
 getPartitionColumns(keyGenerator, props)
   }
 
+  /**
+   * @param KeyGenClassNameOption key generator class name if present.
+   * @param props config properties.
+   * @return partition column names only, concatenated by ","
+   */
+  def getPartitionColumns(KeyGenClassNameOption: Option[String], props: 
TypedProperties): String = {
+val keyGenerator = if (KeyGenClassNameOption.isEmpty) {
+  HoodieSparkKeyGeneratorFactory.createKeyGenerator(props)
+} else {
+  
HoodieSparkKeyGeneratorFactory.createKeyGenerator(KeyGenClassNameOption.get, 
props)
+}
+getPartitionColumns(keyGenerator, props)
+  }
+
   /**
* @param keyGen key generator class name
* @return partition columns
diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala
index 0a4ef7a3d63..fade5957210 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala
@@ -197,8 +197,26 @@ object HoodieWriterUtils {
   
diffConfigs.append(s"KeyGenerator:\t$datasourceKeyGen\t$tableConfigKeyGen\n")
 }
 
+// Please note that the validation of partition path fields needs the 
key generator class
+// for the table, since the custom key generator expects a different 
format of
+// the value of the write config 
"hoodie.datasource.write

(hudi) 18/28: [MINOR] Remove redundant lines in StreamSync and TestStreamSyncUnitTests (#11027)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit d34ba818f75f65eec9af63d50779c7887a0af455
Author: Y Ethan Guo 
AuthorDate: Mon Apr 15 21:41:41 2024 -0700

[MINOR] Remove redundant lines in StreamSync and TestStreamSyncUnitTests 
(#11027)
---
 .../apache/hudi/utilities/streamer/StreamSync.java   |  4 
 .../utilities/streamer/TestStreamSyncUnitTests.java  | 20 
 2 files changed, 24 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
index 2b0d94da74a..7e0b97ef570 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
@@ -278,7 +278,6 @@ public class StreamSync implements Serializable, Closeable {
 this.formatAdapter = formatAdapter;
 this.transformer = transformer;
 this.useRowWriter = useRowWriter;
-
   }
 
   @Deprecated
@@ -500,7 +499,6 @@ public class StreamSync implements Serializable, Closeable {
* @return Pair Input data read from upstream 
source, and boolean is true if empty.
* @throws Exception in case of any Exception
*/
-
   public InputBatch readFromSource(String instantTime, HoodieTableMetaClient 
metaClient) throws IOException {
 // Retrieve the previous round checkpoints, if any
 Option resumeCheckpointStr = Option.empty();
@@ -563,7 +561,6 @@ public class StreamSync implements Serializable, Closeable {
 // handle empty batch with change in checkpoint
 hoodieSparkContext.setJobStatus(this.getClass().getSimpleName(), "Checking 
if input is empty: " + cfg.targetTableName);
 
-
 if (useRowWriter) { // no additional processing required for row writer.
   return inputBatch;
 } else {
@@ -1297,5 +1294,4 @@ public class StreamSync implements Serializable, 
Closeable {
   return writeStatusRDD;
 }
   }
-
 }
diff --git 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/TestStreamSyncUnitTests.java
 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/TestStreamSyncUnitTests.java
index 99148eb4b07..c0169ae64b8 100644
--- 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/TestStreamSyncUnitTests.java
+++ 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/TestStreamSyncUnitTests.java
@@ -17,25 +17,6 @@
  * under the License.
  */
 
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License.  You may obtain a copy of the License at
- *
- *   http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied.  See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
 package org.apache.hudi.utilities.streamer;
 
 import org.apache.hudi.DataSourceWriteOptions;
@@ -75,7 +56,6 @@ import static org.mockito.Mockito.verify;
 import static org.mockito.Mockito.when;
 
 public class TestStreamSyncUnitTests {
-
   @ParameterizedTest
   @MethodSource("testCasesFetchNextBatchFromSource")
   void testFetchNextBatchFromSource(Boolean useRowWriter, Boolean 
hasTransformer, Boolean hasSchemaProvider,



(hudi) 27/28: [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage (#11048)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit c8c52e252fb8b2e273cab089359e4d025bf9fe08
Author: Y Ethan Guo 
AuthorDate: Tue May 14 17:02:25 2024 -0700

[HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage 
(#11048)

This PR adds `getDefaultBlockSize` and `openSeekable` APIs to
`HoodieStorage` and implements these APIs in `HoodieHadoopStorage`.
The implementation follows the same logic used to create a seekable input
stream for log file reading, and `openSeekable` will be used by the log
reading logic.

A few util methods are moved from the `FSUtils` and
`HoodieLogFileReader` classes to the `HadoopFSUtils` class.
---
 .../java/org/apache/hudi/common/fs/FSUtils.java| 18 -
 .../hudi/common/table/log/HoodieLogFileReader.java | 75 +-
 .../org/apache/hudi/hadoop/fs/HadoopFSUtils.java   | 90 ++
 .../hudi/storage/hadoop/HoodieHadoopStorage.java   | 13 
 .../org/apache/hudi/storage/HoodieStorage.java | 30 
 .../hudi/io/storage/TestHoodieStorageBase.java | 43 +++
 6 files changed, 179 insertions(+), 90 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java 
b/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
index 292c2b41946..1b51fd78bfa 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
@@ -667,24 +667,6 @@ public class FSUtils {
 return fs.getUri() + fullPartitionPath.toUri().getRawPath();
   }
 
-  /**
-   * This is due to HUDI-140 GCS has a different behavior for detecting EOF 
during seek().
-   *
-   * @param fs fileSystem instance.
-   * @return true if the inputstream or the wrapped one is of type 
GoogleHadoopFSInputStream
-   */
-  public static boolean isGCSFileSystem(FileSystem fs) {
-return fs.getScheme().equals(StorageSchemes.GCS.getScheme());
-  }
-
-  /**
-   * Chdfs will throw {@code IOException} instead of {@code EOFException}. It 
will cause error in isBlockCorrupted().
-   * Wrapped by {@code BoundedFsDataInputStream}, to check whether the desired 
offset is out of the file size in advance.
-   */
-  public static boolean isCHDFileSystem(FileSystem fs) {
-return StorageSchemes.CHDFS.getScheme().equals(fs.getScheme());
-  }
-
   public static Configuration registerFileSystem(Path file, Configuration 
conf) {
 Configuration returnConf = new Configuration(conf);
 String scheme = HadoopFSUtils.getFs(file.toString(), conf).getScheme();
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
index c1daf5e32d1..062e3639073 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
@@ -37,20 +37,15 @@ import org.apache.hudi.common.util.Option;
 import org.apache.hudi.exception.CorruptedLogFileException;
 import org.apache.hudi.exception.HoodieIOException;
 import org.apache.hudi.exception.HoodieNotSupportedException;
-import org.apache.hudi.hadoop.fs.BoundedFsDataInputStream;
 import org.apache.hudi.hadoop.fs.HadoopSeekableDataInputStream;
-import org.apache.hudi.hadoop.fs.SchemeAwareFSDataInputStream;
-import org.apache.hudi.hadoop.fs.TimedFSDataInputStream;
 import org.apache.hudi.internal.schema.InternalSchema;
 import org.apache.hudi.io.SeekableDataInputStream;
 import org.apache.hudi.io.util.IOUtils;
+import org.apache.hudi.storage.StoragePath;
 import org.apache.hudi.storage.StorageSchemes;
 
 import org.apache.avro.Schema;
 import org.apache.hadoop.conf.Configuration;
-import org.apache.hadoop.fs.BufferedFSInputStream;
-import org.apache.hadoop.fs.FSDataInputStream;
-import org.apache.hadoop.fs.FSInputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.slf4j.Logger;
@@ -67,6 +62,7 @@ import java.util.Objects;
 
 import static org.apache.hudi.common.util.ValidationUtils.checkArgument;
 import static org.apache.hudi.common.util.ValidationUtils.checkState;
+import static org.apache.hudi.hadoop.fs.HadoopFSUtils.getFSDataInputStream;
 
 /**
  * Scans a log file and provides block level iterator on the log file Loads 
the entire block contents in memory Can emit
@@ -479,71 +475,6 @@ public class HoodieLogFileReader implements 
HoodieLogFormat.Reader {
   private static SeekableDataInputStream getDataInputStream(FileSystem fs,
 HoodieLogFile 
logFile,
 int bufferSize) {
-return new HadoopSeekableDataInputStream(getFSDataInputStream(fs, logFile, 
bufferSize));
-  }
-
-  

(hudi) 19/28: [MINOR] Rename location to path in `makeQualified` (#11037)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 9a0349828e6c6674865976501af42d0c6bc95d23
Author: Y Ethan Guo 
AuthorDate: Tue Apr 16 18:30:11 2024 -0700

[MINOR] Rename location to path in `makeQualified` (#11037)
---
 .../src/main/java/org/apache/hudi/common/fs/FSUtils.java | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java 
b/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
index 68cc5c131db..292c2b41946 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
@@ -123,14 +123,14 @@ public class FSUtils {
   }
 
   /**
-   * Makes location qualified with {@link HoodieStorage}'s URI.
+   * Makes path qualified with {@link HoodieStorage}'s URI.
*
-   * @param storage  instance of {@link HoodieStorage}.
-   * @param location to be qualified.
-   * @return qualified location, prefixed with the URI of the target 
HoodieStorage object provided.
+   * @param storage instance of {@link HoodieStorage}.
+   * @param pathto be qualified.
+   * @return qualified path, prefixed with the URI of the target HoodieStorage 
object provided.
*/
-  public static StoragePath makeQualified(HoodieStorage storage, StoragePath 
location) {
-return location.makeQualified(storage.getUri());
+  public static StoragePath makeQualified(HoodieStorage storage, StoragePath 
path) {
+return path.makeQualified(storage.getUri());
   }
 
   /**



(hudi) 25/28: [MINOR] Remove redundant TestStringUtils in hudi-common (#11046)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 925f2be736ab78137c35d0752218cd48d03b2fd0
Author: Y Ethan Guo 
AuthorDate: Wed Apr 17 21:34:06 2024 -0700

[MINOR] Remove redundant TestStringUtils in hudi-common (#11046)
---
 .../apache/hudi/common/util/TestStringUtils.java   | 124 -
 1 file changed, 124 deletions(-)

diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java 
b/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java
deleted file mode 100644
index 54985056bf0..000
--- a/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java
+++ /dev/null
@@ -1,124 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License.  You may obtain a copy of the License at
- *
- *  http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.hudi.common.util;
-
-import org.junit.jupiter.api.Test;
-
-import java.nio.ByteBuffer;
-import java.util.ArrayList;
-import java.util.Arrays;
-import java.util.Collections;
-
-import static org.apache.hudi.common.util.StringUtils.getUTF8Bytes;
-import static org.junit.jupiter.api.Assertions.assertEquals;
-import static org.junit.jupiter.api.Assertions.assertNotEquals;
-import static org.junit.jupiter.api.Assertions.assertNull;
-import static org.junit.jupiter.api.Assertions.assertTrue;
-
-/**
- * Tests {@link StringUtils}.
- */
-public class TestStringUtils {
-
-  private static final String[] STRINGS = {"This", "is", "a", "test"};
-
-  @Test
-  public void testStringJoinWithDelim() {
-String joinedString = StringUtils.joinUsingDelim("-", STRINGS);
-assertEquals(STRINGS.length, joinedString.split("-").length);
-  }
-
-  @Test
-  public void testStringJoin() {
-assertNotEquals(null, StringUtils.join(""));
-assertNotEquals(null, StringUtils.join(STRINGS));
-  }
-
-  @Test
-  public void testStringJoinWithJavaImpl() {
-assertNull(StringUtils.join(",", null));
-assertEquals("", String.join(",", Collections.singletonList("")));
-assertEquals(",", String.join(",", Arrays.asList("", "")));
-assertEquals("a,", String.join(",", Arrays.asList("a", "")));
-  }
-
-  @Test
-  public void testStringNullToEmpty() {
-String str = "This is a test";
-assertEquals(str, StringUtils.nullToEmpty(str));
-assertEquals("", StringUtils.nullToEmpty(null));
-  }
-
-  @Test
-  public void testStringObjToString() {
-assertNull(StringUtils.objToString(null));
-assertEquals("Test String", StringUtils.objToString("Test String"));
-
-// assert byte buffer
-ByteBuffer byteBuffer1 = ByteBuffer.wrap(getUTF8Bytes("1234"));
-ByteBuffer byteBuffer2 = ByteBuffer.wrap(getUTF8Bytes("5678"));
-// assert equal because ByteBuffer has overwritten the toString to return 
a summary string
-assertEquals(byteBuffer1.toString(), byteBuffer2.toString());
-// assert not equal
-assertNotEquals(StringUtils.objToString(byteBuffer1), 
StringUtils.objToString(byteBuffer2));
-  }
-
-  @Test
-  public void testStringEmptyToNull() {
-assertNull(StringUtils.emptyToNull(""));
-assertEquals("Test String", StringUtils.emptyToNull("Test String"));
-  }
-
-  @Test
-  public void testStringNullOrEmpty() {
-assertTrue(StringUtils.isNullOrEmpty(null));
-assertTrue(StringUtils.isNullOrEmpty(""));
-assertNotEquals(null, StringUtils.isNullOrEmpty("this is not empty"));
-assertTrue(StringUtils.isNullOrEmpty(""));
-  }
-
-  @Test
-  public void testSplit() {
-assertEquals(new ArrayList<>(), StringUtils.split(null, ","));
-assertEquals(new ArrayList<>(), StringUtils.split("", ","));
-assertEquals(Arrays.asList("a", "b", "c"), StringUtils.split("a,b, c", 
","));
-assertEquals(Arrays.asList("a", "b", "c"), StringUtils.split("a,b,, c ", 
","));
-  }
-
-  @Test
-  public void testHexString() {
-String str = "abcd";
-assertEquals(StringUtils.toHexString(getUTF8Bytes(str)), 
toHexString(getUTF8Bytes(str)));
-  }
-
-  private static String toHexString(byte[] bytes) {
-StringBuilder sb = new StringBuilder(bytes.length * 2);
-for (byte b : bytes) {
-  sb.append(String.format("%02x", b));
-}
-return sb.toString();
-  }
-
- 

(hudi) 15/28: [HUDI-7584] Always read log block lazily and remove readBlockLazily argument (#11015)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 2d84c5c5b1fa31b157f1c5746e916c81f05312b3
Author: Vova Kolmakov 
AuthorDate: Mon Apr 15 11:31:11 2024 +0700

[HUDI-7584] Always read log block lazily and remove readBlockLazily 
argument (#11015)
---
 .../hudi/cli/commands/HoodieLogFileCommand.java|   3 -
 .../cli/commands/TestHoodieLogFileCommand.java |   3 -
 .../org/apache/hudi/io/HoodieMergedReadHandle.java |   1 -
 .../hudi/table/action/compact/HoodieCompactor.java |   1 -
 .../run/strategy/JavaExecutionStrategy.java|   1 -
 .../MultipleSparkJobExecutionStrategy.java |   1 -
 .../hudi/common/table/TableSchemaResolver.java |  21 ++--
 .../table/log/AbstractHoodieLogRecordReader.java   |  65 +--
 .../table/log/HoodieCDCLogRecordIterator.java  |   3 +-
 .../hudi/common/table/log/HoodieLogFileReader.java |  69 +--
 .../hudi/common/table/log/HoodieLogFormat.java |  13 +--
 .../common/table/log/HoodieLogFormatReader.java|  14 +--
 .../table/log/HoodieMergedLogRecordScanner.java|  27 ++---
 .../table/log/HoodieUnMergedLogRecordScanner.java  |  12 +-
 .../hudi/common/table/log/LogReaderUtils.java  |   2 +-
 .../metadata/HoodieMetadataLogRecordReader.java|   1 -
 .../hudi/metadata/HoodieTableMetadataUtil.java |   1 -
 .../common/functional/TestHoodieLogFormat.java | 128 +++--
 .../examples/quickstart/TestQuickstartData.java|   1 -
 .../hudi/sink/clustering/ClusteringOperator.java   |   1 -
 .../org/apache/hudi/table/format/FormatUtils.java  |   6 -
 .../test/java/org/apache/hudi/utils/TestData.java  |   1 -
 .../realtime/HoodieMergeOnReadSnapshotReader.java  |   3 -
 .../realtime/RealtimeCompactedRecordReader.java|   1 -
 .../realtime/RealtimeUnmergedRecordReader.java |   1 -
 .../reader/DFSHoodieDatasetInputReader.java|   1 -
 .../src/main/scala/org/apache/hudi/Iterators.scala |   4 -
 .../ShowHoodieLogFileRecordsProcedure.scala|   1 -
 .../utilities/HoodieMetadataTableValidator.java| 126 +---
 29 files changed, 188 insertions(+), 324 deletions(-)

diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java
index 46a9e787ea6..77d9392fcd0 100644
--- 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java
+++ 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java
@@ -238,9 +238,6 @@ public class HoodieLogFileCommand {
   .withLatestInstantTime(
   client.getActiveTimeline()
   .getCommitTimeline().lastInstant().get().getTimestamp())
-  .withReadBlocksLazily(
-  Boolean.parseBoolean(
-  
HoodieCompactionConfig.COMPACTION_LAZY_BLOCK_READ_ENABLE.defaultValue()))
   .withReverseReader(
   Boolean.parseBoolean(
   
HoodieCompactionConfig.COMPACTION_REVERSE_LOG_READ_ENABLE.defaultValue()))
diff --git 
a/hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestHoodieLogFileCommand.java
 
b/hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestHoodieLogFileCommand.java
index 6f75074ff29..dc9cdd1aaf1 100644
--- 
a/hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestHoodieLogFileCommand.java
+++ 
b/hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestHoodieLogFileCommand.java
@@ -241,9 +241,6 @@ public class TestHoodieLogFileCommand extends 
CLIFunctionalTestHarness {
 .withLatestInstantTime(INSTANT_TIME)
 .withMaxMemorySizeInBytes(
 HoodieMemoryConfig.DEFAULT_MAX_MEMORY_FOR_SPILLABLE_MAP_IN_BYTES)
-.withReadBlocksLazily(
-Boolean.parseBoolean(
-
HoodieCompactionConfig.COMPACTION_LAZY_BLOCK_READ_ENABLE.defaultValue()))
 .withReverseReader(
 Boolean.parseBoolean(
 
HoodieCompactionConfig.COMPACTION_REVERSE_LOG_READ_ENABLE.defaultValue()))
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergedReadHandle.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergedReadHandle.java
index e74ab37f4b6..280e24e46b9 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergedReadHandle.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergedReadHandle.java
@@ -126,7 +126,6 @@ public class HoodieMergedReadHandle extends 
HoodieReadHandle 
implements Serializable {
 .withInstantRange(instantRange)
 
.withInternalSchema(internalSchemaOption.orElse(InternalSchema.getEmptyInternalSchema()))
 .withMaxMemorySizeInBytes(maxMemoryPerCompaction)
-.withReadBlocksLazily(config.getCompactionLazyBlockReadEnabled())
   

(hudi) 12/28: [HUDI-7616] Avoid multiple cleaner plans and deprecate hoodie.clean.allow.multiple (#11013)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 5e0a7cd555276ba1db5f7ea72e3bc94be8438048
Author: Y Ethan Guo 
AuthorDate: Sat Apr 13 19:04:44 2024 -0700

[HUDI-7616] Avoid multiple cleaner plans and deprecate 
hoodie.clean.allow.multiple (#11013)
---
 .../src/main/java/org/apache/hudi/config/HoodieCleanConfig.java | 4 +++-
 .../src/test/java/org/apache/hudi/table/TestCleaner.java| 6 +++---
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java
index a4114152023..e023bee4274 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java
@@ -167,11 +167,13 @@ public class HoodieCleanConfig extends HoodieConfig {
   + "execution is slow due to limited parallelism, you can increase 
this to tune the "
   + "performance..");
 
+  @Deprecated
   public static final ConfigProperty ALLOW_MULTIPLE_CLEANS = 
ConfigProperty
   .key("hoodie.clean.allow.multiple")
-  .defaultValue(true)
+  .defaultValue(false)
   .markAdvanced()
   .sinceVersion("0.11.0")
+  .deprecatedAfter("1.0.0")
   .withDocumentation("Allows scheduling/executing multiple cleans by 
enabling this config. If users prefer to strictly ensure clean requests should 
be mutually exclusive, "
   + ".i.e. a 2nd clean will not be scheduled if another clean is not 
yet completed to avoid repeat cleaning of same files, they might want to 
disable this config.");
 
diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
index b18238f3392..6a8ce948373 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
@@ -593,13 +593,13 @@ public class TestCleaner extends HoodieCleanerTestBase {
   timeline = metaClient.reloadActiveTimeline();
 
   assertEquals(0, cleanStats.size(), "Must not clean any files");
-  assertEquals(1, timeline.getTimelineOfActions(
+  assertEquals(0, timeline.getTimelineOfActions(
   
CollectionUtils.createSet(HoodieTimeline.CLEAN_ACTION)).filterInflightsAndRequested().countInstants());
   assertEquals(0, timeline.getTimelineOfActions(
   
CollectionUtils.createSet(HoodieTimeline.CLEAN_ACTION)).filterInflights().countInstants());
-  assertEquals(--cleanCount, timeline.getTimelineOfActions(
+  assertEquals(cleanCount, timeline.getTimelineOfActions(
   
CollectionUtils.createSet(HoodieTimeline.CLEAN_ACTION)).filterCompletedInstants().countInstants());
-  assertTrue(timeline.getTimelineOfActions(
+  assertFalse(timeline.getTimelineOfActions(
   
CollectionUtils.createSet(HoodieTimeline.CLEAN_ACTION)).filterInflightsAndRequested().containsInstant(makeNewCommitTime(--instantClean,
 "%09d")));
 }
   }



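For context on the behavior change: with hoodie.clean.allow.multiple now
defaulting to false (and deprecated after 1.0.0), a second cleaner plan is no
longer scheduled while another clean is still pending, which is what the
adjusted assertions above verify. A minimal sketch of opting back into the old
behavior while the flag still exists (Java); everything except the config
constant from the hunk above is illustrative:

  import java.util.Properties;
  import org.apache.hudi.config.HoodieCleanConfig;

  class CleanConfigSketch {
    // Sketch only: a job that still wants multiple concurrent cleans must now
    // set the deprecated flag explicitly instead of relying on the default.
    static Properties cleanerProps() {
      Properties props = new Properties();
      props.setProperty(HoodieCleanConfig.ALLOW_MULTIPLE_CLEANS.key(), "true"); // default is now false
      return props;
    }
  }
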
(hudi) 20/28: [HUDI-7578] Avoid unnecessary rewriting to improve performance (#11028)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit f1f6f93fe0f6e88b68e51b9b5703e9ae34030a95
Author: Danny Chan 
AuthorDate: Wed Apr 17 11:31:17 2024 +0800

[HUDI-7578] Avoid unnecessary rewriting to improve performance (#11028)
---
 .../src/main/java/org/apache/hudi/io/HoodieMergeHandle.java | 13 +
 .../org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java  |  2 +-
 .../java/org/apache/hudi/io/HoodieSortedMergeHandle.java|  4 ++--
 .../hudi/io/FlinkMergeAndReplaceHandleWithChangeLog.java|  2 +-
 .../org/apache/hudi/io/FlinkMergeHandleWithChangeLog.java   |  2 +-
 .../src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java |  4 
 6 files changed, 14 insertions(+), 13 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
index e40a5585067..749b08c3e7e 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
@@ -103,7 +103,7 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
   protected Map> keyToNewRecords;
   protected Set writtenRecordKeys;
   protected HoodieFileWriter fileWriter;
-  private boolean preserveMetadata = false;
+  protected boolean preserveMetadata = false;
 
   protected Path newFilePath;
   protected Path oldFilePath;
@@ -111,7 +111,6 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
   protected long recordsDeleted = 0;
   protected long updatedRecordsWritten = 0;
   protected long insertRecordsWritten = 0;
-  protected boolean useWriterSchemaForCompaction;
   protected Option keyGeneratorOpt;
   private HoodieBaseFile baseFileToMerge;
 
@@ -142,7 +141,6 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
HoodieBaseFile dataFileToBeMerged, 
TaskContextSupplier taskContextSupplier, Option 
keyGeneratorOpt) {
 super(config, instantTime, partitionPath, fileId, hoodieTable, 
taskContextSupplier);
 this.keyToNewRecords = keyToNewRecords;
-this.useWriterSchemaForCompaction = true;
 this.preserveMetadata = true;
 init(fileId, this.partitionPath, dataFileToBeMerged);
 validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
@@ -279,7 +277,7 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
   }
 
   protected void writeInsertRecord(HoodieRecord newRecord) throws 
IOException {
-Schema schema = useWriterSchemaForCompaction ? writeSchemaWithMetaFields : 
writeSchema;
+Schema schema = preserveMetadata ? writeSchemaWithMetaFields : writeSchema;
 // just skip the ignored record
 if (newRecord.shouldIgnore(schema, config.getProps())) {
   return;
@@ -308,7 +306,7 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
 }
 try {
   if (combineRecord.isPresent() && !combineRecord.get().isDelete(schema, 
config.getProps()) && !isDelete) {
-writeToFile(newRecord.getKey(), combineRecord.get(), schema, prop, 
preserveMetadata && useWriterSchemaForCompaction);
+writeToFile(newRecord.getKey(), combineRecord.get(), schema, prop, 
preserveMetadata);
 recordsWritten++;
   } else {
 recordsDeleted++;
@@ -335,7 +333,7 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
*/
   public void write(HoodieRecord oldRecord) {
 Schema oldSchema = config.populateMetaFields() ? writeSchemaWithMetaFields 
: writeSchema;
-Schema newSchema = useWriterSchemaForCompaction ? 
writeSchemaWithMetaFields : writeSchema;
+Schema newSchema = preserveMetadata ? writeSchemaWithMetaFields : 
writeSchema;
 boolean copyOldRecord = true;
 String key = oldRecord.getRecordKey(oldSchema, keyGeneratorOpt);
 TypedProperties props = config.getPayloadConfig().getProps();
@@ -384,8 +382,7 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
 // NOTE: `FILENAME_METADATA_FIELD` has to be rewritten to correctly point 
to the
 //   file holding this record even in cases when overall metadata is 
preserved
 MetadataValues metadataValues = new 
MetadataValues().setFileName(newFilePath.getName());
-HoodieRecord populatedRecord =
-record.prependMetaFields(schema, writeSchemaWithMetaFields, 
metadataValues, prop);
+HoodieRecord populatedRecord = record.prependMetaFields(schema, 
writeSchemaWithMetaFields, metadataValues, prop);
 
 if (shouldPreserveRecordMetadata) {
   fileWriter.write(key.getRecordKey(), populatedRecord, 
writeSchemaWithMetaFields);
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChange

(hudi) 09/28: [HUDI-7290] Don't assume ReplaceCommits are always Clustering (#10479)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 5b37e8412496224e6746e46100abe3e5b9f6c37d
Author: Jon Vexler 
AuthorDate: Fri Apr 12 00:08:37 2024 -0400

[HUDI-7290]  Don't assume ReplaceCommits are always Clustering (#10479)

* fix all usages not in tests
* do pass through and fix
* fix test that didn't actually use a cluster commit
* make method private and fix naming
* revert write markers changes

-

Co-authored-by: Jonathan Vexler <=>
---
 .../hudi/client/BaseHoodieTableServiceClient.java  | 10 ---
 .../org/apache/hudi/table/marker/WriteMarkers.java |  2 ++
 .../table/timeline/HoodieDefaultTimeline.java  | 31 --
 .../hudi/common/table/timeline/HoodieTimeline.java | 11 
 .../table/view/AbstractTableFileSystemView.java|  5 +---
 .../table/view/TestHoodieTableFileSystemView.java  | 30 +++--
 .../clustering/ClusteringPlanSourceFunction.java   |  2 +-
 .../java/org/apache/hudi/util/ClusteringUtil.java  |  2 +-
 .../apache/hudi/utilities/HoodieClusteringJob.java | 12 -
 9 files changed, 86 insertions(+), 19 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
index 909581687d4..e408dc7a779 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
@@ -444,8 +444,12 @@ public abstract class BaseHoodieTableServiceClient extends BaseHoodieCl
 HoodieTimeline pendingClusteringTimeline = 
table.getActiveTimeline().filterPendingReplaceTimeline();
 HoodieInstant inflightInstant = 
HoodieTimeline.getReplaceCommitInflightInstant(clusteringInstant);
 if (pendingClusteringTimeline.containsInstant(inflightInstant)) {
-  table.rollbackInflightClustering(inflightInstant, commitToRollback -> 
getPendingRollbackInfo(table.getMetaClient(), commitToRollback, false));
-  table.getMetaClient().reloadActiveTimeline();
+  if 
(pendingClusteringTimeline.isPendingClusterInstant(inflightInstant.getTimestamp()))
 {
+table.rollbackInflightClustering(inflightInstant, commitToRollback -> 
getPendingRollbackInfo(table.getMetaClient(), commitToRollback, false));
+table.getMetaClient().reloadActiveTimeline();
+  } else {
+throw new HoodieClusteringException("Non clustering replace-commit 
inflight at timestamp " + clusteringInstant);
+  }
 }
 clusteringTimer = metrics.getClusteringCtx();
 LOG.info("Starting clustering at {}", clusteringInstant);
@@ -575,7 +579,7 @@ public abstract class BaseHoodieTableServiceClient 
extends BaseHoodieCl
 
 // if just inline schedule is enabled
 if (!config.inlineClusteringEnabled() && config.scheduleInlineClustering()
-&& table.getActiveTimeline().filterPendingReplaceTimeline().empty()) {
+&& 
!table.getActiveTimeline().getLastPendingClusterInstant().isPresent()) {
   // proceed only if there are no pending clustering
   
metadata.addMetadata(HoodieClusteringConfig.SCHEDULE_INLINE_CLUSTERING.key(), 
"true");
   inlineScheduleClustering(extraMetadata);
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/WriteMarkers.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/WriteMarkers.java
index 01c8c99618a..f8fbd13b1c2 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/WriteMarkers.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/WriteMarkers.java
@@ -87,6 +87,7 @@ public abstract class WriteMarkers implements Serializable {
   HoodieTimeline pendingReplaceTimeline = 
activeTimeline.filterPendingReplaceTimeline();
   // TODO If current is compact or clustering then create marker directly 
without early conflict detection.
   // Need to support early conflict detection between table service and 
common writers.
+  // ok to use filterPendingReplaceTimeline().containsInstant because 
early conflict detection is not relevant for insert overwrite as well
   if (pendingCompactionTimeline.containsInstant(instantTime) || 
pendingReplaceTimeline.containsInstant(instantTime)) {
 return create(partitionPath, fileName, type, false);
   }
@@ -127,6 +128,7 @@ public abstract class WriteMarkers implements Serializable {
   HoodieTimeline pendingReplaceTimeline = 
activeTimeline.filterPendingReplaceTimeline();
   // TODO If current is compact or clustering then create marker directly 
without early conflict detection.
   // Need to support early conflict detec

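The gist of the fix: a pending replacecommit is no longer assumed to come from
clustering; callers first check with the new timeline helpers
(isPendingClusterInstant / getLastPendingClusterInstant) and only roll back
inflight clustering when that check passes, otherwise a
HoodieClusteringException is thrown. A minimal sketch of the guard (Java);
variable names are illustrative, the timeline calls are the ones used in the
hunks above:

  import org.apache.hudi.common.table.HoodieTableMetaClient;
  import org.apache.hudi.common.table.timeline.HoodieTimeline;

  class ClusteringGuardSketch {
    // Sketch only: decide whether a pending replacecommit at instantTime is
    // actually a clustering instant before treating it as one.
    static boolean isClusteringAt(HoodieTableMetaClient metaClient, String instantTime) {
      HoodieTimeline pendingReplace = metaClient.getActiveTimeline().filterPendingReplaceTimeline();
      return pendingReplace.containsInstant(instantTime)
          && pendingReplace.isPendingClusterInstant(instantTime);
    }
  }
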
(hudi) 05/28: [HUDI-7600] Shutdown ExecutorService when HiveMetastoreBasedLockProvider is closed (#10993)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit f01c133297862f14d4894be782456ecc72485510
Author: Zouxxyy 
AuthorDate: Thu Apr 11 13:03:14 2024 +0800

[HUDI-7600] Shutdown ExecutorService when HiveMetastoreBasedLockProvider is 
closed (#10993)
---
 .../hudi/hive/transaction/lock/HiveMetastoreBasedLockProvider.java   | 1 +
 1 file changed, 1 insertion(+)

diff --git 
a/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/transaction/lock/HiveMetastoreBasedLockProvider.java
 
b/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/transaction/lock/HiveMetastoreBasedLockProvider.java
index df848957492..0280621bb53 100644
--- 
a/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/transaction/lock/HiveMetastoreBasedLockProvider.java
+++ 
b/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/transaction/lock/HiveMetastoreBasedLockProvider.java
@@ -154,6 +154,7 @@ public class HiveMetastoreBasedLockProvider implements 
LockProvider

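The one-line fix (the hunk is cut off above) shuts down the lock provider's
internal ExecutorService when the provider is closed, so its thread does not
linger after the lock is released. The general pattern, as a plain-Java sketch
with illustrative names:

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  class LockProviderCloseSketch implements AutoCloseable {
    // Sketch only: a provider that owns an executor should release it in close().
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    @Override
    public void close() {
      executor.shutdown(); // the gist of HUDI-7600: do not leave the pool running
    }
  }
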
(hudi) 10/28: [HUDI-7601] Add heartbeat mechanism to refresh lock (#10994)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 04ec9f669778e6e1d412af4f961076de03c30ae3
Author: Yann Byron 
AuthorDate: Fri Apr 12 14:12:04 2024 +0800

[HUDI-7601] Add heartbeat mechanism to refresh lock (#10994)

* [HUDI-7601] Add heartbeat mechanism to refresh lock
---
 .../org/apache/hudi/config/HoodieLockConfig.java   | 13 +++
 .../hudi/common/config/LockConfiguration.java  |  3 ++
 .../hudi/hive/transaction/lock/Heartbeat.java  | 42 ++
 .../lock/HiveMetastoreBasedLockProvider.java   | 23 ++--
 4 files changed, 79 insertions(+), 2 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java
index b24aecf46c1..4fbae5326f3 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java
@@ -36,6 +36,7 @@ import java.util.Properties;
 
 import static 
org.apache.hudi.common.config.LockConfiguration.DEFAULT_LOCK_ACQUIRE_NUM_RETRIES;
 import static 
org.apache.hudi.common.config.LockConfiguration.DEFAULT_LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS;
+import static 
org.apache.hudi.common.config.LockConfiguration.DEFAULT_LOCK_HEARTBEAT_INTERVAL_MS;
 import static 
org.apache.hudi.common.config.LockConfiguration.DEFAULT_ZK_CONNECTION_TIMEOUT_MS;
 import static 
org.apache.hudi.common.config.LockConfiguration.DEFAULT_ZK_SESSION_TIMEOUT_MS;
 import static 
org.apache.hudi.common.config.LockConfiguration.FILESYSTEM_LOCK_EXPIRE_PROP_KEY;
@@ -49,6 +50,7 @@ import static 
org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_NUM_R
 import static 
org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_RETRY_MAX_WAIT_TIME_IN_MILLIS_PROP_KEY;
 import static 
org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS_PROP_KEY;
 import static 
org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_WAIT_TIMEOUT_MS_PROP_KEY;
+import static 
org.apache.hudi.common.config.LockConfiguration.LOCK_HEARTBEAT_INTERVAL_MS_KEY;
 import static org.apache.hudi.common.config.LockConfiguration.LOCK_PREFIX;
 import static 
org.apache.hudi.common.config.LockConfiguration.ZK_BASE_PATH_PROP_KEY;
 import static 
org.apache.hudi.common.config.LockConfiguration.ZK_CONNECTION_TIMEOUT_MS_PROP_KEY;
@@ -111,6 +113,12 @@ public class HoodieLockConfig extends HoodieConfig {
   .sinceVersion("0.8.0")
   .withDocumentation("Timeout in ms, to wait on an individual lock 
acquire() call, at the lock provider.");
 
+  public static final ConfigProperty LOCK_HEARTBEAT_INTERVAL_MS = 
ConfigProperty
+  .key(LOCK_HEARTBEAT_INTERVAL_MS_KEY)
+  .defaultValue(DEFAULT_LOCK_HEARTBEAT_INTERVAL_MS)
+  .sinceVersion("1.0.0")
+  .withDocumentation("Heartbeat interval in ms, to send a heartbeat to 
indicate that hive client holding locks.");
+
   public static final ConfigProperty FILESYSTEM_LOCK_PATH = 
ConfigProperty
   .key(FILESYSTEM_LOCK_PATH_PROP_KEY)
   .noDefaultValue()
@@ -342,6 +350,11 @@ public class HoodieLockConfig extends HoodieConfig {
   return this;
 }
 
+public HoodieLockConfig.Builder withHeartbeatIntervalInMillis(Long 
intervalInMillis) {
+  lockConfig.setValue(LOCK_HEARTBEAT_INTERVAL_MS, 
String.valueOf(intervalInMillis));
+  return this;
+}
+
 public HoodieLockConfig.Builder 
withConflictResolutionStrategy(ConflictResolutionStrategy 
conflictResolutionStrategy) {
   lockConfig.setValue(WRITE_CONFLICT_RESOLUTION_STRATEGY_CLASS_NAME, 
conflictResolutionStrategy.getClass().getName());
   return this;
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/config/LockConfiguration.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/config/LockConfiguration.java
index c6ebc54e95d..1788122ffe4 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/config/LockConfiguration.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/config/LockConfiguration.java
@@ -43,6 +43,9 @@ public class LockConfiguration implements Serializable {
 
   public static final String LOCK_ACQUIRE_WAIT_TIMEOUT_MS_PROP_KEY = 
LOCK_PREFIX + "wait_time_ms";
 
+  public static final String LOCK_HEARTBEAT_INTERVAL_MS_KEY = LOCK_PREFIX + 
"heartbeat_interval_ms";
+  public static final int DEFAULT_LOCK_HEARTBEAT_INTERVAL_MS = 60 * 1000;
+
   // configs for file system based locks. NOTE: This only works for DFS with 
atomic create/delete operation
   public static final String FILESYSTEM_BASED_LOCK_PROPERTY_PREFIX = 
LOCK_PREFIX + "filesystem.";
 
diff --git 
a/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/transaction/lock/Heartbeat.java
 
b/hudi-sync/hudi-hive-sync/src

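For users of the Hive metastore lock provider: the heartbeat is controlled by
the new LOCK_HEARTBEAT_INTERVAL_MS option (default 60 * 1000 ms), and
HoodieLockConfig gains a builder method for it. A minimal sketch (Java); the
builder method and default come from the hunks above, while the
newBuilder()/build() scaffolding is assumed to follow the usual Hudi config
pattern:

  import org.apache.hudi.config.HoodieLockConfig;

  class LockHeartbeatSketch {
    // Sketch only: send a lock heartbeat every 30s instead of the 60s default.
    static HoodieLockConfig lockConfig() {
      return HoodieLockConfig.newBuilder()
          .withHeartbeatIntervalInMillis(30_000L)
          .build();
    }
  }
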
(hudi) 01/28: [HUDI-7556] Fixing false positive validation with MDT validator (#10986)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit fad8ff04c67b8527506a88ad4d20dd589d055ffa
Author: Sivabalan Narayanan 
AuthorDate: Tue May 14 17:43:15 2024 -0700

[HUDI-7556] Fixing false positive validation with MDT validator (#10986)
---
 .../utilities/HoodieMetadataTableValidator.java|  96 +---
 .../TestHoodieMetadataTableValidator.java  | 125 -
 2 files changed, 181 insertions(+), 40 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
index bbe8610abe3..0e6630967b3 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
@@ -52,6 +52,7 @@ import org.apache.hudi.common.util.ConfigUtils;
 import org.apache.hudi.common.util.FileIOUtils;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.ParquetUtils;
+import org.apache.hudi.common.util.VisibleForTesting;
 import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieIOException;
@@ -514,7 +515,9 @@ public class HoodieMetadataTableValidator implements 
Serializable {
 }
 
 HoodieSparkEngineContext engineContext = new HoodieSparkEngineContext(jsc);
-List allPartitions = validatePartitions(engineContext, basePath);
+// compare partitions
+
+List allPartitions = validatePartitions(engineContext, basePath, 
metaClient);
 
 if (allPartitions.isEmpty()) {
   LOG.warn("The result of getting all partitions is null or empty, skip 
current validation. {}", taskLabels);
@@ -612,39 +615,14 @@ public class HoodieMetadataTableValidator implements 
Serializable {
   /**
* Compare the listing partitions result between metadata table and 
fileSystem.
*/
-  private List validatePartitions(HoodieSparkEngineContext 
engineContext, String basePath) {
+  @VisibleForTesting
+  List validatePartitions(HoodieSparkEngineContext engineContext, 
String basePath, HoodieTableMetaClient metaClient) {
 // compare partitions
-List allPartitionPathsFromFS = 
FSUtils.getAllPartitionPaths(engineContext, basePath, false, 
cfg.assumeDatePartitioning);
 HoodieTimeline completedTimeline = 
metaClient.getCommitsTimeline().filterCompletedInstants();
+List allPartitionPathsFromFS = 
getPartitionsFromFileSystem(engineContext, basePath, metaClient.getFs(),
+completedTimeline);
 
-// ignore partitions created by uncommitted ingestion.
-allPartitionPathsFromFS = 
allPartitionPathsFromFS.stream().parallel().filter(part -> {
-  HoodiePartitionMetadata hoodiePartitionMetadata =
-  new HoodiePartitionMetadata(metaClient.getFs(), 
FSUtils.getPartitionPath(basePath, part));
-
-  Option instantOption = 
hoodiePartitionMetadata.readPartitionCreatedCommitTime();
-  if (instantOption.isPresent()) {
-String instantTime = instantOption.get();
-// There are two cases where the created commit time is written to the 
partition metadata:
-// (1) Commit C1 creates the partition and C1 succeeds, the partition 
metadata has C1 as
-// the created commit time.
-// (2) Commit C1 creates the partition, the partition metadata is 
written, and C1 fails
-// during writing data files.  Next time, C2 adds new data to the same 
partition after C1
-// is rolled back. In this case, the partition metadata still has C1 
as the created commit
-// time, since Hudi does not rewrite the partition metadata in C2.
-if (!completedTimeline.containsOrBeforeTimelineStarts(instantTime)) {
-  Option lastInstant = completedTimeline.lastInstant();
-  return lastInstant.isPresent()
-  && HoodieTimeline.compareTimestamps(
-  instantTime, LESSER_THAN_OR_EQUALS, 
lastInstant.get().getTimestamp());
-}
-return true;
-  } else {
-return false;
-  }
-}).collect(Collectors.toList());
-
-List allPartitionPathsMeta = 
FSUtils.getAllPartitionPaths(engineContext, basePath, true, 
cfg.assumeDatePartitioning);
+List allPartitionPathsMeta = getPartitionsFromMDT(engineContext, 
basePath);
 
 Collections.sort(allPartitionPathsFromFS);
 Collections.sort(allPartitionPathsMeta);
@@ -652,26 +630,23 @@ public class HoodieMetadataTableValidator implements 
Serializable {
 if (allPartitionPathsFromFS.size() != allPartitionPathsMeta.size()
 || !allPartitionPathsFromFS.equals(allPartitionPathsMeta)) {
   List additionalFromFS = new ArrayList<>(allPartitionPathsFromFS);
-  additionalFromFS.remove(allPartitionPathsMeta);
+  addit

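One detail worth calling out from the last (truncated) hunk: the old code
called additionalFromFS.remove(allPartitionPathsMeta), i.e. List.remove(Object)
with a whole collection as the argument, which removes nothing useful and so
inflated the "additional partitions" difference. A plain-Java illustration of
the distinction (independent of Hudi):

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;

  class RemoveVsRemoveAllSketch {
    public static void main(String[] args) {
      List<String> fromFs  = new ArrayList<>(Arrays.asList("2024/01/01", "2024/01/02"));
      List<String> fromMdt = Arrays.asList("2024/01/01");

      List<String> wrong = new ArrayList<>(fromFs);
      wrong.remove(fromMdt);      // no-op: the list itself is not an element of 'wrong'
      // wrong still contains both partitions -> false positives

      List<String> right = new ArrayList<>(fromFs);
      right.removeAll(fromMdt);   // element-wise difference
      // right now contains only "2024/01/02"
    }
  }
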
(hudi) 02/28: [HUDI-7583] Read log block header only for the schema and instant time (#10984)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 53bdcb03469b0f58fe674cf10569c56d6afdf0b1
Author: Y Ethan Guo 
AuthorDate: Tue May 14 16:07:09 2024 -0700

[HUDI-7583] Read log block header only for the schema and instant time 
(#10984)
---
 .../hudi/common/table/TableSchemaResolver.java |  5 +-
 .../common/functional/TestHoodieLogFormat.java |  2 +-
 .../hudi/common/table/TestTableSchemaResolver.java | 56 ++
 .../utilities/HoodieMetadataTableValidator.java|  2 +-
 4 files changed, 62 insertions(+), 3 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java
index f37dd4e7540..0344331ab75 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java
@@ -385,7 +385,10 @@ public class TableSchemaResolver {
* @return
*/
   public static MessageType readSchemaFromLogFile(FileSystem fs, Path path) 
throws IOException {
-try (Reader reader = HoodieLogFormat.newReader(fs, new 
HoodieLogFile(path), null)) {
+// We only need to read the schema from the log block header,
+// so we read the block lazily to avoid reading block content
+// containing the records
+try (Reader reader = HoodieLogFormat.newReader(fs, new 
HoodieLogFile(path), null, true, false)) {
   HoodieDataBlock lastBlock = null;
   while (reader.hasNext()) {
 HoodieLogBlock block = reader.next();
diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/functional/TestHoodieLogFormat.java
 
b/hudi-common/src/test/java/org/apache/hudi/common/functional/TestHoodieLogFormat.java
index 0b3bcc812ae..d4cb5021afc 100755
--- 
a/hudi-common/src/test/java/org/apache/hudi/common/functional/TestHoodieLogFormat.java
+++ 
b/hudi-common/src/test/java/org/apache/hudi/common/functional/TestHoodieLogFormat.java
@@ -2804,7 +2804,7 @@ public class TestHoodieLogFormat extends 
HoodieCommonTestHarness {
 }
   }
 
-  private static HoodieDataBlock getDataBlock(HoodieLogBlockType 
dataBlockType, List records,
+  public static HoodieDataBlock getDataBlock(HoodieLogBlockType dataBlockType, 
List records,
   Map 
header) {
 return getDataBlock(dataBlockType, 
records.stream().map(HoodieAvroIndexedRecord::new).collect(Collectors.toList()),
 header, new Path("dummy_path"));
   }
diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/table/TestTableSchemaResolver.java
 
b/hudi-common/src/test/java/org/apache/hudi/common/table/TestTableSchemaResolver.java
index b7f0ba8eba7..d8d0d8c9f72 100644
--- 
a/hudi-common/src/test/java/org/apache/hudi/common/table/TestTableSchemaResolver.java
+++ 
b/hudi-common/src/test/java/org/apache/hudi/common/table/TestTableSchemaResolver.java
@@ -19,13 +19,33 @@
 package org.apache.hudi.common.table;
 
 import org.apache.hudi.avro.AvroSchemaUtils;
+import org.apache.hudi.common.model.HoodieLogFile;
+import org.apache.hudi.common.table.log.HoodieLogFormat;
+import org.apache.hudi.common.table.log.block.HoodieDataBlock;
+import org.apache.hudi.common.table.log.block.HoodieLogBlock;
 import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.common.testutils.SchemaTestUtil;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.internal.schema.HoodieSchemaException;
 
 import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.avro.AvroSchemaConverter;
 import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
 
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import static 
org.apache.hudi.common.functional.TestHoodieLogFormat.getDataBlock;
+import static 
org.apache.hudi.common.table.log.block.HoodieLogBlock.HoodieLogBlockType.AVRO_DATA_BLOCK;
+import static org.apache.hudi.common.testutils.SchemaTestUtil.getSimpleSchema;
 import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.junit.jupiter.api.Assertions.assertNotEquals;
 import static org.junit.jupiter.api.Assertions.assertTrue;
@@ -35,6 +55,9 @@ import static org.junit.jupiter.api.Assertions.assertTrue;
  */
 public class TestTableSchemaResolver {
 
+  @TempDir
+  public java.nio.file.Path tempDir;
+
   @Test
   public void testRecreateSchemaWhenDropPartitionColumns() {
 Schema originSchema = new 
Schema.Parser().parse(HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA);
@@ -65,4 +88,37 @@ public class TestTableSchemaResolver {
   assertTrue

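Callers are unaffected by this change: readSchemaFromLogFile keeps its
signature and now simply avoids decoding record content while scanning block
headers. A minimal sketch of a call site (Java); the surrounding setup is
illustrative, the static method itself is the one shown in the hunk above:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hudi.common.table.TableSchemaResolver;
  import org.apache.parquet.schema.MessageType;

  class LogSchemaSketch {
    // Sketch only: read just the writer schema of a Hudi log file.
    static MessageType schemaOf(String logFilePath) throws IOException {
      Path path = new Path(logFilePath);
      FileSystem fs = path.getFileSystem(new Configuration());
      return TableSchemaResolver.readSchemaFromLogFile(fs, path);
    }
  }
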
(hudi) 03/28: [HUDI-7597] Add logs of Kafka offsets when the checkpoint is out of bound (#10987)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit e5054aa56dbce0ee7d424a045bf1ae9bca68f484
Author: Y Ethan Guo 
AuthorDate: Wed Apr 10 03:03:45 2024 -0700

[HUDI-7597] Add logs of Kafka offsets when the checkpoint is out of bound 
(#10987)

* [HUDI-7597] Add logs of Kafka offsets when the checkpoint is out of bound

* Adjust test
---
 .../utilities/sources/helpers/KafkaOffsetGen.java  | 29 +++---
 .../utilities/sources/BaseTestKafkaSource.java | 16 ++--
 2 files changed, 27 insertions(+), 18 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
index 442046cd948..71fe7a7629a 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
@@ -331,24 +331,35 @@ public class KafkaOffsetGen {
 
   /**
* Fetch checkpoint offsets for each partition.
-   * @param consumer instance of {@link KafkaConsumer} to fetch offsets from.
+   *
+   * @param consumer  instance of {@link KafkaConsumer} to fetch 
offsets from.
* @param lastCheckpointStr last checkpoint string.
-   * @param topicPartitions set of topic partitions.
+   * @param topicPartitions   set of topic partitions.
* @return a map of Topic partitions to offsets.
*/
   private Map fetchValidOffsets(KafkaConsumer consumer,
-Option 
lastCheckpointStr, Set topicPartitions) {
+  Option 
lastCheckpointStr, Set topicPartitions) {
 Map earliestOffsets = 
consumer.beginningOffsets(topicPartitions);
 Map checkpointOffsets = 
CheckpointUtils.strToOffsets(lastCheckpointStr.get());
-boolean isCheckpointOutOfBounds = checkpointOffsets.entrySet().stream()
-.anyMatch(offset -> offset.getValue() < 
earliestOffsets.get(offset.getKey()));
+List outOfBoundPartitionList = 
checkpointOffsets.entrySet().stream()
+.filter(offset -> offset.getValue() < 
earliestOffsets.get(offset.getKey()))
+.map(Map.Entry::getKey)
+.collect(Collectors.toList());
+boolean isCheckpointOutOfBounds = !outOfBoundPartitionList.isEmpty();
+
 if (isCheckpointOutOfBounds) {
+  String outOfBoundOffsets = outOfBoundPartitionList.stream()
+  .map(p -> p.toString() + ":{checkpoint=" + checkpointOffsets.get(p)
+  + ",earliestOffset=" + earliestOffsets.get(p) + "}")
+  .collect(Collectors.joining(","));
+  String message = "Some data may have been lost because they are not 
available in Kafka any more;"
+  + " either the data was aged out by Kafka or the topic may have been 
deleted before all the data in the topic was processed. "
+  + "Kafka partitions that have out-of-bound checkpoints: " + 
outOfBoundOffsets + " .";
+
   if (getBooleanWithAltKeys(this.props, 
KafkaSourceConfig.ENABLE_FAIL_ON_DATA_LOSS)) {
-throw new HoodieStreamerException("Some data may have been lost 
because they are not available in Kafka any more;"
-+ " either the data was aged out by Kafka or the topic may have 
been deleted before all the data in the topic was processed.");
+throw new HoodieStreamerException(message);
   } else {
-LOG.warn("Some data may have been lost because they are not available 
in Kafka any more;"
-+ " either the data was aged out by Kafka or the topic may have 
been deleted before all the data in the topic was processed."
+LOG.warn(message
 + " If you want Hudi Streamer to fail on such cases, set \"" + 
KafkaSourceConfig.ENABLE_FAIL_ON_DATA_LOSS.key() + "\" to \"true\".");
   }
 }
diff --git 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java
 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java
index c5fc7bfaafa..e45d10e7a61 100644
--- 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java
+++ 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java
@@ -53,6 +53,7 @@ import static 
org.apache.hudi.utilities.config.KafkaSourceConfig.ENABLE_KAFKA_CO
 import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.junit.jupiter.api.Assertions.assertNotNull;
 import static org.junit.jupiter.api.Assertions.assertThrows;
+import static org.junit.jupiter.api.Assertions.assertTrue;
 import static org.mockito.Mockito.mock;
 import static org.mockito.Mockito.when;
 
@@ -254,7 +255,7 @@ abstract class BaseTestKafkaSource extends 
SparkClientFunctionalTestHarness {
 final S

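The detection itself is unchanged in spirit; it now also records which
partitions are out of bounds so the warning or exception can show both the
checkpointed and the earliest available offset. The core check, reduced to a
self-contained sketch (Java; map contents as in the hunk above):

  import org.apache.kafka.common.TopicPartition;
  import java.util.List;
  import java.util.Map;
  import java.util.stream.Collectors;

  class OutOfBoundCheckpointSketch {
    // Sketch only: partitions whose checkpointed offset is older than what the broker still retains.
    static List<TopicPartition> outOfBound(Map<TopicPartition, Long> checkpointOffsets,
                                           Map<TopicPartition, Long> earliestOffsets) {
      return checkpointOffsets.entrySet().stream()
          .filter(e -> e.getValue() < earliestOffsets.get(e.getKey()))
          .map(Map.Entry::getKey)
          .collect(Collectors.toList());
    }
  }
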
(hudi) 04/28: [MINOR] Fix BUG: HoodieLogFormatWriter: unable to close output stream for log file HoodieLogFile{xxx} (#10989)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit fa9cc9f915f1fef827db2990bd84f1e29a484ffb
Author: Silly Carbon 
AuthorDate: Wed Apr 10 18:21:57 2024 +0800

[MINOR] Fix BUG: HoodieLogFormatWriter: unable to close output stream for 
log file HoodieLogFile{xxx} (#10989)

* Due to java.lang.IllegalStateException: Shutdown in progress: when
`org.apache.hudi.common.table.log.HoodieLogFormatWriter.close` tries to
`removeShutdownHook` while the JVM is already running its shutdown hooks,
the hook registry has already been cleared (hooks == null), so the close fails.
---
 .../java/org/apache/hudi/common/table/log/HoodieLogFormatWriter.java| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatWriter.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatWriter.java
index 0b16d2ee2a6..d021cd2c499 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatWriter.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatWriter.java
@@ -294,7 +294,7 @@ public class HoodieLogFormatWriter implements 
HoodieLogFormat.Writer {
 try {
   LOG.warn("running logformatwriter hook");
   if (output != null) {
-close();
+closeStream();
   }
 } catch (Exception e) {
   LOG.warn("unable to close output stream for log file " + logFile, e);



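Put differently: the shutdown hook must not go through the full close() path,
because that path also calls Runtime.removeShutdownHook, which throws
IllegalStateException once the JVM is already executing its hooks; closing
only the stream sidesteps that. A generic sketch of the pattern (Java; field
and method names are illustrative, not the writer's actual members):

  import java.io.IOException;
  import java.io.OutputStream;

  class ShutdownHookCloseSketch {
    private OutputStream output;
    private Thread hook;

    // Sketch only: the hook closes the stream directly instead of calling a
    // close() that would also try to deregister the hook mid-shutdown.
    void register() {
      hook = new Thread(() -> {
        try {
          if (output != null) {
            output.close();
          }
        } catch (IOException e) {
          // best effort during JVM shutdown
        }
      });
      Runtime.getRuntime().addShutdownHook(hook);
    }
  }
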
(hudi) branch branch-0.x updated (31d24d7600b -> 8bc56c52d6c)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git


 discard 31d24d7600b [HUDI-7637] Make StoragePathInfo Comparable (#11050)
 discard 6c152dbef8a [HUDI-7635] Add default block size and openSeekable APIs 
to HoodieStorage (#11048)
 discard 9b0945c7817 [HUDI-7636] Make StoragePath Serializable (#11049)
 discard bfb83c8eb8c [MINOR] Remove redundant TestStringUtils in hudi-common 
(#11046)
 discard d20f8175b95 [HUDI-7633] Use try with resources for AutoCloseable 
(#11045)
 discard 6fc3ad4060e [HUDI-4228] Clean up literal usage in Hudi CLI argument 
check (#11042)
 discard 3cdb6faefcb [HUDI-7626] Propagate UserGroupInformation from the main 
thread to the new thread of timeline service threadpool (#11039)
 discard a5304f586f4 [HUDI-7625] Avoid unnecessary rewrite for metadata table 
(#11038)
 discard ce9ff1cb21a [HUDI-7578] Avoid unnecessary rewriting to improve 
performance (#11028)
 discard 51f7557070d [MINOR] Rename location to path in `makeQualified` (#11037)
 discard f0cf58d8c99 [MINOR] Remove redundant lines in StreamSync and 
TestStreamSyncUnitTests (#11027)
 discard 566039010cf [HUDI-6762] Removed usages of 
MetadataRecordsGenerationParams (#10962)
 discard 59a1b28849f [HUDI-7619] Removed code duplicates in 
HoodieTableMetadataUtil (#11022)
 discard e39fedfa5b6 [HUDI-7584] Always read log block lazily and remove 
readBlockLazily argument (#11015)
 discard 89ff73de175 [HUDI-7615] Mark a few write configs with the correct 
sinceVersion (#11012)
 discard 7e5871a4634 [HUDI-7606] Unpersist RDDs after table services, mainly 
compaction and clustering (#11000)
 discard f12cd043095 [HUDI-7616] Avoid multiple cleaner plans and deprecate 
hoodie.clean.allow.multiple (#11013)
 discard 922e1eb87d0 [HUDI-7378] Fix Spark SQL DML with custom key generator 
(#10615)
 discard 4aa23beec97 [HUDI-7601] Add heartbeat mechanism to refresh lock 
(#10994)
 discard f9ffb646462 [HUDI-7290]  Don't assume ReplaceCommits are always 
Clustering (#10479)
 discard e9cd05376a3 [HUDI-7605] Allow merger strategy to be set in spark sql 
writer (#10999)
 discard db53c7af20c [HUDI-6441] Passing custom Headers with Hudi Callback URL 
(#10970)
 discard 858fde11fdd [HUDI-7391] HoodieMetadataMetrics should use Metrics 
instance for metrics registry (#10635)
 discard 86f6bdf991e [HUDI-7600] Shutdown ExecutorService when 
HiveMetastoreBasedLockProvider is closed (#10993)
 discard 705e5f59f28 [MINOR] Fix BUG: HoodieLogFormatWriter: unable to close 
output stream for log file HoodieLogFile{xxx} (#10989)
 discard ef96676f39e [HUDI-7597] Add logs of Kafka offsets when the checkpoint 
is out of bound (#10987)
 discard a9c7eebd0fc [HUDI-7583] Read log block header only for the schema and 
instant time (#10984)
 discard 86968a6b9e4 [HUDI-7556] Fixing false positive validation with MDT 
validator (#10986)
 new fad8ff04c67 [HUDI-7556] Fixing false positive validation with MDT 
validator (#10986)
 new 53bdcb03469 [HUDI-7583] Read log block header only for the schema and 
instant time (#10984)
 new e5054aa56db [HUDI-7597] Add logs of Kafka offsets when the checkpoint 
is out of bound (#10987)
 new fa9cc9f915f [MINOR] Fix BUG: HoodieLogFormatWriter: unable to close 
output stream for log file HoodieLogFile{xxx} (#10989)
 new f01c1332978 [HUDI-7600] Shutdown ExecutorService when 
HiveMetastoreBasedLockProvider is closed (#10993)
 new cb05c775cc0 [HUDI-7391] HoodieMetadataMetrics should use Metrics 
instance for metrics registry (#10635)
 new 741bd784113 [HUDI-6441] Passing custom Headers with Hudi Callback URL 
(#10970)
 new ebd8a7d9690 [HUDI-7605] Allow merger strategy to be set in spark sql 
writer (#10999)
 new 5b37e841249 [HUDI-7290]  Don't assume ReplaceCommits are always 
Clustering (#10479)
 new 04ec9f66977 [HUDI-7601] Add heartbeat mechanism to refresh lock 
(#10994)
 new 13beb355176 [HUDI-7378] Fix Spark SQL DML with custom key generator 
(#10615)
 new 5e0a7cd5552 [HUDI-7616] Avoid multiple cleaner plans and deprecate 
hoodie.clean.allow.multiple (#11013)
 new 1b6e36140cb [HUDI-7606] Unpersist RDDs after table services, mainly 
compaction and clustering (#11000)
 new 0c06b3ce011 [HUDI-7615] Mark a few write configs with the correct 
sinceVersion (#11012)
 new 2d84c5c5b1f [HUDI-7584] Always read log block lazily and remove 
readBlockLazily argument (#11015)
 new 23637b7a076 [HUDI-7619] Removed code duplicates in 
HoodieTableMetadataUtil (#11022)
 new 9c5feffa593 [HUDI-6762] Removed usages of 
MetadataRecordsGenerationParams (#10962)
 new d34ba818f75 [MINOR] Remove redundant lines in StreamSync and 
TestStreamSyncUnitTests (#11027)
 new 9a0349828e6 [MINOR] Rename location to path in `makeQualified` (#11037)
 new f1f6f93fe0f [HUDI-7578] Avoid unnecessary rewriting to improve 
performance (#11028)
 new a233bbecb3a [HUDI-7625] Avoid unnecessary rewrite 

(hudi) 06/28: [HUDI-7391] HoodieMetadataMetrics should use Metrics instance for metrics registry (#10635)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit cb05c775cc0b45d13e833731ffc9bdd7915063c5
Author: Lokesh Jain 
AuthorDate: Tue May 14 16:04:34 2024 -0700

[HUDI-7391] HoodieMetadataMetrics should use Metrics instance for metrics 
registry (#10635)

Currently HoodieMetadataMetrics stores metrics in memory, and these metrics
are not pushed by the metric reporters, which are configured within the
Metrics instance. Changes in this PR:

* Metrics-related classes have been moved from hudi-client-common to
hudi-common.
* HoodieMetadataMetrics now uses the Metrics class so that all the reporters
can be supported with it.
* Some gaps in the configs are filled; the missing options are added in
HoodieMetadataWriteUtils.
* Some metrics-related APIs and functionality have been moved to
HoodieMetricsConfig; the HoodieWriteConfig APIs now delegate to
HoodieMetricsConfig for that functionality.
---
 hudi-client/hudi-client-common/pom.xml |  46 -
 .../lock/metrics/HoodieLockMetrics.java|   2 +-
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  98 +-
 .../hudi/metadata/HoodieMetadataWriteUtils.java|   9 +-
 .../org/apache/hudi/metrics/HoodieMetrics.java |   2 +-
 .../cloudwatch/CloudWatchMetricsReporter.java  |  29 ++-
 .../table/action/index/RunIndexActionExecutor.java |   3 +-
 .../hudi/metrics/TestHoodieConsoleMetrics.java |  16 +-
 .../hudi/metrics/TestHoodieGraphiteMetrics.java|  22 ++-
 .../apache/hudi/metrics/TestHoodieJmxMetrics.java  |  19 +-
 .../org/apache/hudi/metrics/TestHoodieMetrics.java |  17 +-
 .../hudi/metrics/TestMetricsReporterFactory.java   |  20 +-
 .../cloudwatch/TestCloudWatchMetricsReporter.java  |  27 ++-
 .../datadog/TestDatadogMetricsReporter.java|  60 +++---
 .../org/apache/hudi/metrics/m3/TestM3Metrics.java  |  54 +++---
 .../metrics/prometheus/TestPrometheusReporter.java |  19 +-
 .../prometheus/TestPushGateWayReporter.java|  52 +++---
 .../FlinkHoodieBackedTableMetadataWriter.java  |   4 +-
 .../JavaHoodieBackedTableMetadataWriter.java   |   4 +-
 .../hudi/client/TestJavaHoodieBackedMetadata.java  |  21 ++-
 .../SparkHoodieBackedTableMetadataWriter.java  |   2 +-
 .../functional/TestHoodieBackedMetadata.java   |  18 +-
 hudi-common/pom.xml|  47 +
 .../hudi/common/config/HoodieCommonConfig.java |   8 +
 .../metrics/HoodieMetricsCloudWatchConfig.java |   0
 .../hudi/config/metrics/HoodieMetricsConfig.java   | 201 +
 .../config/metrics/HoodieMetricsDatadogConfig.java |   0
 .../metrics/HoodieMetricsGraphiteConfig.java   |   0
 .../config/metrics/HoodieMetricsJmxConfig.java |   0
 .../hudi/config/metrics/HoodieMetricsM3Config.java |   0
 .../metrics/HoodieMetricsPrometheusConfig.java |   0
 .../apache/hudi/metadata/BaseTableMetadata.java|   4 +-
 .../hudi/metadata/HoodieMetadataMetrics.java   |  21 ++-
 .../hudi/metrics/ConsoleMetricsReporter.java   |   0
 .../java/org/apache/hudi/metrics/HoodieGauge.java  |   0
 .../hudi/metrics/InMemoryMetricsReporter.java  |   0
 .../apache/hudi/metrics/JmxMetricsReporter.java|   4 +-
 .../org/apache/hudi/metrics/JmxReporterServer.java |   0
 .../java/org/apache/hudi/metrics/MetricUtils.java  |   0
 .../main/java/org/apache/hudi/metrics/Metrics.java |  43 +++--
 .../hudi/metrics/MetricsGraphiteReporter.java  |  16 +-
 .../org/apache/hudi/metrics/MetricsReporter.java   |   0
 .../hudi/metrics/MetricsReporterFactory.java   |  27 ++-
 .../apache/hudi/metrics/MetricsReporterType.java   |   0
 .../custom/CustomizableMetricsReporter.java|   0
 .../hudi/metrics/datadog/DatadogHttpClient.java|   0
 .../metrics/datadog/DatadogMetricsReporter.java|   4 +-
 .../hudi/metrics/datadog/DatadogReporter.java  |   0
 .../apache/hudi/metrics/m3/M3MetricsReporter.java  |  16 +-
 .../hudi/metrics/m3/M3ScopeReporterAdaptor.java|   0
 .../metrics/prometheus/PrometheusReporter.java |  10 +-
 .../prometheus/PushGatewayMetricsReporter.java |  18 +-
 .../metrics/prometheus/PushGatewayReporter.java|   0
 .../AbstractUserDefinedMetricsReporter.java|   0
 .../deltastreamer/HoodieDeltaStreamerMetrics.java  |   8 +-
 .../ingestion/HoodieIngestionMetrics.java  |   7 +-
 .../utilities/streamer/HoodieStreamerMetrics.java  |   5 +
 .../apache/hudi/utilities/streamer/StreamSync.java |   2 +-
 58 files changed, 650 insertions(+), 335 deletions(-)

diff --git a/hudi-client/hudi-client-common/pom.xml 
b/hudi-client/hudi-client-common/pom.xml
index 6caccd0b0a6..022f5d6faa0 100644
--- a/hudi-client/hudi-client-common/pom.xml
+++ b/hudi-client/hudi-client-common/pom.xml
@@ -85,52 +85,6 @@
   0.2.2
 
 
-
-
-  io.dropwizard.metrics
-  metrics-graphite
-  
-
-  com.rabbitmq
-

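Practical upshot for operators: once metadata-table metrics flow through the
shared Metrics instance, enabling the regular writer metrics reporter is
enough for them to be emitted as well. A minimal sketch (Java); the key names
below are the commonly documented Hudi metrics keys and should be verified
against your version, the values are illustrative:

  import java.util.Properties;

  class MetricsPropsSketch {
    // Sketch only: turn on the writer's metrics reporter; with this change the
    // metadata table metrics are reported through the same pipeline.
    static Properties metricsProps() {
      Properties props = new Properties();
      props.setProperty("hoodie.metrics.on", "true");                // assumed key name
      props.setProperty("hoodie.metrics.reporter.type", "GRAPHITE"); // assumed key name; or JMX, PROMETHEUS, DATADOG, ...
      return props;
    }
  }
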
(hudi) 17/28: [HUDI-6762] Removed usages of MetadataRecordsGenerationParams (#10962)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 9c5feffa59319c9ecde5956135c25ed8ac8e20bf
Author: Vova Kolmakov 
AuthorDate: Tue May 14 16:10:59 2024 -0700

[HUDI-6762] Removed usages of MetadataRecordsGenerationParams (#10962)

Co-authored-by: Vova Kolmakov 
---
 .../metadata/HoodieBackedTableMetadataWriter.java  | 118 +
 .../hudi/metadata/HoodieTableMetadataUtil.java | 266 -
 .../metadata/MetadataRecordsGenerationParams.java  |  89 ---
 3 files changed, 204 insertions(+), 269 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index 329ff261f53..3537a6ddb40 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -329,12 +329,6 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableM
   LOG.warn("Metadata Table will need to be re-initialized as no instants 
were found");
   return true;
 }
-
-final String latestMetadataInstantTimestamp = 
latestMetadataInstant.get().getTimestamp();
-if (latestMetadataInstantTimestamp.startsWith(SOLO_COMMIT_TIMESTAMP)) { // 
the initialization timestamp is SOLO_COMMIT_TIMESTAMP + offset
-  return false;
-}
-
 return false;
   }
 
@@ -394,8 +388,8 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableM
 for (MetadataPartitionType partitionType : partitionsToInit) {
   // Find the commit timestamp to use for this partition. Each 
initialization should use its own unique commit time.
   String commitTimeForPartition = 
generateUniqueCommitInstantTime(initializationTime);
-
-  LOG.info("Initializing MDT partition " + partitionType.name() + " at 
instant " + commitTimeForPartition);
+  String partitionTypeName = partitionType.name();
+  LOG.info("Initializing MDT partition {} at instant {}", 
partitionTypeName, commitTimeForPartition);
 
   Pair> fileGroupCountAndRecordsPair;
   try {
@@ -413,24 +407,26 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableM
 fileGroupCountAndRecordsPair = initializeRecordIndexPartition();
 break;
   default:
-throw new HoodieMetadataException("Unsupported MDT partition type: 
" + partitionType);
+throw new HoodieMetadataException(String.format("Unsupported MDT 
partition type: %s", partitionType));
 }
   } catch (Exception e) {
 String metricKey = partitionType.getPartitionPath() + "_" + 
HoodieMetadataMetrics.BOOTSTRAP_ERR_STR;
 metrics.ifPresent(m -> m.setMetric(metricKey, 1));
-LOG.error("Bootstrap on " + partitionType.getPartitionPath() + " 
partition failed for "
-+ metadataMetaClient.getBasePath(), e);
-throw new HoodieMetadataException(partitionType.getPartitionPath()
-+ " bootstrap failed for " + metadataMetaClient.getBasePath(), e);
+String errMsg = String.format("Bootstrap on %s partition failed for 
%s",
+partitionType.getPartitionPath(), 
metadataMetaClient.getBasePathV2());
+LOG.error(errMsg, e);
+throw new HoodieMetadataException(errMsg, e);
   }
 
-  LOG.info(String.format("Initializing %s index with %d mappings and %d 
file groups.", partitionType.name(), fileGroupCountAndRecordsPair.getKey(),
-  fileGroupCountAndRecordsPair.getValue().count()));
+  if (LOG.isInfoEnabled()) {
+LOG.info("Initializing {} index with {} mappings and {} file groups.", 
partitionTypeName, fileGroupCountAndRecordsPair.getKey(),
+fileGroupCountAndRecordsPair.getValue().count());
+  }
   HoodieTimer partitionInitTimer = HoodieTimer.start();
 
   // Generate the file groups
   final int fileGroupCount = fileGroupCountAndRecordsPair.getKey();
-  ValidationUtils.checkArgument(fileGroupCount > 0, "FileGroup count for 
MDT partition " + partitionType.name() + " should be > 0");
+  ValidationUtils.checkArgument(fileGroupCount > 0, "FileGroup count for 
MDT partition " + partitionTypeName + " should be > 0");
   initializeFileGroups(dataMetaClient, partitionType, 
commitTimeForPartition, fileGroupCount);
 
   // Perform the commit using bulkCommit
@@ -441,7 +437,7 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableM
   // initialize the metadata reader again so the MDT partition can be read 
after initialization
   initMetadataReader();
   long totalInitTime = partitionInitTimer.endTimer();
-  LOG.info(Stri

(hudi) 13/28: [HUDI-7606] Unpersist RDDs after table services, mainly compaction and clustering (#11000)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 1b6e36140cb9bc35a15ffd99e6cb928dcd73fc76
Author: Rajesh Mahindra <76502047+rmahindra...@users.noreply.github.com>
AuthorDate: Sun Apr 14 14:38:55 2024 -0700

[HUDI-7606] Unpersist RDDs after table services, mainly compaction and 
clustering (#11000)

-

Co-authored-by: rmahindra123 
---
 .../hudi/client/BaseHoodieTableServiceClient.java  | 12 
 .../apache/hudi/client/BaseHoodieWriteClient.java  |  2 +-
 .../hudi/client/SparkRDDTableServiceClient.java|  6 ++
 .../apache/hudi/client/SparkRDDWriteClient.java| 21 +--
 .../hudi/client/utils/SparkReleaseResources.java   | 64 ++
 5 files changed, 85 insertions(+), 20 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
index e408dc7a779..d6ec07b89d0 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
@@ -331,6 +331,7 @@ public abstract class BaseHoodieTableServiceClient 
extends BaseHoodieCl
   CompactHelpers.getInstance().completeInflightCompaction(table, 
compactionCommitTime, metadata);
 } finally {
   this.txnManager.endTransaction(Option.of(compactionInstant));
+  releaseResources(compactionCommitTime);
 }
 WriteMarkersFactory.get(config.getMarkersType(), table, 
compactionCommitTime)
 .quietDeleteMarkerDir(context, config.getMarkersDeleteParallelism());
@@ -391,6 +392,7 @@ public abstract class BaseHoodieTableServiceClient 
extends BaseHoodieCl
   CompactHelpers.getInstance().completeInflightLogCompaction(table, 
logCompactionCommitTime, metadata);
 } finally {
   this.txnManager.endTransaction(Option.of(logCompactionInstant));
+  releaseResources(logCompactionCommitTime);
 }
 WriteMarkersFactory.get(config.getMarkersType(), table, 
logCompactionCommitTime)
 .quietDeleteMarkerDir(context, config.getMarkersDeleteParallelism());
@@ -520,6 +522,7 @@ public abstract class BaseHoodieTableServiceClient 
extends BaseHoodieCl
   throw new HoodieClusteringException("unable to transition clustering 
inflight to complete: " + clusteringCommitTime, e);
 } finally {
   this.txnManager.endTransaction(Option.of(clusteringInstant));
+  releaseResources(clusteringCommitTime);
 }
 WriteMarkersFactory.get(config.getMarkersType(), table, 
clusteringCommitTime)
 .quietDeleteMarkerDir(context, config.getMarkersDeleteParallelism());
@@ -759,6 +762,7 @@ public abstract class BaseHoodieTableServiceClient 
extends BaseHoodieCl
   + " Earliest Retained Instant :" + 
metadata.getEarliestCommitToRetain()
   + " cleanerElapsedMs" + durationMs);
 }
+releaseResources(cleanInstantTime);
 return metadata;
   }
 
@@ -1133,4 +1137,12 @@ public abstract class BaseHoodieTableServiceClient extends BaseHoodieCl
   }
 }
   }
+
+  /**
+   * Called after each commit of a compaction or clustering table service,
+   * to release any resources used.
+   */
+  protected void releaseResources(String instantTime) {
+// do nothing here
+  }
 }
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java
index d5d74e94673..fdc9eeca90d 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java
@@ -237,11 +237,11 @@ public abstract class BaseHoodieWriteClient 
extends BaseHoodieClient
   commit(table, commitActionType, instantTime, metadata, stats, 
writeStatuses);
   postCommit(table, metadata, instantTime, extraMetadata);
   LOG.info("Committed " + instantTime);
-  releaseResources(instantTime);
 } catch (IOException e) {
   throw new HoodieCommitException("Failed to complete commit " + 
config.getBasePath() + " at time " + instantTime, e);
 } finally {
   this.txnManager.endTransaction(Option.of(inflightInstant));
+  releaseResources(instantTime);
 }
 
 // trigger clean and archival.
diff --git 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDTableServiceClient.java
 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDTableServiceClient.java
index 54d91fae3cf..98914be7496 100644
--- 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDTableServiceClient.java
+++ 
b/hudi-

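The pattern being enforced above: whatever a table-service commit caches must
be released on every exit path, which is why releaseResources(instantTime) now
sits in the finally blocks, with the Spark client unpersisting its RDDs there
(see SparkReleaseResources in the file list). A generic sketch of the same
discipline (Java/Spark; names are illustrative, not Hudi's actual cleanup
code):

  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.storage.StorageLevel;

  class ReleaseResourcesSketch {
    // Sketch only: cache for the duration of the operation, always unpersist afterwards.
    static long countWithCleanup(JavaRDD<String> input) {
      JavaRDD<String> cached = input.persist(StorageLevel.MEMORY_AND_DISK());
      try {
        return cached.count();
      } finally {
        cached.unpersist();
      }
    }
  }
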
(hudi) 24/28: [HUDI-7633] Use try with resources for AutoCloseable (#11045)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 5d118c06b915c3c483234029595f4887e3d8d061
Author: Y Ethan Guo 
AuthorDate: Wed Apr 17 21:31:44 2024 -0700

[HUDI-7633] Use try with resources for AutoCloseable (#11045)
---
 .../hudi/cli/commands/ArchivedCommitsCommand.java  | 104 
 .../apache/hudi/cli/commands/ExportCommand.java|  93 +++---
 .../hudi/cli/commands/HoodieLogFileCommand.java| 104 
 .../org/apache/hudi/cli/commands/TableCommand.java |   6 +-
 .../metadata/HoodieBackedTableMetadataWriter.java  |   8 +-
 .../hudi/common/model/HoodiePartitionMetadata.java |   8 +-
 .../hudi/common/table/log/LogReaderUtils.java  |  22 ++--
 .../table/log/block/HoodieAvroDataBlock.java   | 135 ++---
 .../hudi/common/util/SerializationUtils.java   |   6 +-
 .../hudi/metadata/HoodieBackedTableMetadata.java   |  24 ++--
 .../java/HoodieJavaWriteClientExample.java |  70 +--
 .../examples/spark/HoodieWriteClientExample.java   |  90 +++---
 .../org/apache/hudi/common/util/FileIOUtils.java   |  14 +--
 .../hudi/utilities/HoodieCompactionAdminTool.java  |   9 +-
 .../utilities/streamer/SchedulerConfGenerator.java |   6 +-
 15 files changed, 344 insertions(+), 355 deletions(-)

diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java
 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java
index 075a57d541c..5c57c8f5288 100644
--- 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java
+++ 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java
@@ -114,47 +114,46 @@ public class ArchivedCommitsCommand {
 List allStats = new ArrayList<>();
 for (FileStatus fs : fsStatuses) {
   // read the archived file
-  Reader reader = HoodieLogFormat.newReader(HadoopFSUtils.getFs(basePath, 
HoodieCLI.conf),
-  new HoodieLogFile(fs.getPath()), 
HoodieArchivedMetaEntry.getClassSchema());
-
-  List readRecords = new ArrayList<>();
-  // read the avro blocks
-  while (reader.hasNext()) {
-HoodieAvroDataBlock blk = (HoodieAvroDataBlock) reader.next();
-blk.getRecordIterator(HoodieRecordType.AVRO).forEachRemaining(r -> 
readRecords.add((IndexedRecord) r.getData()));
+  try (Reader reader = 
HoodieLogFormat.newReader(HadoopFSUtils.getFs(basePath, HoodieCLI.conf),
+  new HoodieLogFile(fs.getPath()), 
HoodieArchivedMetaEntry.getClassSchema())) {
+List readRecords = new ArrayList<>();
+// read the avro blocks
+while (reader.hasNext()) {
+  HoodieAvroDataBlock blk = (HoodieAvroDataBlock) reader.next();
+  blk.getRecordIterator(HoodieRecordType.AVRO).forEachRemaining(r -> 
readRecords.add((IndexedRecord) r.getData()));
+}
+List readCommits = readRecords.stream().map(r -> 
(GenericRecord) r)
+.filter(r -> 
r.get("actionType").toString().equals(HoodieTimeline.COMMIT_ACTION)
+|| 
r.get("actionType").toString().equals(HoodieTimeline.DELTA_COMMIT_ACTION))
+.flatMap(r -> {
+  HoodieCommitMetadata metadata = (HoodieCommitMetadata) 
SpecificData.get()
+  .deepCopy(HoodieCommitMetadata.SCHEMA$, 
r.get("hoodieCommitMetadata"));
+  final String instantTime = r.get("commitTime").toString();
+  final String action = r.get("actionType").toString();
+  return 
metadata.getPartitionToWriteStats().values().stream().flatMap(hoodieWriteStats 
-> hoodieWriteStats.stream().map(hoodieWriteStat -> {
+List row = new ArrayList<>();
+row.add(action);
+row.add(instantTime);
+row.add(hoodieWriteStat.getPartitionPath());
+row.add(hoodieWriteStat.getFileId());
+row.add(hoodieWriteStat.getPrevCommit());
+row.add(hoodieWriteStat.getNumWrites());
+row.add(hoodieWriteStat.getNumInserts());
+row.add(hoodieWriteStat.getNumDeletes());
+row.add(hoodieWriteStat.getNumUpdateWrites());
+row.add(hoodieWriteStat.getTotalLogFiles());
+row.add(hoodieWriteStat.getTotalLogBlocks());
+row.add(hoodieWriteStat.getTotalCorruptLogBlock());
+row.add(hoodieWriteStat.getTotalRollbackBlocks());
+row.add(hoodieWriteStat.getTotalLogRecords());
+row.add(hoodieWriteStat.getTotalUpdatedRecordsCompacted());
+row.add(hoodieWriteStat.getTotalWriteBytes());
+row.add(hoodieWriteStat.getTotalWriteErrors());
+return row;
+  })).map(rowList -> rowList.toArray(new Comparable[0]));
+}).collect(Collectors.toList());
+allStats.a

(hudi) 14/28: [HUDI-7615] Mark a few write configs with the correct sinceVersion (#11012)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 0c06b3ce011a5dd2f0c256543d88ad2380afadcb
Author: FreeTao 
AuthorDate: Sun Apr 14 18:36:22 2024 -0700

[HUDI-7615] Mark a few write configs with the correct sinceVersion (#11012)
---
 .../main/java/org/apache/hudi/keygen/constant/KeyGeneratorOptions.java   | 1 +
 1 file changed, 1 insertion(+)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/keygen/constant/KeyGeneratorOptions.java
 
b/hudi-common/src/main/java/org/apache/hudi/keygen/constant/KeyGeneratorOptions.java
index db4a9162129..3273a4fc49b 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/keygen/constant/KeyGeneratorOptions.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/keygen/constant/KeyGeneratorOptions.java
@@ -63,6 +63,7 @@ public class KeyGeneratorOptions extends HoodieConfig {
   public static final ConfigProperty 
KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED = ConfigProperty
   
.key("hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled")
   .defaultValue("false")
+  .sinceVersion("0.10.1")
   .markAdvanced()
   .withDocumentation("When set to true, consistent value will be generated 
for a logical timestamp type column, "
   + "like timestamp-millis and timestamp-micros, irrespective of 
whether row-writer is enabled. Disabled by default so "



(hudi) 07/28: [HUDI-6441] Passing custom Headers with Hudi Callback URL (#10970)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 741bd7841133074f1e4ae9cda8090569b535a29f
Author: Vova Kolmakov 
AuthorDate: Thu Apr 11 21:16:14 2024 +0700

[HUDI-6441] Passing custom Headers with Hudi Callback URL (#10970)
---
 .../http/HoodieWriteCommitHttpCallbackClient.java  |  46 -
 .../config/HoodieWriteCommitCallbackConfig.java|  15 ++
 .../client/http/TestCallbackHttpClient.java| 202 +
 .../hudi/callback/http/TestCallbackHttpClient.java | 143 ---
 4 files changed, 260 insertions(+), 146 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/callback/client/http/HoodieWriteCommitHttpCallbackClient.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/callback/client/http/HoodieWriteCommitHttpCallbackClient.java
index d9248ed20f1..037e84b3d00 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/callback/client/http/HoodieWriteCommitHttpCallbackClient.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/callback/client/http/HoodieWriteCommitHttpCallbackClient.java
@@ -18,6 +18,8 @@
 
 package org.apache.hudi.callback.client.http;
 
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.VisibleForTesting;
 import org.apache.hudi.config.HoodieWriteCommitCallbackConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
 
@@ -34,6 +36,9 @@ import org.slf4j.LoggerFactory;
 
 import java.io.Closeable;
 import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.StringTokenizer;
 
 /**
  * Write commit callback http client.
@@ -43,36 +48,42 @@ public class HoodieWriteCommitHttpCallbackClient implements 
Closeable {
   private static final Logger LOG = 
LoggerFactory.getLogger(HoodieWriteCommitHttpCallbackClient.class);
 
   public static final String HEADER_KEY_API_KEY = "HUDI-CALLBACK-KEY";
+  static final String HEADERS_DELIMITER = ";";
+  static final String HEADERS_KV_DELIMITER = ":";
 
   private final String apiKey;
   private final String url;
   private final CloseableHttpClient client;
   private HoodieWriteConfig writeConfig;
+  private final Map customHeaders;
 
   public HoodieWriteCommitHttpCallbackClient(HoodieWriteConfig config) {
 this.writeConfig = config;
 this.apiKey = getApiKey();
 this.url = getUrl();
 this.client = getClient();
+this.customHeaders = parseCustomHeaders();
   }
 
-  public HoodieWriteCommitHttpCallbackClient(String apiKey, String url, 
CloseableHttpClient client) {
+  public HoodieWriteCommitHttpCallbackClient(String apiKey, String url, 
CloseableHttpClient client, Map customHeaders) {
 this.apiKey = apiKey;
 this.url = url;
 this.client = client;
+this.customHeaders = customHeaders != null ? customHeaders : new 
HashMap<>();
   }
 
   public void send(String callbackMsg) {
 HttpPost request = new HttpPost(url);
 request.setHeader(HEADER_KEY_API_KEY, apiKey);
 request.setHeader(HttpHeaders.CONTENT_TYPE, 
ContentType.APPLICATION_JSON.toString());
+customHeaders.forEach(request::setHeader);
 request.setEntity(new StringEntity(callbackMsg, 
ContentType.APPLICATION_JSON));
 try (CloseableHttpResponse response = client.execute(request)) {
   int statusCode = response.getStatusLine().getStatusCode();
   if (statusCode >= 300) {
-LOG.warn(String.format("Failed to send callback message. Response was 
%s", response));
+LOG.warn("Failed to send callback message. Response was {}", response);
   } else {
-LOG.info(String.format("Sent Callback data to %s successfully !", 
url));
+LOG.info("Sent Callback data with {} custom headers to {} successfully 
!", customHeaders.size(), url);
   }
 } catch (IOException e) {
   LOG.warn("Failed to send callback.", e);
@@ -101,8 +112,37 @@ public class HoodieWriteCommitHttpCallbackClient 
implements Closeable {
 return 
writeConfig.getInt(HoodieWriteCommitCallbackConfig.CALLBACK_HTTP_TIMEOUT_IN_SECONDS);
   }
 
+  private Map parseCustomHeaders() {
+Map headers = new HashMap<>();
+String headersString = 
writeConfig.getString(HoodieWriteCommitCallbackConfig.CALLBACK_HTTP_CUSTOM_HEADERS);
+if (!StringUtils.isNullOrEmpty(headersString)) {
+  StringTokenizer tokenizer = new StringTokenizer(headersString, 
HEADERS_DELIMITER);
+  while (tokenizer.hasMoreTokens()) {
+String token = tokenizer.nextToken();
+if (!StringUtils.isNullOrEmpty(token)) {
+  String[] keyValue = token.split(HEADERS_KV_DELIMITER);
+  if (keyValue.length == 2) {
+String trimKey = keyValue[0].trim();
+String trimValue = keyValue[1].trim();
+if (trimKey.length() > 0 && trimValue.length() > 0) {
+  headers.put(trimKey, trimValue);
+
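
For reference, the custom headers value is parsed using HEADERS_DELIMITER (";")
between headers and HEADERS_KV_DELIMITER (":") between key and value. A
minimal, self-contained sketch of that format in action (the header names and
values below are hypothetical, not taken from the patch):

import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class CustomHeaderParseDemo {
  public static void main(String[] args) {
    // Hypothetical config value; the expected format is "key1:value1;key2:value2".
    String headersString = "X-Team:data-platform;X-Env:staging";
    Map<String, String> headers = new HashMap<>();
    StringTokenizer tokenizer = new StringTokenizer(headersString, ";");
    while (tokenizer.hasMoreTokens()) {
      String[] kv = tokenizer.nextToken().split(":");
      if (kv.length == 2 && !kv[0].trim().isEmpty() && !kv[1].trim().isEmpty()) {
        headers.put(kv[0].trim(), kv[1].trim());
      }
    }
    System.out.println(headers); // e.g. {X-Env=staging, X-Team=data-platform}
  }
}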

(hudi) 16/28: [HUDI-7619] Removed code duplicates in HoodieTableMetadataUtil (#11022)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 23637b7a07687b980d58df7f34eef96530136a55
Author: Vova Kolmakov 
AuthorDate: Tue May 14 16:01:09 2024 -0700

[HUDI-7619] Removed code duplicates in HoodieTableMetadataUtil (#11022)

Co-authored-by: Vova Kolmakov 
---
 .../hudi/metadata/HoodieTableMetadataUtil.java | 92 +-
 1 file changed, 36 insertions(+), 56 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
 
b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
index b25d6741b83..503e3351d8c 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
@@ -73,6 +73,7 @@ import org.apache.hudi.exception.HoodieMetadataException;
 import org.apache.hudi.hadoop.fs.HadoopFSUtils;
 import org.apache.hudi.io.storage.HoodieFileReader;
 import org.apache.hudi.io.storage.HoodieFileReaderFactory;
+import org.apache.hudi.storage.StoragePath;
 import org.apache.hudi.util.Lazy;
 
 import org.apache.avro.AvroTypeException;
@@ -1749,26 +1750,7 @@ public class HoodieTableMetadataUtil {
   final String instantTime = baseFile.getCommitTime();
   HoodieFileReader reader = 
HoodieFileReaderFactory.getReaderFactory(HoodieRecord.HoodieRecordType.AVRO)
   .getFileReader(config, configuration.get(), dataFilePath);
-  ClosableIterator recordKeyIterator = 
reader.getRecordKeyIterator();
-
-  return new ClosableIterator() {
-@Override
-public void close() {
-  recordKeyIterator.close();
-}
-
-@Override
-public boolean hasNext() {
-  return recordKeyIterator.hasNext();
-}
-
-@Override
-public HoodieRecord next() {
-  return forDelete
-  ? 
HoodieMetadataPayload.createRecordIndexDelete(recordKeyIterator.next())
-  : 
HoodieMetadataPayload.createRecordIndexUpdate(recordKeyIterator.next(), 
partition, fileId, instantTime, 0);
-}
-  };
+  return getHoodieRecordIterator(reader.getRecordKeyIterator(), forDelete, 
partition, fileId, instantTime);
 });
   }
 
@@ -1816,24 +1798,7 @@ public class HoodieTableMetadataUtil {
 .withTableMetaClient(metaClient)
 .build();
 ClosableIterator recordKeyIterator = 
ClosableIterator.wrap(mergedLogRecordScanner.getRecords().keySet().iterator());
-return new ClosableIterator() {
-  @Override
-  public void close() {
-recordKeyIterator.close();
-  }
-
-  @Override
-  public boolean hasNext() {
-return recordKeyIterator.hasNext();
-  }
-
-  @Override
-  public HoodieRecord next() {
-return forDelete
-? 
HoodieMetadataPayload.createRecordIndexDelete(recordKeyIterator.next())
-: 
HoodieMetadataPayload.createRecordIndexUpdate(recordKeyIterator.next(), 
partition, fileSlice.getFileId(), fileSlice.getBaseInstantTime(), 0);
-  }
-};
+return getHoodieRecordIterator(recordKeyIterator, forDelete, 
partition, fileSlice.getFileId(), fileSlice.getBaseInstantTime());
   }
   final HoodieBaseFile baseFile = fileSlice.getBaseFile().get();
   final String filename = baseFile.getFileName();
@@ -1844,26 +1809,41 @@ public class HoodieTableMetadataUtil {
   HoodieConfig hoodieConfig = getReaderConfigs(configuration.get());
   HoodieFileReader reader = 
HoodieFileReaderFactory.getReaderFactory(HoodieRecord.HoodieRecordType.AVRO)
   .getFileReader(hoodieConfig, configuration.get(), dataFilePath);
-  ClosableIterator recordKeyIterator = 
reader.getRecordKeyIterator();
+  return getHoodieRecordIterator(reader.getRecordKeyIterator(), forDelete, 
partition, fileId, instantTime);
+});
+  }
 
-  return new ClosableIterator() {
-@Override
-public void close() {
-  recordKeyIterator.close();
-}
+  private static Path filePath(String basePath, String partition, String 
filename) {
+if (partition.isEmpty()) {
+  return new Path(basePath, filename);
+} else {
+  return new Path(basePath, partition + StoragePath.SEPARATOR + filename);
+}
+  }
 
-@Override
-public boolean hasNext() {
-  return recordKeyIterator.hasNext();
-}
+  private static ClosableIterator 
getHoodieRecordIterator(ClosableIterator recordKeyIterator,
+
boolean forDelete,
+String 
partition,
+String 
fileId,
+ 
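
The body of the extracted helper is cut off above, but its shape follows the
removed anonymous classes: wrap the record-key iterator and map each key to a
record-index delete or update. A sketch reconstructed from those removed
blocks (approximate, and assuming the imports already present in
HoodieTableMetadataUtil):

private static ClosableIterator<HoodieRecord> getHoodieRecordIterator(
    ClosableIterator<String> recordKeyIterator,
    boolean forDelete,
    String partition,
    String fileId,
    String instantTime) {
  return new ClosableIterator<HoodieRecord>() {
    @Override
    public void close() {
      recordKeyIterator.close();
    }

    @Override
    public boolean hasNext() {
      return recordKeyIterator.hasNext();
    }

    @Override
    public HoodieRecord next() {
      // Deletes only need the key; updates also carry the record location.
      return forDelete
          ? HoodieMetadataPayload.createRecordIndexDelete(recordKeyIterator.next())
          : HoodieMetadataPayload.createRecordIndexUpdate(
              recordKeyIterator.next(), partition, fileId, instantTime, 0);
    }
  };
}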

(hudi) 22/28: [HUDI-7626] Propagate UserGroupInformation from the main thread to the new thread of timeline service threadpool (#11039)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 585078471fe30273ec5ac1552fb5575c5beea369
Author: Jing Zhang 
AuthorDate: Wed Apr 17 16:40:29 2024 +0800

[HUDI-7626] Propagate UserGroupInformation from the main thread to the new 
thread of timeline service threadpool (#11039)
---
 .../hudi/timeline/service/RequestHandler.java  | 128 +++--
 1 file changed, 70 insertions(+), 58 deletions(-)

diff --git 
a/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java
 
b/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java
index 9385b4eca9e..12e11db403d 100644
--- 
a/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java
+++ 
b/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java
@@ -52,11 +52,13 @@ import io.javalin.http.Context;
 import io.javalin.http.Handler;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.security.UserGroupInformation;
 import org.jetbrains.annotations.NotNull;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
 import java.io.IOException;
+import java.security.PrivilegedExceptionAction;
 import java.util.Arrays;
 import java.util.List;
 import java.util.Map;
@@ -563,76 +565,86 @@ public class RequestHandler {
 
 private final Handler handler;
 private final boolean performRefreshCheck;
+private final UserGroupInformation ugi;
 
 ViewHandler(Handler handler, boolean performRefreshCheck) {
   this.handler = handler;
   this.performRefreshCheck = performRefreshCheck;
+  try {
+ugi = UserGroupInformation.getCurrentUser();
+  } catch (Exception e) {
+LOG.warn("Fail to get ugi", e);
+throw new HoodieException(e);
+  }
 }
 
 @Override
 public void handle(@NotNull Context context) throws Exception {
-  boolean success = true;
-  long beginTs = System.currentTimeMillis();
-  boolean synced = false;
-  boolean refreshCheck = performRefreshCheck && 
!isRefreshCheckDisabledInQuery(context);
-  long refreshCheckTimeTaken = 0;
-  long handleTimeTaken = 0;
-  long finalCheckTimeTaken = 0;
-  try {
-if (refreshCheck) {
-  long beginRefreshCheck = System.currentTimeMillis();
-  synced = syncIfLocalViewBehind(context);
-  long endRefreshCheck = System.currentTimeMillis();
-  refreshCheckTimeTaken = endRefreshCheck - beginRefreshCheck;
-}
+  ugi.doAs((PrivilegedExceptionAction) () -> {
+boolean success = true;
+long beginTs = System.currentTimeMillis();
+boolean synced = false;
+boolean refreshCheck = performRefreshCheck && 
!isRefreshCheckDisabledInQuery(context);
+long refreshCheckTimeTaken = 0;
+long handleTimeTaken = 0;
+long finalCheckTimeTaken = 0;
+try {
+  if (refreshCheck) {
+long beginRefreshCheck = System.currentTimeMillis();
+synced = syncIfLocalViewBehind(context);
+long endRefreshCheck = System.currentTimeMillis();
+refreshCheckTimeTaken = endRefreshCheck - beginRefreshCheck;
+  }
 
-long handleBeginMs = System.currentTimeMillis();
-handler.handle(context);
-long handleEndMs = System.currentTimeMillis();
-handleTimeTaken = handleEndMs - handleBeginMs;
-
-if (refreshCheck) {
-  long beginFinalCheck = System.currentTimeMillis();
-  if (isLocalViewBehind(context)) {
-String lastKnownInstantFromClient = 
context.queryParamAsClass(RemoteHoodieTableFileSystemView.LAST_INSTANT_TS, 
String.class).getOrDefault(HoodieTimeline.INVALID_INSTANT_TS);
-String timelineHashFromClient = 
context.queryParamAsClass(RemoteHoodieTableFileSystemView.TIMELINE_HASH, 
String.class).getOrDefault("");
-HoodieTimeline localTimeline =
-
viewManager.getFileSystemView(context.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM)).getTimeline();
-if (shouldThrowExceptionIfLocalViewBehind(localTimeline, 
timelineHashFromClient)) {
-  String errMsg =
-  "Last known instant from client was "
-  + lastKnownInstantFromClient
-  + " but server has the following timeline "
-  + localTimeline.getInstants();
-  throw new BadRequestResponse(errMsg);
+  long handleBeginMs = System.currentTimeMillis();
+  handler.handle(context);
+  long handleEndMs = System.currentTimeMillis();
+  handleTimeTaken = handleEndMs - handleBeginMs;
+
+  if (refreshCheck) {
+long beginFinalCheck = System.currentTimeMillis();
+if (isLocalVie
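
The pattern above captures the caller's UserGroupInformation when the handler
is constructed and re-applies it via doAs() on the Javalin worker thread, so
Hadoop calls made by the timeline service keep the submitting user's identity.
A stripped-down, standalone sketch of that propagation pattern (the executor
and the work inside doAs() are illustrative only):

import org.apache.hadoop.security.UserGroupInformation;

import java.security.PrivilegedExceptionAction;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class UgiPropagationSketch {
  public static void main(String[] args) throws Exception {
    // Capture the identity on the submitting thread...
    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
    ExecutorService pool = Executors.newSingleThreadExecutor();
    // ...and run the work under doAs() on the pool thread.
    pool.submit(() -> ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
      System.out.println("Running as " + UserGroupInformation.getCurrentUser());
      return null;
    }));
    pool.shutdown();
  }
}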

(hudi) 28/28: [HUDI-7637] Make StoragePathInfo Comparable (#11050)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 8bc56c52d6c71f125dd00acc695d0e065141188e
Author: Y Ethan Guo 
AuthorDate: Thu Apr 18 05:51:23 2024 -0700

[HUDI-7637] Make StoragePathInfo Comparable (#11050)
---
 .../main/java/org/apache/hudi/storage/StoragePathInfo.java  |  7 ++-
 .../org/apache/hudi/io/storage/TestStoragePathInfo.java | 13 +
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/hudi-io/src/main/java/org/apache/hudi/storage/StoragePathInfo.java 
b/hudi-io/src/main/java/org/apache/hudi/storage/StoragePathInfo.java
index e4711bf72dd..1c1ebc32a2f 100644
--- a/hudi-io/src/main/java/org/apache/hudi/storage/StoragePathInfo.java
+++ b/hudi-io/src/main/java/org/apache/hudi/storage/StoragePathInfo.java
@@ -31,7 +31,7 @@ import java.io.Serializable;
  * with simplification based on what Hudi needs.
  */
 @PublicAPIClass(maturity = ApiMaturityLevel.EVOLVING)
-public class StoragePathInfo implements Serializable {
+public class StoragePathInfo implements Serializable, 
Comparable {
   private final StoragePath path;
   private final long length;
   private final boolean isDirectory;
@@ -109,6 +109,11 @@ public class StoragePathInfo implements Serializable {
 return modificationTime;
   }
 
+  @Override
+  public int compareTo(StoragePathInfo o) {
+return this.getPath().compareTo(o.getPath());
+  }
+
   @Override
   public boolean equals(Object o) {
 if (this == o) {
diff --git 
a/hudi-io/src/test/java/org/apache/hudi/io/storage/TestStoragePathInfo.java 
b/hudi-io/src/test/java/org/apache/hudi/io/storage/TestStoragePathInfo.java
index 72640c5e3df..95cf4d798a4 100644
--- a/hudi-io/src/test/java/org/apache/hudi/io/storage/TestStoragePathInfo.java
+++ b/hudi-io/src/test/java/org/apache/hudi/io/storage/TestStoragePathInfo.java
@@ -71,6 +71,19 @@ public class TestStoragePathInfo {
 }
   }
 
+  @Test
+  public void testCompareTo() {
+StoragePathInfo pathInfo1 = new StoragePathInfo(
+new StoragePath(PATH1), LENGTH, false, BLOCK_REPLICATION, BLOCK_SIZE, 
MODIFICATION_TIME);
+StoragePathInfo pathInfo2 = new StoragePathInfo(
+new StoragePath(PATH1), LENGTH + 2, false, BLOCK_REPLICATION, 
BLOCK_SIZE, MODIFICATION_TIME + 2L);
+StoragePathInfo pathInfo3 = new StoragePathInfo(
+new StoragePath(PATH2), LENGTH, false, BLOCK_REPLICATION, BLOCK_SIZE, 
MODIFICATION_TIME);
+
+assertEquals(0, pathInfo1.compareTo(pathInfo2));
+assertEquals(-1, pathInfo1.compareTo(pathInfo3));
+  }
+
   @Test
   public void testEquals() {
 StoragePathInfo pathInfo1 = new StoragePathInfo(
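
With compareTo() delegating to the path, listings of StoragePathInfo can now be
ordered deterministically before processing. A minimal usage sketch (assuming
"listing" is some List<StoragePathInfo> obtained elsewhere and that
java.util.Collections is imported):

Collections.sort(listing);                       // ordered by StoragePath
listing.forEach(info -> System.out.println(info.getPath()));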



(hudi) 23/28: [HUDI-4228] Clean up literal usage in Hudi CLI argument check (#11042)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 9bb09d7b78d809ef1f1748915d9761bef63d71fa
Author: Vova Kolmakov 
AuthorDate: Thu Apr 18 09:14:32 2024 +0700

[HUDI-4228] Clean up literal usage in Hudi CLI argument check (#11042)
---
 .../org/apache/hudi/cli/commands/SparkMain.java| 201 +++--
 .../org/apache/hudi/cli/ArchiveExecutorUtils.java  |   2 +-
 2 files changed, 69 insertions(+), 134 deletions(-)

diff --git a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
index 742540d0ff5..c312deaf6c3 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
@@ -19,14 +19,13 @@
 package org.apache.hudi.cli.commands;
 
 import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.cli.ArchiveExecutorUtils;
 import org.apache.hudi.cli.utils.SparkUtil;
 import org.apache.hudi.client.HoodieTimelineArchiver;
 import org.apache.hudi.client.SparkRDDWriteClient;
 import org.apache.hudi.client.common.HoodieSparkEngineContext;
-import org.apache.hudi.common.config.HoodieMetadataConfig;
 import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.engine.HoodieEngineContext;
-import org.apache.hudi.common.model.HoodieAvroPayload;
 import org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy;
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.model.WriteOperationType;
@@ -37,7 +36,6 @@ import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.PartitionPathEncodeUtils;
 import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.common.util.ValidationUtils;
-import org.apache.hudi.config.HoodieArchivalConfig;
 import org.apache.hudi.config.HoodieBootstrapConfig;
 import org.apache.hudi.config.HoodieCleanConfig;
 import org.apache.hudi.config.HoodieIndexConfig;
@@ -99,16 +97,45 @@ public class SparkMain {
* Commands.
*/
   enum SparkCommand {
-BOOTSTRAP, ROLLBACK, DEDUPLICATE, ROLLBACK_TO_SAVEPOINT, SAVEPOINT, 
IMPORT, UPSERT, COMPACT_SCHEDULE, COMPACT_RUN, COMPACT_SCHEDULE_AND_EXECUTE,
-COMPACT_UNSCHEDULE_PLAN, COMPACT_UNSCHEDULE_FILE, COMPACT_VALIDATE, 
COMPACT_REPAIR, CLUSTERING_SCHEDULE,
-CLUSTERING_RUN, CLUSTERING_SCHEDULE_AND_EXECUTE, CLEAN, DELETE_MARKER, 
DELETE_SAVEPOINT, UPGRADE, DOWNGRADE,
-REPAIR_DEPRECATED_PARTITION, RENAME_PARTITION, ARCHIVE
+BOOTSTRAP(18), ROLLBACK(6), DEDUPLICATE(8), ROLLBACK_TO_SAVEPOINT(6), 
SAVEPOINT(7),
+IMPORT(13), UPSERT(13), COMPACT_SCHEDULE(7), COMPACT_RUN(10), 
COMPACT_SCHEDULE_AND_EXECUTE(9),
+COMPACT_UNSCHEDULE_PLAN(9), COMPACT_UNSCHEDULE_FILE(10), 
COMPACT_VALIDATE(7), COMPACT_REPAIR(8),
+CLUSTERING_SCHEDULE(7), CLUSTERING_RUN(9), 
CLUSTERING_SCHEDULE_AND_EXECUTE(8), CLEAN(5),
+DELETE_MARKER(5), DELETE_SAVEPOINT(5), UPGRADE(5), DOWNGRADE(5),
+REPAIR_DEPRECATED_PARTITION(4), RENAME_PARTITION(6), ARCHIVE(8);
+
+private final int minArgsCount;
+
+SparkCommand(int minArgsCount) {
+  this.minArgsCount = minArgsCount;
+}
+
+void assertEq(int factArgsCount) {
+  ValidationUtils.checkArgument(factArgsCount == minArgsCount);
+}
+
+void assertGtEq(int factArgsCount) {
+  ValidationUtils.checkArgument(factArgsCount >= minArgsCount);
+}
+
+List makeConfigs(String[] args) {
+  List configs = new ArrayList<>();
+  if (args.length > minArgsCount) {
+configs.addAll(Arrays.asList(args).subList(minArgsCount, args.length));
+  }
+  return configs;
+}
+
+String getPropsFilePath(String[] args) {
+  return (args.length >= minArgsCount && 
!StringUtils.isNullOrEmpty(args[minArgsCount - 1]))
+  ? args[minArgsCount - 1] : null;
+}
   }
 
-  public static void main(String[] args) throws Exception {
+  public static void main(String[] args) {
 ValidationUtils.checkArgument(args.length >= 4);
 final String commandString = args[0];
-LOG.info("Invoking SparkMain: " + commandString);
+LOG.info("Invoking SparkMain: {}", commandString);
 final SparkCommand cmd = SparkCommand.valueOf(commandString);
 
 JavaSparkContext jsc = SparkUtil.initJavaSparkContext("hoodie-cli-" + 
commandString,
@@ -116,193 +143,112 @@ public class SparkMain {
 
 int returnCode = 0;
 try {
+  cmd.assertGtEq(args.length);
+  List configs = cmd.makeConfigs(args);
+  String propsFilePath = cmd.getPropsFilePath(args);
   switch (cmd) {
 case ROLLBACK:
-  assert (args.length == 6);
+  cmd.assertEq(args.length);
   returnCode = rollback(jsc, args[3], args[4], 
Boolean.parseBoolean(args[5]));
   break;
 case DEDUPLICATE:
-  assert (args.length == 8);
+  cmd.assertEq(a
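
The refactor replaces scattered "assert (args.length == N)" literals with a
per-command minimum arity carried by the enum itself. A stripped-down sketch of
the same pattern in isolation (the command names and counts below are made up,
not the real SparkMain values):

enum Command {
  ROLLBACK(3), CLEAN(1);

  private final int minArgsCount;

  Command(int minArgsCount) {
    this.minArgsCount = minArgsCount;
  }

  void assertEq(int actual) {
    if (actual != minArgsCount) {
      throw new IllegalArgumentException(
          name() + " expects " + minArgsCount + " arguments, got " + actual);
    }
  }
}

// Usage: Command.valueOf(commandString).assertEq(args.length - 1);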

(hudi) 08/28: [HUDI-7605] Allow merger strategy to be set in spark sql writer (#10999)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit ebd8a7d9690e6b86187559a7038840523cec621a
Author: Jon Vexler 
AuthorDate: Thu Apr 11 21:20:07 2024 -0400

[HUDI-7605] Allow merger strategy to be set in spark sql writer (#10999)
---
 .../scala/org/apache/hudi/HoodieSparkSqlWriter.scala |  1 +
 .../apache/hudi/functional/TestMORDataSource.scala   | 20 
 2 files changed, 21 insertions(+)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
index 7020781faf0..ad19ec48c7a 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
@@ -302,6 +302,7 @@ class HoodieSparkSqlWriterInternal {
   .setPartitionMetafileUseBaseFormat(useBaseFormatMetaFile)
   
.setShouldDropPartitionColumns(hoodieConfig.getBooleanOrDefault(HoodieTableConfig.DROP_PARTITION_COLUMNS))
   .setCommitTimezone(timelineTimeZone)
+  
.setRecordMergerStrategy(hoodieConfig.getStringOrDefault(DataSourceWriteOptions.RECORD_MERGER_STRATEGY))
   .initTable(sparkContext.hadoopConfiguration, path)
   }
   val instantTime = HoodieActiveTimeline.createNewInstantTime()
diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
index 45bd3c645d4..b878eb76c40 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
@@ -1403,4 +1403,24 @@ class TestMORDataSource extends 
HoodieSparkClientTestBase with SparkDatasetMixin
   basePath
 }
   }
+
+  @Test
+  def testMergerStrategySet(): Unit = {
+val (writeOpts, _) = getWriterReaderOpts()
+val input = recordsToStrings(dataGen.generateInserts("000", 1)).asScala
+val inputDf= spark.read.json(spark.sparkContext.parallelize(input, 1))
+val mergerStrategyName = "example_merger_strategy"
+inputDf.write.format("hudi")
+  .options(writeOpts)
+  .option(DataSourceWriteOptions.TABLE_TYPE.key, "MERGE_ON_READ")
+  .option(DataSourceWriteOptions.OPERATION.key, 
DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+  .option(DataSourceWriteOptions.RECORD_MERGER_STRATEGY.key(), 
mergerStrategyName)
+  .mode(SaveMode.Overwrite)
+  .save(basePath)
+metaClient = HoodieTableMetaClient.builder()
+  .setBasePath(basePath)
+  .setConf(spark.sessionState.newHadoopConf)
+  .build()
+assertEquals(metaClient.getTableConfig.getRecordMergerStrategy, 
mergerStrategyName)
+  }
 }



Re: [PR] [HUDI-1517] create marker file for every log file [hudi]

2024-05-14 Thread via GitHub


danny0405 commented on PR #11187:
URL: https://github.com/apache/hudi/pull/11187#issuecomment-2111382919

   > @danny0405 sorry, I don't quite understand. For example, a write task 
creates a new a_0-0-1.logfile, and its speculative attempt also creates a new 
a_0-0-2.logfile; the two files contain the same data, each with its own log 
marker file. The Spark driver will receive the file meta stats from whichever 
attempt finishes first, so the other file is orphaned, and I think we need to 
delete it when resolving the file conflict using the log marker files. Then, 
in the next commit, the remaining file can be appended to.
   
   The two attempts appended to the same file.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction (#10915)

2024-05-14 Thread jonvex
This is an automated email from the ASF dual-hosted git repository.

jonvex pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 9e9bb9006b7 [HUDI-7532] Include only compaction instants for 
lastCompaction in getDeltaCommitsSinceLatestCompaction (#10915)
9e9bb9006b7 is described below

commit 9e9bb9006b73a3363e945cb69582467ae6506efa
Author: Sivabalan Narayanan 
AuthorDate: Tue May 14 17:41:28 2024 -0700

[HUDI-7532] Include only compaction instants for lastCompaction in 
getDeltaCommitsSinceLatestCompaction (#10915)

* Fixing schedule compaction bug

* Addressing comments

* Fixing CDC tests
---
 .../hudi/cli/commands/CompactionCommand.java   |  2 +-
 .../hudi/cli/commands/FileSystemViewCommand.java   |  2 +-
 .../hudi/cli/commands/HoodieLogFileCommand.java|  2 +-
 .../apache/hudi/cli/commands/RepairsCommand.java   |  4 +-
 .../org/apache/hudi/cli/commands/StatsCommand.java |  2 +-
 .../java/org/apache/hudi/cli/utils/CommitUtil.java |  2 +-
 .../apache/hudi/cli/commands/TestTableCommand.java |  6 +--
 .../hudi/cli/integ/ITTestSavepointsCommand.java|  6 +--
 .../index/bucket/ConsistentBucketIndexUtils.java   |  2 +-
 .../metadata/HoodieBackedTableMetadataWriter.java  |  2 +-
 .../table/action/commit/JavaUpsertPartitioner.java |  2 +-
 .../hudi/client/TestJavaHoodieBackedMetadata.java  | 14 +++---
 .../TestHoodieJavaClientOnCopyOnWriteStorage.java  |  2 +-
 .../testutils/HoodieJavaClientTestHarness.java |  2 +-
 .../java/org/apache/hudi/client/TestMultiFS.java   |  4 +-
 .../hudi/client/TestTableSchemaEvolution.java  |  4 +-
 .../functional/TestHoodieBackedMetadata.java   | 14 +++---
 .../TestHoodieClientOnCopyOnWriteStorage.java  |  2 +-
 .../org/apache/hudi/io/TestHoodieMergeHandle.java  |  8 ++--
 .../apache/hudi/io/TestHoodieTimelineArchiver.java |  8 ++--
 .../java/org/apache/hudi/table/TestCleaner.java|  2 +-
 .../hudi/table/TestHoodieMergeOnReadTable.java |  6 +--
 .../table/action/compact/TestInlineCompaction.java |  6 +--
 .../TestCopyOnWriteRollbackActionExecutor.java |  2 +-
 ...dieSparkMergeOnReadTableInsertUpdateDelete.java |  4 +-
 .../TestHoodieSparkMergeOnReadTableRollback.java   |  6 +--
 .../hudi/testutils/HoodieClientTestBase.java   |  2 +-
 .../SparkClientFunctionalTestHarness.java  |  4 +-
 .../hudi/common/table/HoodieTableMetaClient.java   |  6 +--
 .../table/timeline/HoodieDefaultTimeline.java  | 11 -
 .../apache/hudi/common/util/CompactionUtils.java   |  3 +-
 .../hudi/metadata/HoodieBackedTableMetadata.java   |  2 +-
 .../common/table/TestHoodieTableMetaClient.java|  6 +--
 .../hudi/common/table/TestTimelineUtils.java   | 12 ++---
 .../table/timeline/TestHoodieActiveTimeline.java   | 44 +++---
 .../hudi/common/util/TestCompactionUtils.java  | 53 ++
 .../RepairAddpartitionmetaProcedure.scala  |  2 +-
 .../RepairMigratePartitionMetaProcedure.scala  |  2 +-
 .../ShowHoodieLogFileRecordsProcedure.scala|  2 +-
 .../StatsWriteAmplificationProcedure.scala |  2 +-
 .../procedures/ValidateHoodieSyncProcedure.scala   |  2 +-
 .../src/test/java/HoodieJavaStreamingApp.java  |  4 +-
 .../hudi/functional/TestMORDataSourceStorage.scala |  2 +-
 .../hudi/functional/TestStructuredStreaming.scala  |  2 +-
 .../functional/cdc/TestCDCDataFrameSuite.scala | 26 ++-
 .../sql/hudi/procedure/TestRepairsProcedure.scala  |  8 ++--
 .../deltastreamer/HoodieDeltaStreamerTestBase.java |  4 +-
 .../deltastreamer/TestHoodieDeltaStreamer.java |  2 +-
 48 files changed, 195 insertions(+), 122 deletions(-)

diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java
index 135bd48c036..ae572ce2354 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java
@@ -320,7 +320,7 @@ public class CompactionCommand {
 .filter(pair -> pair.getRight() != null)
 .collect(Collectors.toList());
 
-Set committedInstants = 
timeline.getCommitTimeline().filterCompletedInstants()
+Set committedInstants = 
timeline.getCommitAndReplaceTimeline().filterCompletedInstants()
 
.getInstantsAsStream().map(HoodieInstant::getTimestamp).collect(Collectors.toSet());
 
 List rows = new ArrayList<>();
diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/FileSystemViewCommand.java
 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/FileSystemViewCommand.java
index 16429f65375..a6a3048615b 100644
--- 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/FileSystemViewCommand.java
+++ 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/FileSystemViewCommand.java
@@ -247,7 +247,7 @@ public cl

Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-05-14 Thread via GitHub


jonvex merged PR #10915:
URL: https://github.com/apache/hudi/pull/10915


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-05-14 Thread via GitHub


jonvex commented on PR #10915:
URL: https://github.com/apache/hudi/pull/10915#issuecomment-2111382582

   Screenshot: https://github.com/apache/hudi/assets/26940621/04586717-9262-41f9-9c1b-bb848e5c184c
   Azure is actually passing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) 20/28: [HUDI-7578] Avoid unnecessary rewriting to improve performance (#11028)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit ce9ff1cb21a2289a8c563b1c97f9c5943167b3f0
Author: Danny Chan 
AuthorDate: Wed Apr 17 11:31:17 2024 +0800

[HUDI-7578] Avoid unnecessary rewriting to improve performance (#11028)
---
 .../src/main/java/org/apache/hudi/io/HoodieMergeHandle.java | 13 +
 .../org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java  |  2 +-
 .../java/org/apache/hudi/io/HoodieSortedMergeHandle.java|  4 ++--
 .../hudi/io/FlinkMergeAndReplaceHandleWithChangeLog.java|  2 +-
 .../org/apache/hudi/io/FlinkMergeHandleWithChangeLog.java   |  2 +-
 .../src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java |  4 
 6 files changed, 14 insertions(+), 13 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
index e40a5585067..749b08c3e7e 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
@@ -103,7 +103,7 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
   protected Map> keyToNewRecords;
   protected Set writtenRecordKeys;
   protected HoodieFileWriter fileWriter;
-  private boolean preserveMetadata = false;
+  protected boolean preserveMetadata = false;
 
   protected Path newFilePath;
   protected Path oldFilePath;
@@ -111,7 +111,6 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
   protected long recordsDeleted = 0;
   protected long updatedRecordsWritten = 0;
   protected long insertRecordsWritten = 0;
-  protected boolean useWriterSchemaForCompaction;
   protected Option keyGeneratorOpt;
   private HoodieBaseFile baseFileToMerge;
 
@@ -142,7 +141,6 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
HoodieBaseFile dataFileToBeMerged, 
TaskContextSupplier taskContextSupplier, Option 
keyGeneratorOpt) {
 super(config, instantTime, partitionPath, fileId, hoodieTable, 
taskContextSupplier);
 this.keyToNewRecords = keyToNewRecords;
-this.useWriterSchemaForCompaction = true;
 this.preserveMetadata = true;
 init(fileId, this.partitionPath, dataFileToBeMerged);
 validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
@@ -279,7 +277,7 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
   }
 
   protected void writeInsertRecord(HoodieRecord newRecord) throws 
IOException {
-Schema schema = useWriterSchemaForCompaction ? writeSchemaWithMetaFields : 
writeSchema;
+Schema schema = preserveMetadata ? writeSchemaWithMetaFields : writeSchema;
 // just skip the ignored record
 if (newRecord.shouldIgnore(schema, config.getProps())) {
   return;
@@ -308,7 +306,7 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
 }
 try {
   if (combineRecord.isPresent() && !combineRecord.get().isDelete(schema, 
config.getProps()) && !isDelete) {
-writeToFile(newRecord.getKey(), combineRecord.get(), schema, prop, 
preserveMetadata && useWriterSchemaForCompaction);
+writeToFile(newRecord.getKey(), combineRecord.get(), schema, prop, 
preserveMetadata);
 recordsWritten++;
   } else {
 recordsDeleted++;
@@ -335,7 +333,7 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
*/
   public void write(HoodieRecord oldRecord) {
 Schema oldSchema = config.populateMetaFields() ? writeSchemaWithMetaFields 
: writeSchema;
-Schema newSchema = useWriterSchemaForCompaction ? 
writeSchemaWithMetaFields : writeSchema;
+Schema newSchema = preserveMetadata ? writeSchemaWithMetaFields : 
writeSchema;
 boolean copyOldRecord = true;
 String key = oldRecord.getRecordKey(oldSchema, keyGeneratorOpt);
 TypedProperties props = config.getPayloadConfig().getProps();
@@ -384,8 +382,7 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
 // NOTE: `FILENAME_METADATA_FIELD` has to be rewritten to correctly point 
to the
 //   file holding this record even in cases when overall metadata is 
preserved
 MetadataValues metadataValues = new 
MetadataValues().setFileName(newFilePath.getName());
-HoodieRecord populatedRecord =
-record.prependMetaFields(schema, writeSchemaWithMetaFields, 
metadataValues, prop);
+HoodieRecord populatedRecord = record.prependMetaFields(schema, 
writeSchemaWithMetaFields, metadataValues, prop);
 
 if (shouldPreserveRecordMetadata) {
   fileWriter.write(key.getRecordKey(), populatedRecord, 
writeSchemaWithMetaFields);
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChange

(hudi) 24/28: [HUDI-7633] Use try with resources for AutoCloseable (#11045)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit d20f8175b95db94d30f937b8e696d57c3993b8dd
Author: Y Ethan Guo 
AuthorDate: Wed Apr 17 21:31:44 2024 -0700

[HUDI-7633] Use try with resources for AutoCloseable (#11045)
---
 .../hudi/cli/commands/ArchivedCommitsCommand.java  | 104 
 .../apache/hudi/cli/commands/ExportCommand.java|  93 +++---
 .../hudi/cli/commands/HoodieLogFileCommand.java| 104 
 .../org/apache/hudi/cli/commands/TableCommand.java |   6 +-
 .../metadata/HoodieBackedTableMetadataWriter.java  |   8 +-
 .../hudi/common/model/HoodiePartitionMetadata.java |   8 +-
 .../hudi/common/table/log/LogReaderUtils.java  |  22 ++--
 .../table/log/block/HoodieAvroDataBlock.java   | 135 ++---
 .../hudi/common/util/SerializationUtils.java   |   6 +-
 .../hudi/metadata/HoodieBackedTableMetadata.java   |  24 ++--
 .../java/HoodieJavaWriteClientExample.java |  70 +--
 .../examples/spark/HoodieWriteClientExample.java   |  90 +++---
 .../org/apache/hudi/common/util/FileIOUtils.java   |  14 +--
 .../hudi/utilities/HoodieCompactionAdminTool.java  |   9 +-
 .../utilities/streamer/SchedulerConfGenerator.java |   6 +-
 15 files changed, 344 insertions(+), 355 deletions(-)

diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java
 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java
index 075a57d541c..5c57c8f5288 100644
--- 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java
+++ 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java
@@ -114,47 +114,46 @@ public class ArchivedCommitsCommand {
 List allStats = new ArrayList<>();
 for (FileStatus fs : fsStatuses) {
   // read the archived file
-  Reader reader = HoodieLogFormat.newReader(HadoopFSUtils.getFs(basePath, 
HoodieCLI.conf),
-  new HoodieLogFile(fs.getPath()), 
HoodieArchivedMetaEntry.getClassSchema());
-
-  List readRecords = new ArrayList<>();
-  // read the avro blocks
-  while (reader.hasNext()) {
-HoodieAvroDataBlock blk = (HoodieAvroDataBlock) reader.next();
-blk.getRecordIterator(HoodieRecordType.AVRO).forEachRemaining(r -> 
readRecords.add((IndexedRecord) r.getData()));
+  try (Reader reader = 
HoodieLogFormat.newReader(HadoopFSUtils.getFs(basePath, HoodieCLI.conf),
+  new HoodieLogFile(fs.getPath()), 
HoodieArchivedMetaEntry.getClassSchema())) {
+List readRecords = new ArrayList<>();
+// read the avro blocks
+while (reader.hasNext()) {
+  HoodieAvroDataBlock blk = (HoodieAvroDataBlock) reader.next();
+  blk.getRecordIterator(HoodieRecordType.AVRO).forEachRemaining(r -> 
readRecords.add((IndexedRecord) r.getData()));
+}
+List readCommits = readRecords.stream().map(r -> 
(GenericRecord) r)
+.filter(r -> 
r.get("actionType").toString().equals(HoodieTimeline.COMMIT_ACTION)
+|| 
r.get("actionType").toString().equals(HoodieTimeline.DELTA_COMMIT_ACTION))
+.flatMap(r -> {
+  HoodieCommitMetadata metadata = (HoodieCommitMetadata) 
SpecificData.get()
+  .deepCopy(HoodieCommitMetadata.SCHEMA$, 
r.get("hoodieCommitMetadata"));
+  final String instantTime = r.get("commitTime").toString();
+  final String action = r.get("actionType").toString();
+  return 
metadata.getPartitionToWriteStats().values().stream().flatMap(hoodieWriteStats 
-> hoodieWriteStats.stream().map(hoodieWriteStat -> {
+List row = new ArrayList<>();
+row.add(action);
+row.add(instantTime);
+row.add(hoodieWriteStat.getPartitionPath());
+row.add(hoodieWriteStat.getFileId());
+row.add(hoodieWriteStat.getPrevCommit());
+row.add(hoodieWriteStat.getNumWrites());
+row.add(hoodieWriteStat.getNumInserts());
+row.add(hoodieWriteStat.getNumDeletes());
+row.add(hoodieWriteStat.getNumUpdateWrites());
+row.add(hoodieWriteStat.getTotalLogFiles());
+row.add(hoodieWriteStat.getTotalLogBlocks());
+row.add(hoodieWriteStat.getTotalCorruptLogBlock());
+row.add(hoodieWriteStat.getTotalRollbackBlocks());
+row.add(hoodieWriteStat.getTotalLogRecords());
+row.add(hoodieWriteStat.getTotalUpdatedRecordsCompacted());
+row.add(hoodieWriteStat.getTotalWriteBytes());
+row.add(hoodieWriteStat.getTotalWriteErrors());
+return row;
+  })).map(rowList -> rowList.toArray(new Comparable[0]));
+}).collect(Collectors.toList());
+allStats.a

(hudi) 21/28: [HUDI-7625] Avoid unnecessary rewrite for metadata table (#11038)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit a5304f586f456333d6a19479ccee67fd7d75fb56
Author: Danny Chan 
AuthorDate: Wed Apr 17 14:37:28 2024 +0800

[HUDI-7625] Avoid unnecessary rewrite for metadata table (#11038)
---
 .../src/main/java/org/apache/hudi/io/HoodieMergeHandle.java | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
index 749b08c3e7e..3f9aa2981c1 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
@@ -332,7 +332,11 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
* Go through an old record. Here if we detect a newer version shows up, we 
write the new one to the file.
*/
   public void write(HoodieRecord oldRecord) {
-Schema oldSchema = config.populateMetaFields() ? writeSchemaWithMetaFields 
: writeSchema;
+// Use schema with metadata files no matter whether 
'hoodie.populate.meta.fields' is enabled
+// to avoid unnecessary rewrite. Even with metadata table(whereas the 
option 'hoodie.populate.meta.fields' is configured as false),
+// the record is deserialized with schema including metadata fields,
+// see HoodieMergeHelper#runMerge for more details.
+Schema oldSchema = writeSchemaWithMetaFields;
 Schema newSchema = preserveMetadata ? writeSchemaWithMetaFields : 
writeSchema;
 boolean copyOldRecord = true;
 String key = oldRecord.getRecordKey(oldSchema, keyGeneratorOpt);



(hudi) 14/28: [HUDI-7615] Mark a few write configs with the correct sinceVersion (#11012)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 89ff73de17513d79c311db4ccb0e13dbe1a25b39
Author: FreeTao 
AuthorDate: Sun Apr 14 18:36:22 2024 -0700

[HUDI-7615] Mark a few write configs with the correct sinceVersion (#11012)
---
 .../main/java/org/apache/hudi/keygen/constant/KeyGeneratorOptions.java   | 1 +
 1 file changed, 1 insertion(+)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/keygen/constant/KeyGeneratorOptions.java
 
b/hudi-common/src/main/java/org/apache/hudi/keygen/constant/KeyGeneratorOptions.java
index db4a9162129..3273a4fc49b 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/keygen/constant/KeyGeneratorOptions.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/keygen/constant/KeyGeneratorOptions.java
@@ -63,6 +63,7 @@ public class KeyGeneratorOptions extends HoodieConfig {
   public static final ConfigProperty 
KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED = ConfigProperty
   
.key("hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled")
   .defaultValue("false")
+  .sinceVersion("0.10.1")
   .markAdvanced()
   .withDocumentation("When set to true, consistent value will be generated 
for a logical timestamp type column, "
   + "like timestamp-millis and timestamp-micros, irrespective of 
whether row-writer is enabled. Disabled by default so "



(hudi) 27/28: [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage (#11048)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 6c152dbef8a0dd1bebdefbe08d22171214964c06
Author: Y Ethan Guo 
AuthorDate: Tue May 14 17:02:25 2024 -0700

[HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage 
(#11048)

This PR adds `getDefaultBlockSize` and `openSeekable` APIs to
`HoodieStorage` and implements these APIs in `HoodieHadoopStorage`.
The implementation follows the same logic used to create a seekable
input stream for log file reading, and `openSeekable` will be used by
the log reading logic.

A few util methods are moved from the `FSUtils` and
`HoodieLogFileReader` classes to the `HadoopFSUtils` class.
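
A minimal sketch of how callers can use the new APIs instead of Hadoop
classes directly; the exact parameter lists of openSeekable and
getDefaultBlockSize are assumptions inferred from this description, not the
verbatim HoodieStorage signatures:

// Assumes org.apache.hudi.storage.{HoodieStorage, StoragePath} and
// org.apache.hudi.io.SeekableDataInputStream are on the classpath.
static void readFirstBytes(HoodieStorage storage, StoragePath logPath) throws IOException {
  int bufferSize = 16 * 1024;                          // illustrative buffer size
  long blockSize = storage.getDefaultBlockSize(logPath);
  System.out.println("default block size: " + blockSize);
  try (SeekableDataInputStream in = storage.openSeekable(logPath, bufferSize)) {
    in.seek(0);                                        // position at the start of the log
    // ... log-block parsing would go here ...
  }
}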
---
 .../java/org/apache/hudi/common/fs/FSUtils.java| 18 -
 .../hudi/common/table/log/HoodieLogFileReader.java | 75 +-
 .../org/apache/hudi/hadoop/fs/HadoopFSUtils.java   | 90 ++
 .../hudi/storage/hadoop/HoodieHadoopStorage.java   | 13 
 .../org/apache/hudi/storage/HoodieStorage.java | 30 
 .../hudi/io/storage/TestHoodieStorageBase.java | 43 +++
 6 files changed, 179 insertions(+), 90 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java 
b/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
index 292c2b41946..1b51fd78bfa 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
@@ -667,24 +667,6 @@ public class FSUtils {
 return fs.getUri() + fullPartitionPath.toUri().getRawPath();
   }
 
-  /**
-   * This is due to HUDI-140 GCS has a different behavior for detecting EOF 
during seek().
-   *
-   * @param fs fileSystem instance.
-   * @return true if the inputstream or the wrapped one is of type 
GoogleHadoopFSInputStream
-   */
-  public static boolean isGCSFileSystem(FileSystem fs) {
-return fs.getScheme().equals(StorageSchemes.GCS.getScheme());
-  }
-
-  /**
-   * Chdfs will throw {@code IOException} instead of {@code EOFException}. It 
will cause error in isBlockCorrupted().
-   * Wrapped by {@code BoundedFsDataInputStream}, to check whether the desired 
offset is out of the file size in advance.
-   */
-  public static boolean isCHDFileSystem(FileSystem fs) {
-return StorageSchemes.CHDFS.getScheme().equals(fs.getScheme());
-  }
-
   public static Configuration registerFileSystem(Path file, Configuration 
conf) {
 Configuration returnConf = new Configuration(conf);
 String scheme = HadoopFSUtils.getFs(file.toString(), conf).getScheme();
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
index c1daf5e32d1..062e3639073 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
@@ -37,20 +37,15 @@ import org.apache.hudi.common.util.Option;
 import org.apache.hudi.exception.CorruptedLogFileException;
 import org.apache.hudi.exception.HoodieIOException;
 import org.apache.hudi.exception.HoodieNotSupportedException;
-import org.apache.hudi.hadoop.fs.BoundedFsDataInputStream;
 import org.apache.hudi.hadoop.fs.HadoopSeekableDataInputStream;
-import org.apache.hudi.hadoop.fs.SchemeAwareFSDataInputStream;
-import org.apache.hudi.hadoop.fs.TimedFSDataInputStream;
 import org.apache.hudi.internal.schema.InternalSchema;
 import org.apache.hudi.io.SeekableDataInputStream;
 import org.apache.hudi.io.util.IOUtils;
+import org.apache.hudi.storage.StoragePath;
 import org.apache.hudi.storage.StorageSchemes;
 
 import org.apache.avro.Schema;
 import org.apache.hadoop.conf.Configuration;
-import org.apache.hadoop.fs.BufferedFSInputStream;
-import org.apache.hadoop.fs.FSDataInputStream;
-import org.apache.hadoop.fs.FSInputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.slf4j.Logger;
@@ -67,6 +62,7 @@ import java.util.Objects;
 
 import static org.apache.hudi.common.util.ValidationUtils.checkArgument;
 import static org.apache.hudi.common.util.ValidationUtils.checkState;
+import static org.apache.hudi.hadoop.fs.HadoopFSUtils.getFSDataInputStream;
 
 /**
  * Scans a log file and provides block level iterator on the log file Loads 
the entire block contents in memory Can emit
@@ -479,71 +475,6 @@ public class HoodieLogFileReader implements 
HoodieLogFormat.Reader {
   private static SeekableDataInputStream getDataInputStream(FileSystem fs,
 HoodieLogFile 
logFile,
 int bufferSize) {
-return new HadoopSeekableDataInputStream(getFSDataInputStream(fs, logFile, 
bufferSize));
-  }
-
-  

(hudi) 25/28: [MINOR] Remove redundant TestStringUtils in hudi-common (#11046)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit bfb83c8eb8c59f90a548a34561c7fdee833ebaee
Author: Y Ethan Guo 
AuthorDate: Wed Apr 17 21:34:06 2024 -0700

[MINOR] Remove redundant TestStringUtils in hudi-common (#11046)
---
 .../apache/hudi/common/util/TestStringUtils.java   | 124 -
 1 file changed, 124 deletions(-)

diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java 
b/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java
deleted file mode 100644
index 54985056bf0..000
--- a/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java
+++ /dev/null
@@ -1,124 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License.  You may obtain a copy of the License at
- *
- *  http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.hudi.common.util;
-
-import org.junit.jupiter.api.Test;
-
-import java.nio.ByteBuffer;
-import java.util.ArrayList;
-import java.util.Arrays;
-import java.util.Collections;
-
-import static org.apache.hudi.common.util.StringUtils.getUTF8Bytes;
-import static org.junit.jupiter.api.Assertions.assertEquals;
-import static org.junit.jupiter.api.Assertions.assertNotEquals;
-import static org.junit.jupiter.api.Assertions.assertNull;
-import static org.junit.jupiter.api.Assertions.assertTrue;
-
-/**
- * Tests {@link StringUtils}.
- */
-public class TestStringUtils {
-
-  private static final String[] STRINGS = {"This", "is", "a", "test"};
-
-  @Test
-  public void testStringJoinWithDelim() {
-String joinedString = StringUtils.joinUsingDelim("-", STRINGS);
-assertEquals(STRINGS.length, joinedString.split("-").length);
-  }
-
-  @Test
-  public void testStringJoin() {
-assertNotEquals(null, StringUtils.join(""));
-assertNotEquals(null, StringUtils.join(STRINGS));
-  }
-
-  @Test
-  public void testStringJoinWithJavaImpl() {
-assertNull(StringUtils.join(",", null));
-assertEquals("", String.join(",", Collections.singletonList("")));
-assertEquals(",", String.join(",", Arrays.asList("", "")));
-assertEquals("a,", String.join(",", Arrays.asList("a", "")));
-  }
-
-  @Test
-  public void testStringNullToEmpty() {
-String str = "This is a test";
-assertEquals(str, StringUtils.nullToEmpty(str));
-assertEquals("", StringUtils.nullToEmpty(null));
-  }
-
-  @Test
-  public void testStringObjToString() {
-assertNull(StringUtils.objToString(null));
-assertEquals("Test String", StringUtils.objToString("Test String"));
-
-// assert byte buffer
-ByteBuffer byteBuffer1 = ByteBuffer.wrap(getUTF8Bytes("1234"));
-ByteBuffer byteBuffer2 = ByteBuffer.wrap(getUTF8Bytes("5678"));
-// assert equal because ByteBuffer has overwritten the toString to return 
a summary string
-assertEquals(byteBuffer1.toString(), byteBuffer2.toString());
-// assert not equal
-assertNotEquals(StringUtils.objToString(byteBuffer1), 
StringUtils.objToString(byteBuffer2));
-  }
-
-  @Test
-  public void testStringEmptyToNull() {
-assertNull(StringUtils.emptyToNull(""));
-assertEquals("Test String", StringUtils.emptyToNull("Test String"));
-  }
-
-  @Test
-  public void testStringNullOrEmpty() {
-assertTrue(StringUtils.isNullOrEmpty(null));
-assertTrue(StringUtils.isNullOrEmpty(""));
-assertNotEquals(null, StringUtils.isNullOrEmpty("this is not empty"));
-assertTrue(StringUtils.isNullOrEmpty(""));
-  }
-
-  @Test
-  public void testSplit() {
-assertEquals(new ArrayList<>(), StringUtils.split(null, ","));
-assertEquals(new ArrayList<>(), StringUtils.split("", ","));
-assertEquals(Arrays.asList("a", "b", "c"), StringUtils.split("a,b, c", 
","));
-assertEquals(Arrays.asList("a", "b", "c"), StringUtils.split("a,b,, c ", 
","));
-  }
-
-  @Test
-  public void testHexString() {
-String str = "abcd";
-assertEquals(StringUtils.toHexString(getUTF8Bytes(str)), 
toHexString(getUTF8Bytes(str)));
-  }
-
-  private static String toHexString(byte[] bytes) {
-StringBuilder sb = new StringBuilder(bytes.length * 2);
-for (byte b : bytes) {
-  sb.append(String.format("%02x", b));
-}
-return sb.toString();
-  }
-
- 

(hudi) 02/28: [HUDI-7583] Read log block header only for the schema and instant time (#10984)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit a9c7eebd0fc8ffccf8859ae56147f2873d0f9892
Author: Y Ethan Guo 
AuthorDate: Tue May 14 16:07:09 2024 -0700

[HUDI-7583] Read log block header only for the schema and instant time 
(#10984)
---
 .../hudi/common/table/TableSchemaResolver.java |  5 +-
 .../common/functional/TestHoodieLogFormat.java |  2 +-
 .../hudi/common/table/TestTableSchemaResolver.java | 56 ++
 .../utilities/HoodieMetadataTableValidator.java|  2 +-
 4 files changed, 62 insertions(+), 3 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java
index f37dd4e7540..0344331ab75 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java
@@ -385,7 +385,10 @@ public class TableSchemaResolver {
* @return
*/
   public static MessageType readSchemaFromLogFile(FileSystem fs, Path path) 
throws IOException {
-try (Reader reader = HoodieLogFormat.newReader(fs, new 
HoodieLogFile(path), null)) {
+// We only need to read the schema from the log block header,
+// so we read the block lazily to avoid reading block content
+// containing the records
+try (Reader reader = HoodieLogFormat.newReader(fs, new 
HoodieLogFile(path), null, true, false)) {
   HoodieDataBlock lastBlock = null;
   while (reader.hasNext()) {
 HoodieLogBlock block = reader.next();
diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/functional/TestHoodieLogFormat.java
 
b/hudi-common/src/test/java/org/apache/hudi/common/functional/TestHoodieLogFormat.java
index 0b3bcc812ae..d4cb5021afc 100755
--- 
a/hudi-common/src/test/java/org/apache/hudi/common/functional/TestHoodieLogFormat.java
+++ 
b/hudi-common/src/test/java/org/apache/hudi/common/functional/TestHoodieLogFormat.java
@@ -2804,7 +2804,7 @@ public class TestHoodieLogFormat extends 
HoodieCommonTestHarness {
 }
   }
 
-  private static HoodieDataBlock getDataBlock(HoodieLogBlockType 
dataBlockType, List records,
+  public static HoodieDataBlock getDataBlock(HoodieLogBlockType dataBlockType, 
List records,
   Map 
header) {
 return getDataBlock(dataBlockType, 
records.stream().map(HoodieAvroIndexedRecord::new).collect(Collectors.toList()),
 header, new Path("dummy_path"));
   }
diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/table/TestTableSchemaResolver.java
 
b/hudi-common/src/test/java/org/apache/hudi/common/table/TestTableSchemaResolver.java
index b7f0ba8eba7..d8d0d8c9f72 100644
--- 
a/hudi-common/src/test/java/org/apache/hudi/common/table/TestTableSchemaResolver.java
+++ 
b/hudi-common/src/test/java/org/apache/hudi/common/table/TestTableSchemaResolver.java
@@ -19,13 +19,33 @@
 package org.apache.hudi.common.table;
 
 import org.apache.hudi.avro.AvroSchemaUtils;
+import org.apache.hudi.common.model.HoodieLogFile;
+import org.apache.hudi.common.table.log.HoodieLogFormat;
+import org.apache.hudi.common.table.log.block.HoodieDataBlock;
+import org.apache.hudi.common.table.log.block.HoodieLogBlock;
 import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.common.testutils.SchemaTestUtil;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.internal.schema.HoodieSchemaException;
 
 import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.avro.AvroSchemaConverter;
 import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
 
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import static 
org.apache.hudi.common.functional.TestHoodieLogFormat.getDataBlock;
+import static 
org.apache.hudi.common.table.log.block.HoodieLogBlock.HoodieLogBlockType.AVRO_DATA_BLOCK;
+import static org.apache.hudi.common.testutils.SchemaTestUtil.getSimpleSchema;
 import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.junit.jupiter.api.Assertions.assertNotEquals;
 import static org.junit.jupiter.api.Assertions.assertTrue;
@@ -35,6 +55,9 @@ import static org.junit.jupiter.api.Assertions.assertTrue;
  */
 public class TestTableSchemaResolver {
 
+  @TempDir
+  public java.nio.file.Path tempDir;
+
   @Test
   public void testRecreateSchemaWhenDropPartitionColumns() {
 Schema originSchema = new 
Schema.Parser().parse(HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA);
@@ -65,4 +88,37 @@ public class TestTableSchemaResolver {
   assertTrue

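
The change above only needs the schema and instant time from the log block
header, so it opens the reader with lazy content reading. A minimal,
framework-free sketch of that lazy pattern; the class and method names are
illustrative, not Hudi's actual reader API:

import java.util.Collections;
import java.util.Map;
import java.util.function.Supplier;

// Illustrative sketch (not Hudi's API): expose the block header eagerly while
// deferring the expensive content read until it is actually requested.
final class LazyBlockSketch {
  private final Map<String, String> header;      // e.g. schema and instant time
  private final Supplier<byte[]> contentLoader;  // expensive read from storage
  private byte[] content;                        // memoized once loaded

  LazyBlockSketch(Map<String, String> header, Supplier<byte[]> contentLoader) {
    this.header = header;
    this.contentLoader = contentLoader;
  }

  Map<String, String> getHeader() {
    return header;                               // cheap: no content read involved
  }

  synchronized byte[] getContent() {
    if (content == null) {
      content = contentLoader.get();             // pay the cost only when needed
    }
    return content;
  }

  public static void main(String[] args) {
    LazyBlockSketch block = new LazyBlockSketch(
        Collections.singletonMap("INSTANT_TIME", "20240514160709000"),
        () -> {
          System.out.println("loading block content...");
          return new byte[1024];
        });
    System.out.println(block.getHeader());       // never triggers the content load
  }
}

A schema lookup that only calls getHeader() never pays for the record
content, which is the effect the extra lazy-read flags aim for in the commit
above.
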
(hudi) 10/28: [HUDI-7601] Add heartbeat mechanism to refresh lock (#10994)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 4aa23beec97032471eb522b7f5ae99ea57e358f5
Author: Yann Byron 
AuthorDate: Fri Apr 12 14:12:04 2024 +0800

[HUDI-7601] Add heartbeat mechanism to refresh lock (#10994)

* [HUDI-7601] Add heartbeat mechanism to refresh lock
---
 .../org/apache/hudi/config/HoodieLockConfig.java   | 13 +++
 .../hudi/common/config/LockConfiguration.java  |  3 ++
 .../hudi/hive/transaction/lock/Heartbeat.java  | 42 ++
 .../lock/HiveMetastoreBasedLockProvider.java   | 23 ++--
 4 files changed, 79 insertions(+), 2 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java
index b24aecf46c1..4fbae5326f3 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java
@@ -36,6 +36,7 @@ import java.util.Properties;
 
 import static 
org.apache.hudi.common.config.LockConfiguration.DEFAULT_LOCK_ACQUIRE_NUM_RETRIES;
 import static 
org.apache.hudi.common.config.LockConfiguration.DEFAULT_LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS;
+import static 
org.apache.hudi.common.config.LockConfiguration.DEFAULT_LOCK_HEARTBEAT_INTERVAL_MS;
 import static 
org.apache.hudi.common.config.LockConfiguration.DEFAULT_ZK_CONNECTION_TIMEOUT_MS;
 import static 
org.apache.hudi.common.config.LockConfiguration.DEFAULT_ZK_SESSION_TIMEOUT_MS;
 import static 
org.apache.hudi.common.config.LockConfiguration.FILESYSTEM_LOCK_EXPIRE_PROP_KEY;
@@ -49,6 +50,7 @@ import static 
org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_NUM_R
 import static 
org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_RETRY_MAX_WAIT_TIME_IN_MILLIS_PROP_KEY;
 import static 
org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS_PROP_KEY;
 import static 
org.apache.hudi.common.config.LockConfiguration.LOCK_ACQUIRE_WAIT_TIMEOUT_MS_PROP_KEY;
+import static 
org.apache.hudi.common.config.LockConfiguration.LOCK_HEARTBEAT_INTERVAL_MS_KEY;
 import static org.apache.hudi.common.config.LockConfiguration.LOCK_PREFIX;
 import static 
org.apache.hudi.common.config.LockConfiguration.ZK_BASE_PATH_PROP_KEY;
 import static 
org.apache.hudi.common.config.LockConfiguration.ZK_CONNECTION_TIMEOUT_MS_PROP_KEY;
@@ -111,6 +113,12 @@ public class HoodieLockConfig extends HoodieConfig {
   .sinceVersion("0.8.0")
   .withDocumentation("Timeout in ms, to wait on an individual lock 
acquire() call, at the lock provider.");
 
+  public static final ConfigProperty LOCK_HEARTBEAT_INTERVAL_MS = 
ConfigProperty
+  .key(LOCK_HEARTBEAT_INTERVAL_MS_KEY)
+  .defaultValue(DEFAULT_LOCK_HEARTBEAT_INTERVAL_MS)
+  .sinceVersion("1.0.0")
+  .withDocumentation("Heartbeat interval in ms, to send a heartbeat to 
indicate that hive client holding locks.");
+
   public static final ConfigProperty FILESYSTEM_LOCK_PATH = 
ConfigProperty
   .key(FILESYSTEM_LOCK_PATH_PROP_KEY)
   .noDefaultValue()
@@ -342,6 +350,11 @@ public class HoodieLockConfig extends HoodieConfig {
   return this;
 }
 
+public HoodieLockConfig.Builder withHeartbeatIntervalInMillis(Long 
intervalInMillis) {
+  lockConfig.setValue(LOCK_HEARTBEAT_INTERVAL_MS, 
String.valueOf(intervalInMillis));
+  return this;
+}
+
 public HoodieLockConfig.Builder 
withConflictResolutionStrategy(ConflictResolutionStrategy 
conflictResolutionStrategy) {
   lockConfig.setValue(WRITE_CONFLICT_RESOLUTION_STRATEGY_CLASS_NAME, 
conflictResolutionStrategy.getClass().getName());
   return this;
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/config/LockConfiguration.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/config/LockConfiguration.java
index c6ebc54e95d..1788122ffe4 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/config/LockConfiguration.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/config/LockConfiguration.java
@@ -43,6 +43,9 @@ public class LockConfiguration implements Serializable {
 
   public static final String LOCK_ACQUIRE_WAIT_TIMEOUT_MS_PROP_KEY = 
LOCK_PREFIX + "wait_time_ms";
 
+  public static final String LOCK_HEARTBEAT_INTERVAL_MS_KEY = LOCK_PREFIX + 
"heartbeat_interval_ms";
+  public static final int DEFAULT_LOCK_HEARTBEAT_INTERVAL_MS = 60 * 1000;
+
   // configs for file system based locks. NOTE: This only works for DFS with 
atomic create/delete operation
   public static final String FILESYSTEM_BASED_LOCK_PROPERTY_PREFIX = 
LOCK_PREFIX + "filesystem.";
 
diff --git 
a/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/transaction/lock/Heartbeat.java
 
b/hudi-sync/hudi-hive-sync/src

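
The Heartbeat added above periodically tells the Hive metastore that the
lock holder is still alive so the acquired lock is not expired. A minimal
sketch of such a heartbeat loop using only the JDK scheduler; the class and
method names are illustrative, not the ones introduced by the commit:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of a lock heartbeat: while a lock is held, periodically
// invoke a refresh callback so the lock service does not expire the lock.
final class LockHeartbeatSketch implements AutoCloseable {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "lock-heartbeat");
        t.setDaemon(true);
        return t;
      });
  private ScheduledFuture<?> task;

  // intervalMs would come from a setting like the heartbeat_interval_ms key above
  void start(Runnable refreshLock, long intervalMs) {
    task = scheduler.scheduleAtFixedRate(refreshLock, intervalMs, intervalMs,
        TimeUnit.MILLISECONDS);
  }

  void stop() {
    if (task != null) {
      task.cancel(false);
    }
  }

  @Override
  public void close() {
    stop();
    scheduler.shutdownNow();
  }

  public static void main(String[] args) throws InterruptedException {
    try (LockHeartbeatSketch heartbeat = new LockHeartbeatSketch()) {
      heartbeat.start(() -> System.out.println("refreshing lock lease"), 200);
      Thread.sleep(700);  // the callback fires a few times while the lock is "held"
    }
  }
}

Cancelling the scheduled task before releasing the lock avoids refreshing a
lock that is already gone, a detail the lock provider has to handle on its
unlock path.
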
(hudi) 03/28: [HUDI-7597] Add logs of Kafka offsets when the checkpoint is out of bound (#10987)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit ef96676f39e44bc1a5e8133a28512967bcc337f9
Author: Y Ethan Guo 
AuthorDate: Wed Apr 10 03:03:45 2024 -0700

[HUDI-7597] Add logs of Kafka offsets when the checkpoint is out of bound 
(#10987)

* [HUDI-7597] Add logs of Kafka offsets when the checkpoint is out of bound

* Adjust test
---
 .../utilities/sources/helpers/KafkaOffsetGen.java  | 29 +++---
 .../utilities/sources/BaseTestKafkaSource.java | 16 ++--
 2 files changed, 27 insertions(+), 18 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
index 442046cd948..71fe7a7629a 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
@@ -331,24 +331,35 @@ public class KafkaOffsetGen {
 
   /**
* Fetch checkpoint offsets for each partition.
-   * @param consumer instance of {@link KafkaConsumer} to fetch offsets from.
+   *
+   * @param consumer  instance of {@link KafkaConsumer} to fetch 
offsets from.
* @param lastCheckpointStr last checkpoint string.
-   * @param topicPartitions set of topic partitions.
+   * @param topicPartitions   set of topic partitions.
* @return a map of Topic partitions to offsets.
*/
   private Map fetchValidOffsets(KafkaConsumer consumer,
-Option 
lastCheckpointStr, Set topicPartitions) {
+  Option 
lastCheckpointStr, Set topicPartitions) {
 Map earliestOffsets = 
consumer.beginningOffsets(topicPartitions);
 Map checkpointOffsets = 
CheckpointUtils.strToOffsets(lastCheckpointStr.get());
-boolean isCheckpointOutOfBounds = checkpointOffsets.entrySet().stream()
-.anyMatch(offset -> offset.getValue() < 
earliestOffsets.get(offset.getKey()));
+List outOfBoundPartitionList = 
checkpointOffsets.entrySet().stream()
+.filter(offset -> offset.getValue() < 
earliestOffsets.get(offset.getKey()))
+.map(Map.Entry::getKey)
+.collect(Collectors.toList());
+boolean isCheckpointOutOfBounds = !outOfBoundPartitionList.isEmpty();
+
 if (isCheckpointOutOfBounds) {
+  String outOfBoundOffsets = outOfBoundPartitionList.stream()
+  .map(p -> p.toString() + ":{checkpoint=" + checkpointOffsets.get(p)
+  + ",earliestOffset=" + earliestOffsets.get(p) + "}")
+  .collect(Collectors.joining(","));
+  String message = "Some data may have been lost because they are not 
available in Kafka any more;"
+  + " either the data was aged out by Kafka or the topic may have been 
deleted before all the data in the topic was processed. "
+  + "Kafka partitions that have out-of-bound checkpoints: " + 
outOfBoundOffsets + " .";
+
   if (getBooleanWithAltKeys(this.props, 
KafkaSourceConfig.ENABLE_FAIL_ON_DATA_LOSS)) {
-throw new HoodieStreamerException("Some data may have been lost 
because they are not available in Kafka any more;"
-+ " either the data was aged out by Kafka or the topic may have 
been deleted before all the data in the topic was processed.");
+throw new HoodieStreamerException(message);
   } else {
-LOG.warn("Some data may have been lost because they are not available 
in Kafka any more;"
-+ " either the data was aged out by Kafka or the topic may have 
been deleted before all the data in the topic was processed."
+LOG.warn(message
 + " If you want Hudi Streamer to fail on such cases, set \"" + 
KafkaSourceConfig.ENABLE_FAIL_ON_DATA_LOSS.key() + "\" to \"true\".");
   }
 }
diff --git 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java
 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java
index c5fc7bfaafa..e45d10e7a61 100644
--- 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java
+++ 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java
@@ -53,6 +53,7 @@ import static 
org.apache.hudi.utilities.config.KafkaSourceConfig.ENABLE_KAFKA_CO
 import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.junit.jupiter.api.Assertions.assertNotNull;
 import static org.junit.jupiter.api.Assertions.assertThrows;
+import static org.junit.jupiter.api.Assertions.assertTrue;
 import static org.mockito.Mockito.mock;
 import static org.mockito.Mockito.when;
 
@@ -254,7 +255,7 @@ abstract class BaseTestKafkaSource extends 
SparkClientFunctionalTestHarness {
 final S

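
The reworked message above lists every partition whose checkpointed offset
has fallen behind the earliest offset Kafka still retains. A dependency-free
sketch of that comparison, using plain String keys in place of Kafka's
TopicPartition; the class and values are made up for illustration:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class OutOfBoundOffsetCheckSketch {
  // Summarizes partitions whose checkpoint is older than the earliest offset
  // still retained by the broker, i.e. data in between was aged out or deleted.
  static String describeOutOfBound(Map<String, Long> checkpointOffsets,
                                   Map<String, Long> earliestOffsets) {
    List<String> outOfBound = checkpointOffsets.entrySet().stream()
        .filter(e -> e.getValue() < earliestOffsets.getOrDefault(e.getKey(), 0L))
        .map(e -> e.getKey() + ":{checkpoint=" + e.getValue()
            + ",earliestOffset=" + earliestOffsets.get(e.getKey()) + "}")
        .collect(Collectors.toList());
    return String.join(",", outOfBound);
  }

  public static void main(String[] args) {
    Map<String, Long> checkpoint = new HashMap<>();
    Map<String, Long> earliest = new HashMap<>();
    checkpoint.put("topic-0", 10L);
    earliest.put("topic-0", 25L);   // broker already aged out offsets 10..24
    checkpoint.put("topic-1", 40L);
    earliest.put("topic-1", 5L);    // still within the retained range
    System.out.println(describeOutOfBound(checkpoint, earliest));
    // prints: topic-0:{checkpoint=10,earliestOffset=25}
  }
}
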
(hudi) 17/28: [HUDI-6762] Removed usages of MetadataRecordsGenerationParams (#10962)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 566039010cf3725e751cb3c4ca8641f42ff5bdb6
Author: Vova Kolmakov 
AuthorDate: Tue May 14 16:10:59 2024 -0700

[HUDI-6762] Removed usages of MetadataRecordsGenerationParams (#10962)

Co-authored-by: Vova Kolmakov 
---
 .../metadata/HoodieBackedTableMetadataWriter.java  | 118 +
 .../hudi/metadata/HoodieTableMetadataUtil.java | 266 -
 .../metadata/MetadataRecordsGenerationParams.java  |  89 ---
 3 files changed, 204 insertions(+), 269 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index 329ff261f53..3537a6ddb40 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -329,12 +329,6 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableM
   LOG.warn("Metadata Table will need to be re-initialized as no instants 
were found");
   return true;
 }
-
-final String latestMetadataInstantTimestamp = 
latestMetadataInstant.get().getTimestamp();
-if (latestMetadataInstantTimestamp.startsWith(SOLO_COMMIT_TIMESTAMP)) { // 
the initialization timestamp is SOLO_COMMIT_TIMESTAMP + offset
-  return false;
-}
-
 return false;
   }
 
@@ -394,8 +388,8 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableM
 for (MetadataPartitionType partitionType : partitionsToInit) {
   // Find the commit timestamp to use for this partition. Each 
initialization should use its own unique commit time.
   String commitTimeForPartition = 
generateUniqueCommitInstantTime(initializationTime);
-
-  LOG.info("Initializing MDT partition " + partitionType.name() + " at 
instant " + commitTimeForPartition);
+  String partitionTypeName = partitionType.name();
+  LOG.info("Initializing MDT partition {} at instant {}", 
partitionTypeName, commitTimeForPartition);
 
   Pair> fileGroupCountAndRecordsPair;
   try {
@@ -413,24 +407,26 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableM
 fileGroupCountAndRecordsPair = initializeRecordIndexPartition();
 break;
   default:
-throw new HoodieMetadataException("Unsupported MDT partition type: 
" + partitionType);
+throw new HoodieMetadataException(String.format("Unsupported MDT 
partition type: %s", partitionType));
 }
   } catch (Exception e) {
 String metricKey = partitionType.getPartitionPath() + "_" + 
HoodieMetadataMetrics.BOOTSTRAP_ERR_STR;
 metrics.ifPresent(m -> m.setMetric(metricKey, 1));
-LOG.error("Bootstrap on " + partitionType.getPartitionPath() + " 
partition failed for "
-+ metadataMetaClient.getBasePath(), e);
-throw new HoodieMetadataException(partitionType.getPartitionPath()
-+ " bootstrap failed for " + metadataMetaClient.getBasePath(), e);
+String errMsg = String.format("Bootstrap on %s partition failed for 
%s",
+partitionType.getPartitionPath(), 
metadataMetaClient.getBasePathV2());
+LOG.error(errMsg, e);
+throw new HoodieMetadataException(errMsg, e);
   }
 
-  LOG.info(String.format("Initializing %s index with %d mappings and %d 
file groups.", partitionType.name(), fileGroupCountAndRecordsPair.getKey(),
-  fileGroupCountAndRecordsPair.getValue().count()));
+  if (LOG.isInfoEnabled()) {
+LOG.info("Initializing {} index with {} mappings and {} file groups.", 
partitionTypeName, fileGroupCountAndRecordsPair.getKey(),
+fileGroupCountAndRecordsPair.getValue().count());
+  }
   HoodieTimer partitionInitTimer = HoodieTimer.start();
 
   // Generate the file groups
   final int fileGroupCount = fileGroupCountAndRecordsPair.getKey();
-  ValidationUtils.checkArgument(fileGroupCount > 0, "FileGroup count for 
MDT partition " + partitionType.name() + " should be > 0");
+  ValidationUtils.checkArgument(fileGroupCount > 0, "FileGroup count for 
MDT partition " + partitionTypeName + " should be > 0");
   initializeFileGroups(dataMetaClient, partitionType, 
commitTimeForPartition, fileGroupCount);
 
   // Perform the commit using bulkCommit
@@ -441,7 +437,7 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableM
   // initialize the metadata reader again so the MDT partition can be read 
after initialization
   initMetadataReader();
   long totalInitTime = partitionInitTimer.endTimer();
-  LOG.info(Stri

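
Besides dropping MetadataRecordsGenerationParams, the diff above converts
string-concatenated log calls to SLF4J parameterized messages and guards the
costly ones. A small sketch of the difference, assuming only slf4j-api on
the classpath; the values are made up:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ParameterizedLoggingSketch {
  private static final Logger LOG = LoggerFactory.getLogger(ParameterizedLoggingSketch.class);

  public static void main(String[] args) {
    String partition = "FILES";
    String instant = "20240514160709000";

    // Eager concatenation builds the string even when INFO is disabled:
    //   LOG.info("Initializing MDT partition " + partition + " at instant " + instant);

    // Parameterized form defers formatting until the level is known to be enabled:
    LOG.info("Initializing MDT partition {} at instant {}", partition, instant);

    // An explicit guard still helps when an argument itself is expensive to compute:
    if (LOG.isInfoEnabled()) {
      LOG.info("Initialized index with {} file groups.", expensiveFileGroupCount());
    }
  }

  private static long expensiveFileGroupCount() {
    return 42L;  // stand-in for a computation worth guarding
  }
}
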
(hudi) 06/28: [HUDI-7391] HoodieMetadataMetrics should use Metrics instance for metrics registry (#10635)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 858fde11fdd2daf118c0a951f6772e3e0b61004d
Author: Lokesh Jain 
AuthorDate: Tue May 14 16:04:34 2024 -0700

[HUDI-7391] HoodieMetadataMetrics should use Metrics instance for metrics 
registry (#10635)

Currently HoodieMetadataMetrics stores metrics in memory, and these metrics
are not pushed by the metric reporters, which are configured on the Metrics
instance. Changes in this PR:

* Metrics-related classes have been moved from hudi-client-common to
hudi-common.
* HoodieMetadataMetrics now uses the Metrics class so that all reporters
are supported with it.
* Config gaps are addressed in HoodieMetadataWriteUtils.
* Some metrics-related APIs and functionality have been moved to
HoodieMetricsConfig; the HoodieWriteConfig APIs now delegate to
HoodieMetricsConfig for that functionality.
---
 hudi-client/hudi-client-common/pom.xml |  46 -
 .../lock/metrics/HoodieLockMetrics.java|   2 +-
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  98 +-
 .../hudi/metadata/HoodieMetadataWriteUtils.java|   9 +-
 .../org/apache/hudi/metrics/HoodieMetrics.java |   2 +-
 .../cloudwatch/CloudWatchMetricsReporter.java  |  29 ++-
 .../table/action/index/RunIndexActionExecutor.java |   3 +-
 .../hudi/metrics/TestHoodieConsoleMetrics.java |  16 +-
 .../hudi/metrics/TestHoodieGraphiteMetrics.java|  22 ++-
 .../apache/hudi/metrics/TestHoodieJmxMetrics.java  |  19 +-
 .../org/apache/hudi/metrics/TestHoodieMetrics.java |  17 +-
 .../hudi/metrics/TestMetricsReporterFactory.java   |  20 +-
 .../cloudwatch/TestCloudWatchMetricsReporter.java  |  27 ++-
 .../datadog/TestDatadogMetricsReporter.java|  60 +++---
 .../org/apache/hudi/metrics/m3/TestM3Metrics.java  |  54 +++---
 .../metrics/prometheus/TestPrometheusReporter.java |  19 +-
 .../prometheus/TestPushGateWayReporter.java|  52 +++---
 .../FlinkHoodieBackedTableMetadataWriter.java  |   4 +-
 .../JavaHoodieBackedTableMetadataWriter.java   |   4 +-
 .../hudi/client/TestJavaHoodieBackedMetadata.java  |  21 ++-
 .../SparkHoodieBackedTableMetadataWriter.java  |   2 +-
 .../functional/TestHoodieBackedMetadata.java   |  18 +-
 hudi-common/pom.xml|  47 +
 .../hudi/common/config/HoodieCommonConfig.java |   8 +
 .../metrics/HoodieMetricsCloudWatchConfig.java |   0
 .../hudi/config/metrics/HoodieMetricsConfig.java   | 201 +
 .../config/metrics/HoodieMetricsDatadogConfig.java |   0
 .../metrics/HoodieMetricsGraphiteConfig.java   |   0
 .../config/metrics/HoodieMetricsJmxConfig.java |   0
 .../hudi/config/metrics/HoodieMetricsM3Config.java |   0
 .../metrics/HoodieMetricsPrometheusConfig.java |   0
 .../apache/hudi/metadata/BaseTableMetadata.java|   4 +-
 .../hudi/metadata/HoodieMetadataMetrics.java   |  21 ++-
 .../hudi/metrics/ConsoleMetricsReporter.java   |   0
 .../java/org/apache/hudi/metrics/HoodieGauge.java  |   0
 .../hudi/metrics/InMemoryMetricsReporter.java  |   0
 .../apache/hudi/metrics/JmxMetricsReporter.java|   4 +-
 .../org/apache/hudi/metrics/JmxReporterServer.java |   0
 .../java/org/apache/hudi/metrics/MetricUtils.java  |   0
 .../main/java/org/apache/hudi/metrics/Metrics.java |  43 +++--
 .../hudi/metrics/MetricsGraphiteReporter.java  |  16 +-
 .../org/apache/hudi/metrics/MetricsReporter.java   |   0
 .../hudi/metrics/MetricsReporterFactory.java   |  27 ++-
 .../apache/hudi/metrics/MetricsReporterType.java   |   0
 .../custom/CustomizableMetricsReporter.java|   0
 .../hudi/metrics/datadog/DatadogHttpClient.java|   0
 .../metrics/datadog/DatadogMetricsReporter.java|   4 +-
 .../hudi/metrics/datadog/DatadogReporter.java  |   0
 .../apache/hudi/metrics/m3/M3MetricsReporter.java  |  16 +-
 .../hudi/metrics/m3/M3ScopeReporterAdaptor.java|   0
 .../metrics/prometheus/PrometheusReporter.java |  10 +-
 .../prometheus/PushGatewayMetricsReporter.java |  18 +-
 .../metrics/prometheus/PushGatewayReporter.java|   0
 .../AbstractUserDefinedMetricsReporter.java|   0
 .../deltastreamer/HoodieDeltaStreamerMetrics.java  |   8 +-
 .../ingestion/HoodieIngestionMetrics.java  |   7 +-
 .../utilities/streamer/HoodieStreamerMetrics.java  |   5 +
 .../apache/hudi/utilities/streamer/StreamSync.java |   2 +-
 58 files changed, 650 insertions(+), 335 deletions(-)

diff --git a/hudi-client/hudi-client-common/pom.xml 
b/hudi-client/hudi-client-common/pom.xml
index 6caccd0b0a6..022f5d6faa0 100644
--- a/hudi-client/hudi-client-common/pom.xml
+++ b/hudi-client/hudi-client-common/pom.xml
@@ -85,52 +85,6 @@
   0.2.2
 
 
-
-
-  io.dropwizard.metrics
-  metrics-graphite
-  
-
-  com.rabbitmq
-

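
The description above boils down to routing HoodieMetadataMetrics through
the shared, reporter-backed registry rather than a private in-memory map. A
minimal Dropwizard Metrics sketch of why that matters; the console reporter
and metric names are illustrative, and this is not Hudi's own Metrics class:

import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;

import java.util.concurrent.TimeUnit;

public class ReporterBackedMetricsSketch {
  public static void main(String[] args) throws InterruptedException {
    // Anything registered on this registry is visible to every attached reporter.
    MetricRegistry registry = new MetricRegistry();
    ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
        .convertRatesTo(TimeUnit.SECONDS)
        .convertDurationsTo(TimeUnit.MILLISECONDS)
        .build();
    reporter.start(1, TimeUnit.SECONDS);

    // Counters kept only in a private map would never reach a reporter;
    // updating them on the shared registry is what makes them publishable.
    registry.counter("metadata.lookup_count").inc();
    registry.counter("metadata.lookup_count").inc();

    Thread.sleep(1500);  // give the reporter one flush interval
    reporter.stop();
  }
}
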
(hudi) 09/28: [HUDI-7290] Don't assume ReplaceCommits are always Clustering (#10479)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit f9ffb64646207ba595559c89fbd415e3b8c7f3f0
Author: Jon Vexler 
AuthorDate: Fri Apr 12 00:08:37 2024 -0400

[HUDI-7290] Don't assume ReplaceCommits are always Clustering (#10479)

* fix all usages not in tests
* do pass through and fix
* fix test that didn't actually use a cluster commit
* make method private and fix naming
* revert write markers changes

-

Co-authored-by: Jonathan Vexler <=>
---
 .../hudi/client/BaseHoodieTableServiceClient.java  | 10 ---
 .../org/apache/hudi/table/marker/WriteMarkers.java |  2 ++
 .../table/timeline/HoodieDefaultTimeline.java  | 31 --
 .../hudi/common/table/timeline/HoodieTimeline.java | 11 
 .../table/view/AbstractTableFileSystemView.java|  5 +---
 .../table/view/TestHoodieTableFileSystemView.java  | 30 +++--
 .../clustering/ClusteringPlanSourceFunction.java   |  2 +-
 .../java/org/apache/hudi/util/ClusteringUtil.java  |  2 +-
 .../apache/hudi/utilities/HoodieClusteringJob.java | 12 -
 9 files changed, 86 insertions(+), 19 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
index 909581687d4..e408dc7a779 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
@@ -444,8 +444,12 @@ public abstract class BaseHoodieTableServiceClient extends BaseHoodieCl
 HoodieTimeline pendingClusteringTimeline = 
table.getActiveTimeline().filterPendingReplaceTimeline();
 HoodieInstant inflightInstant = 
HoodieTimeline.getReplaceCommitInflightInstant(clusteringInstant);
 if (pendingClusteringTimeline.containsInstant(inflightInstant)) {
-  table.rollbackInflightClustering(inflightInstant, commitToRollback -> 
getPendingRollbackInfo(table.getMetaClient(), commitToRollback, false));
-  table.getMetaClient().reloadActiveTimeline();
+  if 
(pendingClusteringTimeline.isPendingClusterInstant(inflightInstant.getTimestamp()))
 {
+table.rollbackInflightClustering(inflightInstant, commitToRollback -> 
getPendingRollbackInfo(table.getMetaClient(), commitToRollback, false));
+table.getMetaClient().reloadActiveTimeline();
+  } else {
+throw new HoodieClusteringException("Non clustering replace-commit 
inflight at timestamp " + clusteringInstant);
+  }
 }
 clusteringTimer = metrics.getClusteringCtx();
 LOG.info("Starting clustering at {}", clusteringInstant);
@@ -575,7 +579,7 @@ public abstract class BaseHoodieTableServiceClient 
extends BaseHoodieCl
 
 // if just inline schedule is enabled
 if (!config.inlineClusteringEnabled() && config.scheduleInlineClustering()
-&& table.getActiveTimeline().filterPendingReplaceTimeline().empty()) {
+&& 
!table.getActiveTimeline().getLastPendingClusterInstant().isPresent()) {
   // proceed only if there are no pending clustering
   
metadata.addMetadata(HoodieClusteringConfig.SCHEDULE_INLINE_CLUSTERING.key(), 
"true");
   inlineScheduleClustering(extraMetadata);
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/WriteMarkers.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/WriteMarkers.java
index 01c8c99618a..f8fbd13b1c2 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/WriteMarkers.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/WriteMarkers.java
@@ -87,6 +87,7 @@ public abstract class WriteMarkers implements Serializable {
   HoodieTimeline pendingReplaceTimeline = 
activeTimeline.filterPendingReplaceTimeline();
   // TODO If current is compact or clustering then create marker directly 
without early conflict detection.
   // Need to support early conflict detection between table service and 
common writers.
+  // ok to use filterPendingReplaceTimeline().containsInstant because 
early conflict detection is not relevant for insert overwrite as well
   if (pendingCompactionTimeline.containsInstant(instantTime) || 
pendingReplaceTimeline.containsInstant(instantTime)) {
 return create(partitionPath, fileName, type, false);
   }
@@ -127,6 +128,7 @@ public abstract class WriteMarkers implements Serializable {
   HoodieTimeline pendingReplaceTimeline = 
activeTimeline.filterPendingReplaceTimeline();
   // TODO If current is compact or clustering then create marker directly 
without early conflict detection.
   // Need to support early conflict detec

(hudi) 19/28: [MINOR] Rename location to path in `makeQualified` (#11037)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 51f7557070da4b3b218e0f17bbc6f7a18baae218
Author: Y Ethan Guo 
AuthorDate: Tue Apr 16 18:30:11 2024 -0700

[MINOR] Rename location to path in `makeQualified` (#11037)
---
 .../src/main/java/org/apache/hudi/common/fs/FSUtils.java | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java 
b/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
index 68cc5c131db..292c2b41946 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
@@ -123,14 +123,14 @@ public class FSUtils {
   }
 
   /**
-   * Makes location qualified with {@link HoodieStorage}'s URI.
+   * Makes path qualified with {@link HoodieStorage}'s URI.
*
-   * @param storage  instance of {@link HoodieStorage}.
-   * @param location to be qualified.
-   * @return qualified location, prefixed with the URI of the target 
HoodieStorage object provided.
+   * @param storage instance of {@link HoodieStorage}.
+   * @param pathto be qualified.
+   * @return qualified path, prefixed with the URI of the target HoodieStorage 
object provided.
*/
-  public static StoragePath makeQualified(HoodieStorage storage, StoragePath 
location) {
-return location.makeQualified(storage.getUri());
+  public static StoragePath makeQualified(HoodieStorage storage, StoragePath 
path) {
+return path.makeQualified(storage.getUri());
   }
 
   /**



(hudi) 18/28: [MINOR] Remove redundant lines in StreamSync and TestStreamSyncUnitTests (#11027)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit f0cf58d8c998018414b27ade424d72350131806e
Author: Y Ethan Guo 
AuthorDate: Mon Apr 15 21:41:41 2024 -0700

[MINOR] Remove redundant lines in StreamSync and TestStreamSyncUnitTests 
(#11027)
---
 .../apache/hudi/utilities/streamer/StreamSync.java   |  4 
 .../utilities/streamer/TestStreamSyncUnitTests.java  | 20 
 2 files changed, 24 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
index 2b0d94da74a..7e0b97ef570 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
@@ -278,7 +278,6 @@ public class StreamSync implements Serializable, Closeable {
 this.formatAdapter = formatAdapter;
 this.transformer = transformer;
 this.useRowWriter = useRowWriter;
-
   }
 
   @Deprecated
@@ -500,7 +499,6 @@ public class StreamSync implements Serializable, Closeable {
* @return Pair Input data read from upstream 
source, and boolean is true if empty.
* @throws Exception in case of any Exception
*/
-
   public InputBatch readFromSource(String instantTime, HoodieTableMetaClient 
metaClient) throws IOException {
 // Retrieve the previous round checkpoints, if any
 Option resumeCheckpointStr = Option.empty();
@@ -563,7 +561,6 @@ public class StreamSync implements Serializable, Closeable {
 // handle empty batch with change in checkpoint
 hoodieSparkContext.setJobStatus(this.getClass().getSimpleName(), "Checking 
if input is empty: " + cfg.targetTableName);
 
-
 if (useRowWriter) { // no additional processing required for row writer.
   return inputBatch;
 } else {
@@ -1297,5 +1294,4 @@ public class StreamSync implements Serializable, 
Closeable {
   return writeStatusRDD;
 }
   }
-
 }
diff --git 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/TestStreamSyncUnitTests.java
 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/TestStreamSyncUnitTests.java
index 99148eb4b07..c0169ae64b8 100644
--- 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/TestStreamSyncUnitTests.java
+++ 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/TestStreamSyncUnitTests.java
@@ -17,25 +17,6 @@
  * under the License.
  */
 
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License.  You may obtain a copy of the License at
- *
- *   http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied.  See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
 package org.apache.hudi.utilities.streamer;
 
 import org.apache.hudi.DataSourceWriteOptions;
@@ -75,7 +56,6 @@ import static org.mockito.Mockito.verify;
 import static org.mockito.Mockito.when;
 
 public class TestStreamSyncUnitTests {
-
   @ParameterizedTest
   @MethodSource("testCasesFetchNextBatchFromSource")
   void testFetchNextBatchFromSource(Boolean useRowWriter, Boolean 
hasTransformer, Boolean hasSchemaProvider,



(hudi) 28/28: [HUDI-7637] Make StoragePathInfo Comparable (#11050)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 31d24d7600bbfacc5959b12de97c84ea79fb6fa0
Author: Y Ethan Guo 
AuthorDate: Thu Apr 18 05:51:23 2024 -0700

[HUDI-7637] Make StoragePathInfo Comparable (#11050)
---
 .../main/java/org/apache/hudi/storage/StoragePathInfo.java  |  7 ++-
 .../org/apache/hudi/io/storage/TestStoragePathInfo.java | 13 +
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/hudi-io/src/main/java/org/apache/hudi/storage/StoragePathInfo.java 
b/hudi-io/src/main/java/org/apache/hudi/storage/StoragePathInfo.java
index e4711bf72dd..1c1ebc32a2f 100644
--- a/hudi-io/src/main/java/org/apache/hudi/storage/StoragePathInfo.java
+++ b/hudi-io/src/main/java/org/apache/hudi/storage/StoragePathInfo.java
@@ -31,7 +31,7 @@ import java.io.Serializable;
  * with simplification based on what Hudi needs.
  */
 @PublicAPIClass(maturity = ApiMaturityLevel.EVOLVING)
-public class StoragePathInfo implements Serializable {
+public class StoragePathInfo implements Serializable, 
Comparable {
   private final StoragePath path;
   private final long length;
   private final boolean isDirectory;
@@ -109,6 +109,11 @@ public class StoragePathInfo implements Serializable {
 return modificationTime;
   }
 
+  @Override
+  public int compareTo(StoragePathInfo o) {
+return this.getPath().compareTo(o.getPath());
+  }
+
   @Override
   public boolean equals(Object o) {
 if (this == o) {
diff --git 
a/hudi-io/src/test/java/org/apache/hudi/io/storage/TestStoragePathInfo.java 
b/hudi-io/src/test/java/org/apache/hudi/io/storage/TestStoragePathInfo.java
index 72640c5e3df..95cf4d798a4 100644
--- a/hudi-io/src/test/java/org/apache/hudi/io/storage/TestStoragePathInfo.java
+++ b/hudi-io/src/test/java/org/apache/hudi/io/storage/TestStoragePathInfo.java
@@ -71,6 +71,19 @@ public class TestStoragePathInfo {
 }
   }
 
+  @Test
+  public void testCompareTo() {
+StoragePathInfo pathInfo1 = new StoragePathInfo(
+new StoragePath(PATH1), LENGTH, false, BLOCK_REPLICATION, BLOCK_SIZE, 
MODIFICATION_TIME);
+StoragePathInfo pathInfo2 = new StoragePathInfo(
+new StoragePath(PATH1), LENGTH + 2, false, BLOCK_REPLICATION, 
BLOCK_SIZE, MODIFICATION_TIME + 2L);
+StoragePathInfo pathInfo3 = new StoragePathInfo(
+new StoragePath(PATH2), LENGTH, false, BLOCK_REPLICATION, BLOCK_SIZE, 
MODIFICATION_TIME);
+
+assertEquals(0, pathInfo1.compareTo(pathInfo2));
+assertEquals(-1, pathInfo1.compareTo(pathInfo3));
+  }
+
   @Test
   public void testEquals() {
 StoragePathInfo pathInfo1 = new StoragePathInfo(



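
compareTo above orders path info entries by their path alone, so directory
listings can be sorted deterministically regardless of size or modification
time. A generic sketch of the same pattern with hypothetical names:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch: compare file metadata entries by path only, so two
// entries for the same path sort as equal even if length or mtime differ.
class PathInfoSketch implements Comparable<PathInfoSketch> {
  private final String path;
  private final long length;

  PathInfoSketch(String path, long length) {
    this.path = path;
    this.length = length;
  }

  String getPath() {
    return path;
  }

  @Override
  public int compareTo(PathInfoSketch other) {
    return this.path.compareTo(other.path);
  }

  public static void main(String[] args) {
    List<PathInfoSketch> listing = new ArrayList<>();
    listing.add(new PathInfoSketch("/table/p2/file-b.parquet", 10));
    listing.add(new PathInfoSketch("/table/p1/file-a.parquet", 20));
    listing.add(new PathInfoSketch("/table/p1/file-a.parquet", 99));
    Collections.sort(listing);  // deterministic order by path
    for (PathInfoSketch info : listing) {
      System.out.println(info.getPath() + " (" + info.length + ")");
    }
  }
}
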
(hudi) 15/28: [HUDI-7584] Always read log block lazily and remove readBlockLazily argument (#11015)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit e39fedfa5b6c554ecfaf332f60d95fbcddf10b03
Author: Vova Kolmakov 
AuthorDate: Mon Apr 15 11:31:11 2024 +0700

[HUDI-7584] Always read log block lazily and remove readBlockLazily 
argument (#11015)
---
 .../hudi/cli/commands/HoodieLogFileCommand.java|   3 -
 .../cli/commands/TestHoodieLogFileCommand.java |   3 -
 .../org/apache/hudi/io/HoodieMergedReadHandle.java |   1 -
 .../hudi/table/action/compact/HoodieCompactor.java |   1 -
 .../run/strategy/JavaExecutionStrategy.java|   1 -
 .../MultipleSparkJobExecutionStrategy.java |   1 -
 .../hudi/common/table/TableSchemaResolver.java |  21 ++--
 .../table/log/AbstractHoodieLogRecordReader.java   |  65 +--
 .../table/log/HoodieCDCLogRecordIterator.java  |   3 +-
 .../hudi/common/table/log/HoodieLogFileReader.java |  69 +--
 .../hudi/common/table/log/HoodieLogFormat.java |  13 +--
 .../common/table/log/HoodieLogFormatReader.java|  14 +--
 .../table/log/HoodieMergedLogRecordScanner.java|  27 ++---
 .../table/log/HoodieUnMergedLogRecordScanner.java  |  12 +-
 .../hudi/common/table/log/LogReaderUtils.java  |   2 +-
 .../metadata/HoodieMetadataLogRecordReader.java|   1 -
 .../hudi/metadata/HoodieTableMetadataUtil.java |   1 -
 .../common/functional/TestHoodieLogFormat.java | 128 +++--
 .../examples/quickstart/TestQuickstartData.java|   1 -
 .../hudi/sink/clustering/ClusteringOperator.java   |   1 -
 .../org/apache/hudi/table/format/FormatUtils.java  |   6 -
 .../test/java/org/apache/hudi/utils/TestData.java  |   1 -
 .../realtime/HoodieMergeOnReadSnapshotReader.java  |   3 -
 .../realtime/RealtimeCompactedRecordReader.java|   1 -
 .../realtime/RealtimeUnmergedRecordReader.java |   1 -
 .../reader/DFSHoodieDatasetInputReader.java|   1 -
 .../src/main/scala/org/apache/hudi/Iterators.scala |   4 -
 .../ShowHoodieLogFileRecordsProcedure.scala|   1 -
 .../utilities/HoodieMetadataTableValidator.java| 126 +---
 29 files changed, 188 insertions(+), 324 deletions(-)

diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java
index 46a9e787ea6..77d9392fcd0 100644
--- 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java
+++ 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java
@@ -238,9 +238,6 @@ public class HoodieLogFileCommand {
   .withLatestInstantTime(
   client.getActiveTimeline()
   .getCommitTimeline().lastInstant().get().getTimestamp())
-  .withReadBlocksLazily(
-  Boolean.parseBoolean(
-  
HoodieCompactionConfig.COMPACTION_LAZY_BLOCK_READ_ENABLE.defaultValue()))
   .withReverseReader(
   Boolean.parseBoolean(
   
HoodieCompactionConfig.COMPACTION_REVERSE_LOG_READ_ENABLE.defaultValue()))
diff --git 
a/hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestHoodieLogFileCommand.java
 
b/hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestHoodieLogFileCommand.java
index 6f75074ff29..dc9cdd1aaf1 100644
--- 
a/hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestHoodieLogFileCommand.java
+++ 
b/hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestHoodieLogFileCommand.java
@@ -241,9 +241,6 @@ public class TestHoodieLogFileCommand extends 
CLIFunctionalTestHarness {
 .withLatestInstantTime(INSTANT_TIME)
 .withMaxMemorySizeInBytes(
 HoodieMemoryConfig.DEFAULT_MAX_MEMORY_FOR_SPILLABLE_MAP_IN_BYTES)
-.withReadBlocksLazily(
-Boolean.parseBoolean(
-
HoodieCompactionConfig.COMPACTION_LAZY_BLOCK_READ_ENABLE.defaultValue()))
 .withReverseReader(
 Boolean.parseBoolean(
 
HoodieCompactionConfig.COMPACTION_REVERSE_LOG_READ_ENABLE.defaultValue()))
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergedReadHandle.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergedReadHandle.java
index e74ab37f4b6..280e24e46b9 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergedReadHandle.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergedReadHandle.java
@@ -126,7 +126,6 @@ public class HoodieMergedReadHandle extends 
HoodieReadHandle 
implements Serializable {
 .withInstantRange(instantRange)
 
.withInternalSchema(internalSchemaOption.orElse(InternalSchema.getEmptyInternalSchema()))
 .withMaxMemorySizeInBytes(maxMemoryPerCompaction)
-.withReadBlocksLazily(config.getCompactionLazyBlockReadEnabled())
   

(hudi) 07/28: [HUDI-6441] Passing custom Headers with Hudi Callback URL (#10970)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit db53c7af20c228d750b5273beb678240d114ce64
Author: Vova Kolmakov 
AuthorDate: Thu Apr 11 21:16:14 2024 +0700

[HUDI-6441] Passing custom Headers with Hudi Callback URL (#10970)
---
 .../http/HoodieWriteCommitHttpCallbackClient.java  |  46 -
 .../config/HoodieWriteCommitCallbackConfig.java|  15 ++
 .../client/http/TestCallbackHttpClient.java| 202 +
 .../hudi/callback/http/TestCallbackHttpClient.java | 143 ---
 4 files changed, 260 insertions(+), 146 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/callback/client/http/HoodieWriteCommitHttpCallbackClient.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/callback/client/http/HoodieWriteCommitHttpCallbackClient.java
index d9248ed20f1..037e84b3d00 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/callback/client/http/HoodieWriteCommitHttpCallbackClient.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/callback/client/http/HoodieWriteCommitHttpCallbackClient.java
@@ -18,6 +18,8 @@
 
 package org.apache.hudi.callback.client.http;
 
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.VisibleForTesting;
 import org.apache.hudi.config.HoodieWriteCommitCallbackConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
 
@@ -34,6 +36,9 @@ import org.slf4j.LoggerFactory;
 
 import java.io.Closeable;
 import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.StringTokenizer;
 
 /**
  * Write commit callback http client.
@@ -43,36 +48,42 @@ public class HoodieWriteCommitHttpCallbackClient implements 
Closeable {
   private static final Logger LOG = 
LoggerFactory.getLogger(HoodieWriteCommitHttpCallbackClient.class);
 
   public static final String HEADER_KEY_API_KEY = "HUDI-CALLBACK-KEY";
+  static final String HEADERS_DELIMITER = ";";
+  static final String HEADERS_KV_DELIMITER = ":";
 
   private final String apiKey;
   private final String url;
   private final CloseableHttpClient client;
   private HoodieWriteConfig writeConfig;
+  private final Map customHeaders;
 
   public HoodieWriteCommitHttpCallbackClient(HoodieWriteConfig config) {
 this.writeConfig = config;
 this.apiKey = getApiKey();
 this.url = getUrl();
 this.client = getClient();
+this.customHeaders = parseCustomHeaders();
   }
 
-  public HoodieWriteCommitHttpCallbackClient(String apiKey, String url, 
CloseableHttpClient client) {
+  public HoodieWriteCommitHttpCallbackClient(String apiKey, String url, 
CloseableHttpClient client, Map customHeaders) {
 this.apiKey = apiKey;
 this.url = url;
 this.client = client;
+this.customHeaders = customHeaders != null ? customHeaders : new 
HashMap<>();
   }
 
   public void send(String callbackMsg) {
 HttpPost request = new HttpPost(url);
 request.setHeader(HEADER_KEY_API_KEY, apiKey);
 request.setHeader(HttpHeaders.CONTENT_TYPE, 
ContentType.APPLICATION_JSON.toString());
+customHeaders.forEach(request::setHeader);
 request.setEntity(new StringEntity(callbackMsg, 
ContentType.APPLICATION_JSON));
 try (CloseableHttpResponse response = client.execute(request)) {
   int statusCode = response.getStatusLine().getStatusCode();
   if (statusCode >= 300) {
-LOG.warn(String.format("Failed to send callback message. Response was 
%s", response));
+LOG.warn("Failed to send callback message. Response was {}", response);
   } else {
-LOG.info(String.format("Sent Callback data to %s successfully !", 
url));
+LOG.info("Sent Callback data with {} custom headers to {} successfully 
!", customHeaders.size(), url);
   }
 } catch (IOException e) {
   LOG.warn("Failed to send callback.", e);
@@ -101,8 +112,37 @@ public class HoodieWriteCommitHttpCallbackClient 
implements Closeable {
 return 
writeConfig.getInt(HoodieWriteCommitCallbackConfig.CALLBACK_HTTP_TIMEOUT_IN_SECONDS);
   }
 
+  private Map parseCustomHeaders() {
+Map headers = new HashMap<>();
+String headersString = 
writeConfig.getString(HoodieWriteCommitCallbackConfig.CALLBACK_HTTP_CUSTOM_HEADERS);
+if (!StringUtils.isNullOrEmpty(headersString)) {
+  StringTokenizer tokenizer = new StringTokenizer(headersString, 
HEADERS_DELIMITER);
+  while (tokenizer.hasMoreTokens()) {
+String token = tokenizer.nextToken();
+if (!StringUtils.isNullOrEmpty(token)) {
+  String[] keyValue = token.split(HEADERS_KV_DELIMITER);
+  if (keyValue.length == 2) {
+String trimKey = keyValue[0].trim();
+String trimValue = keyValue[1].trim();
+if (trimKey.length() > 0 && trimValue.length() > 0) {
+  headers.put(trimKey, trimValue);
+

(hudi) 23/28: [HUDI-4228] Clean up literal usage in Hudi CLI argument check (#11042)
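
Per the new constants above, custom callback headers arrive in one config
value with ';' separating headers and ':' separating each key from its
value. A standalone sketch that approximates the parsing shown in the diff;
the class name and sample headers are made up:

import java.util.HashMap;
import java.util.Map;

public class CallbackHeaderParseSketch {
  // Parses "X-Api-Key:secret;X-Tenant:analytics" into a header map. Tokens
  // without exactly one ':' or with an empty key or value are skipped, similar
  // to the defensive parsing in the commit above.
  static Map<String, String> parse(String headersString) {
    Map<String, String> headers = new HashMap<>();
    if (headersString == null || headersString.isEmpty()) {
      return headers;
    }
    for (String token : headersString.split(";")) {
      String[] kv = token.split(":");
      if (kv.length == 2 && !kv[0].trim().isEmpty() && !kv[1].trim().isEmpty()) {
        headers.put(kv[0].trim(), kv[1].trim());
      }
    }
    return headers;
  }

  public static void main(String[] args) {
    System.out.println(parse("X-Api-Key: secret ;X-Tenant:analytics;badtoken"));
    // prints (order may vary): {X-Tenant=analytics, X-Api-Key=secret}
  }
}
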

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 6fc3ad4060e29562ccba00561edcf75170219ac1
Author: Vova Kolmakov 
AuthorDate: Thu Apr 18 09:14:32 2024 +0700

[HUDI-4228] Clean up literal usage in Hudi CLI argument check (#11042)
---
 .../org/apache/hudi/cli/commands/SparkMain.java| 201 +++--
 .../org/apache/hudi/cli/ArchiveExecutorUtils.java  |   2 +-
 2 files changed, 69 insertions(+), 134 deletions(-)

diff --git a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
index 742540d0ff5..c312deaf6c3 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
@@ -19,14 +19,13 @@
 package org.apache.hudi.cli.commands;
 
 import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.cli.ArchiveExecutorUtils;
 import org.apache.hudi.cli.utils.SparkUtil;
 import org.apache.hudi.client.HoodieTimelineArchiver;
 import org.apache.hudi.client.SparkRDDWriteClient;
 import org.apache.hudi.client.common.HoodieSparkEngineContext;
-import org.apache.hudi.common.config.HoodieMetadataConfig;
 import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.engine.HoodieEngineContext;
-import org.apache.hudi.common.model.HoodieAvroPayload;
 import org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy;
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.model.WriteOperationType;
@@ -37,7 +36,6 @@ import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.PartitionPathEncodeUtils;
 import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.common.util.ValidationUtils;
-import org.apache.hudi.config.HoodieArchivalConfig;
 import org.apache.hudi.config.HoodieBootstrapConfig;
 import org.apache.hudi.config.HoodieCleanConfig;
 import org.apache.hudi.config.HoodieIndexConfig;
@@ -99,16 +97,45 @@ public class SparkMain {
* Commands.
*/
   enum SparkCommand {
-BOOTSTRAP, ROLLBACK, DEDUPLICATE, ROLLBACK_TO_SAVEPOINT, SAVEPOINT, 
IMPORT, UPSERT, COMPACT_SCHEDULE, COMPACT_RUN, COMPACT_SCHEDULE_AND_EXECUTE,
-COMPACT_UNSCHEDULE_PLAN, COMPACT_UNSCHEDULE_FILE, COMPACT_VALIDATE, 
COMPACT_REPAIR, CLUSTERING_SCHEDULE,
-CLUSTERING_RUN, CLUSTERING_SCHEDULE_AND_EXECUTE, CLEAN, DELETE_MARKER, 
DELETE_SAVEPOINT, UPGRADE, DOWNGRADE,
-REPAIR_DEPRECATED_PARTITION, RENAME_PARTITION, ARCHIVE
+BOOTSTRAP(18), ROLLBACK(6), DEDUPLICATE(8), ROLLBACK_TO_SAVEPOINT(6), 
SAVEPOINT(7),
+IMPORT(13), UPSERT(13), COMPACT_SCHEDULE(7), COMPACT_RUN(10), 
COMPACT_SCHEDULE_AND_EXECUTE(9),
+COMPACT_UNSCHEDULE_PLAN(9), COMPACT_UNSCHEDULE_FILE(10), 
COMPACT_VALIDATE(7), COMPACT_REPAIR(8),
+CLUSTERING_SCHEDULE(7), CLUSTERING_RUN(9), 
CLUSTERING_SCHEDULE_AND_EXECUTE(8), CLEAN(5),
+DELETE_MARKER(5), DELETE_SAVEPOINT(5), UPGRADE(5), DOWNGRADE(5),
+REPAIR_DEPRECATED_PARTITION(4), RENAME_PARTITION(6), ARCHIVE(8);
+
+private final int minArgsCount;
+
+SparkCommand(int minArgsCount) {
+  this.minArgsCount = minArgsCount;
+}
+
+void assertEq(int factArgsCount) {
+  ValidationUtils.checkArgument(factArgsCount == minArgsCount);
+}
+
+void assertGtEq(int factArgsCount) {
+  ValidationUtils.checkArgument(factArgsCount >= minArgsCount);
+}
+
+List makeConfigs(String[] args) {
+  List configs = new ArrayList<>();
+  if (args.length > minArgsCount) {
+configs.addAll(Arrays.asList(args).subList(minArgsCount, args.length));
+  }
+  return configs;
+}
+
+String getPropsFilePath(String[] args) {
+  return (args.length >= minArgsCount && 
!StringUtils.isNullOrEmpty(args[minArgsCount - 1]))
+  ? args[minArgsCount - 1] : null;
+}
   }
 
-  public static void main(String[] args) throws Exception {
+  public static void main(String[] args) {
 ValidationUtils.checkArgument(args.length >= 4);
 final String commandString = args[0];
-LOG.info("Invoking SparkMain: " + commandString);
+LOG.info("Invoking SparkMain: {}", commandString);
 final SparkCommand cmd = SparkCommand.valueOf(commandString);
 
 JavaSparkContext jsc = SparkUtil.initJavaSparkContext("hoodie-cli-" + 
commandString,
@@ -116,193 +143,112 @@ public class SparkMain {
 
 int returnCode = 0;
 try {
+  cmd.assertGtEq(args.length);
+  List configs = cmd.makeConfigs(args);
+  String propsFilePath = cmd.getPropsFilePath(args);
   switch (cmd) {
 case ROLLBACK:
-  assert (args.length == 6);
+  cmd.assertEq(args.length);
   returnCode = rollback(jsc, args[3], args[4], 
Boolean.parseBoolean(args[5]));
   break;
 case DEDUPLICATE:
-  assert (args.length == 8);
+  cmd.assertEq(a

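
The refactor above replaces scattered assert (args.length == N) literals
with a minimum-argument count carried by each enum constant, so validation
becomes a single call per command. A compact, generic sketch of that
pattern; the commands and counts are made up:

public class CommandArgCheckSketch {
  enum Command {
    ROLLBACK(3), CLEAN(1), COMPACT_RUN(5);

    private final int minArgsCount;

    Command(int minArgsCount) {
      this.minArgsCount = minArgsCount;
    }

    // One shared check instead of a per-case literal in a big switch statement.
    void assertAtLeast(int actual) {
      if (actual < minArgsCount) {
        throw new IllegalArgumentException(
            name() + " needs at least " + minArgsCount + " arguments, got " + actual);
      }
    }
  }

  public static void main(String[] args) {
    Command cmd = Command.ROLLBACK;
    cmd.assertAtLeast(3);          // passes
    try {
      cmd.assertAtLeast(2);        // fails: too few arguments
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
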
(hudi) 16/28: [HUDI-7619] Removed code duplicates in HoodieTableMetadataUtil (#11022)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 59a1b28849f8412449f6fd2d3ed662524443e037
Author: Vova Kolmakov 
AuthorDate: Tue May 14 16:01:09 2024 -0700

[HUDI-7619] Removed code duplicates in HoodieTableMetadataUtil (#11022)

Co-authored-by: Vova Kolmakov 
---
 .../hudi/metadata/HoodieTableMetadataUtil.java | 92 +-
 1 file changed, 36 insertions(+), 56 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
 
b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
index b25d6741b83..503e3351d8c 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
@@ -73,6 +73,7 @@ import org.apache.hudi.exception.HoodieMetadataException;
 import org.apache.hudi.hadoop.fs.HadoopFSUtils;
 import org.apache.hudi.io.storage.HoodieFileReader;
 import org.apache.hudi.io.storage.HoodieFileReaderFactory;
+import org.apache.hudi.storage.StoragePath;
 import org.apache.hudi.util.Lazy;
 
 import org.apache.avro.AvroTypeException;
@@ -1749,26 +1750,7 @@ public class HoodieTableMetadataUtil {
   final String instantTime = baseFile.getCommitTime();
   HoodieFileReader reader = 
HoodieFileReaderFactory.getReaderFactory(HoodieRecord.HoodieRecordType.AVRO)
   .getFileReader(config, configuration.get(), dataFilePath);
-  ClosableIterator recordKeyIterator = 
reader.getRecordKeyIterator();
-
-  return new ClosableIterator() {
-@Override
-public void close() {
-  recordKeyIterator.close();
-}
-
-@Override
-public boolean hasNext() {
-  return recordKeyIterator.hasNext();
-}
-
-@Override
-public HoodieRecord next() {
-  return forDelete
-  ? 
HoodieMetadataPayload.createRecordIndexDelete(recordKeyIterator.next())
-  : 
HoodieMetadataPayload.createRecordIndexUpdate(recordKeyIterator.next(), 
partition, fileId, instantTime, 0);
-}
-  };
+  return getHoodieRecordIterator(reader.getRecordKeyIterator(), forDelete, 
partition, fileId, instantTime);
 });
   }
 
@@ -1816,24 +1798,7 @@ public class HoodieTableMetadataUtil {
 .withTableMetaClient(metaClient)
 .build();
 ClosableIterator recordKeyIterator = 
ClosableIterator.wrap(mergedLogRecordScanner.getRecords().keySet().iterator());
-return new ClosableIterator() {
-  @Override
-  public void close() {
-recordKeyIterator.close();
-  }
-
-  @Override
-  public boolean hasNext() {
-return recordKeyIterator.hasNext();
-  }
-
-  @Override
-  public HoodieRecord next() {
-return forDelete
-? 
HoodieMetadataPayload.createRecordIndexDelete(recordKeyIterator.next())
-: 
HoodieMetadataPayload.createRecordIndexUpdate(recordKeyIterator.next(), 
partition, fileSlice.getFileId(), fileSlice.getBaseInstantTime(), 0);
-  }
-};
+return getHoodieRecordIterator(recordKeyIterator, forDelete, 
partition, fileSlice.getFileId(), fileSlice.getBaseInstantTime());
   }
   final HoodieBaseFile baseFile = fileSlice.getBaseFile().get();
   final String filename = baseFile.getFileName();
@@ -1844,26 +1809,41 @@ public class HoodieTableMetadataUtil {
   HoodieConfig hoodieConfig = getReaderConfigs(configuration.get());
   HoodieFileReader reader = 
HoodieFileReaderFactory.getReaderFactory(HoodieRecord.HoodieRecordType.AVRO)
   .getFileReader(hoodieConfig, configuration.get(), dataFilePath);
-  ClosableIterator recordKeyIterator = 
reader.getRecordKeyIterator();
+  return getHoodieRecordIterator(reader.getRecordKeyIterator(), forDelete, 
partition, fileId, instantTime);
+});
+  }
 
-  return new ClosableIterator() {
-@Override
-public void close() {
-  recordKeyIterator.close();
-}
+  private static Path filePath(String basePath, String partition, String 
filename) {
+if (partition.isEmpty()) {
+  return new Path(basePath, filename);
+} else {
+  return new Path(basePath, partition + StoragePath.SEPARATOR + filename);
+}
+  }
 
-@Override
-public boolean hasNext() {
-  return recordKeyIterator.hasNext();
-}
+  private static ClosableIterator 
getHoodieRecordIterator(ClosableIterator recordKeyIterator,
+
boolean forDelete,
+String 
partition,
+String 
fileId,
+ 

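
The duplicated anonymous iterator blocks above collapse into one helper that
adapts a record-key iterator through a mapping function. A generic sketch of
that extraction using plain java.util.Iterator instead of Hudi's
ClosableIterator:

import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Function;

public class MappingIteratorSketch {
  // One shared adapter replaces several copies of the same anonymous class;
  // call sites only differ in the mapping function they pass in.
  static <K, R> Iterator<R> mapping(Iterator<K> source, Function<K, R> mapper) {
    return new Iterator<R>() {
      @Override
      public boolean hasNext() {
        return source.hasNext();
      }

      @Override
      public R next() {
        return mapper.apply(source.next());
      }
    };
  }

  public static void main(String[] args) {
    Iterator<String> keys = Arrays.asList("key-1", "key-2").iterator();
    // The "update" and "delete" flavors can share the helper and swap only the mapper.
    Iterator<String> records = mapping(keys, k -> "recordIndexUpdate(" + k + ")");
    records.forEachRemaining(System.out::println);
  }
}
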
(hudi) 26/28: [HUDI-7636] Make StoragePath Serializable (#11049)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 9b0945c7817a552548c23c3cf3dd485caa38f8ec
Author: Y Ethan Guo 
AuthorDate: Wed Apr 17 21:39:28 2024 -0700

[HUDI-7636] Make StoragePath Serializable (#11049)
---
 .../java/org/apache/hudi/storage/StoragePath.java  | 14 +--
 .../apache/hudi/io/storage/TestStoragePath.java| 28 +-
 2 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/hudi-io/src/main/java/org/apache/hudi/storage/StoragePath.java 
b/hudi-io/src/main/java/org/apache/hudi/storage/StoragePath.java
index f3a88f7c89b..24bf77e76ad 100644
--- a/hudi-io/src/main/java/org/apache/hudi/storage/StoragePath.java
+++ b/hudi-io/src/main/java/org/apache/hudi/storage/StoragePath.java
@@ -23,6 +23,9 @@ import org.apache.hudi.ApiMaturityLevel;
 import org.apache.hudi.PublicAPIClass;
 import org.apache.hudi.PublicAPIMethod;
 
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
 import java.io.Serializable;
 import java.net.URI;
 import java.net.URISyntaxException;
@@ -33,12 +36,11 @@ import java.net.URISyntaxException;
  * The APIs are mainly based on {@code org.apache.hadoop.fs.Path} class.
  */
 @PublicAPIClass(maturity = ApiMaturityLevel.EVOLVING)
-// StoragePath
 public class StoragePath implements Comparable, Serializable {
   public static final char SEPARATOR_CHAR = '/';
   public static final char COLON_CHAR = ':';
   public static final String SEPARATOR = "" + SEPARATOR_CHAR;
-  private final URI uri;
+  private URI uri;
   private transient volatile StoragePath cachedParent;
   private transient volatile String cachedName;
   private transient volatile String uriString;
@@ -306,4 +308,12 @@ public class StoragePath implements 
Comparable, Serializable {
 }
 return path.substring(0, indexOfLastSlash);
   }
+
+  private void writeObject(ObjectOutputStream out) throws IOException {
+out.writeObject(uri);
+  }
+
+  private void readObject(ObjectInputStream in) throws IOException, 
ClassNotFoundException {
+uri = (URI) in.readObject();
+  }
 }
diff --git 
a/hudi-io/src/test/java/org/apache/hudi/io/storage/TestStoragePath.java 
b/hudi-io/src/test/java/org/apache/hudi/io/storage/TestStoragePath.java
index 9195ebec9fd..e7ce6ecc838 100644
--- a/hudi-io/src/test/java/org/apache/hudi/io/storage/TestStoragePath.java
+++ b/hudi-io/src/test/java/org/apache/hudi/io/storage/TestStoragePath.java
@@ -22,7 +22,14 @@ package org.apache.hudi.io.storage;
 import org.apache.hudi.storage.StoragePath;
 
 import org.junit.jupiter.api.Test;
-
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.ValueSource;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
 import java.net.URI;
 import java.net.URISyntaxException;
 import java.util.Arrays;
@@ -197,6 +204,25 @@ public class TestStoragePath {
 () -> new StoragePath("a").makeQualified(defaultUri));
   }
 
+  @ParameterizedTest
+  @ValueSource(strings = {
+  "/x/y/1.file#bar",
+  "s3://foo/bar/1%2F2%2F3",
+  "hdfs://host1/a/b/c"
+  })
+  public void testSerializability(String pathStr) throws IOException, 
ClassNotFoundException {
+StoragePath path = new StoragePath(pathStr);
+try (ByteArrayOutputStream baos = new ByteArrayOutputStream();
+ ObjectOutputStream oos = new ObjectOutputStream(baos)) {
+  oos.writeObject(path);
+  try (ByteArrayInputStream bais = new 
ByteArrayInputStream(baos.toByteArray());
+   ObjectInputStream ois = new ObjectInputStream(bais)) {
+StoragePath deserialized = (StoragePath) ois.readObject();
+assertEquals(path.toUri(), deserialized.toUri());
+  }
+}
+  }
+
   @Test
   public void testEquals() {
 assertEquals(new StoragePath("/foo"), new StoragePath("/foo"));
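
For context on the change above: only the URI is written during serialization, while the cached parent, name, and string form stay transient and are rebuilt lazily after deserialization. A hedged sketch of what this enables (assuming the StoragePath accessors such as getName() and getParent(); the holder class is illustrative):

    import org.apache.hudi.storage.StoragePath;

    import java.io.Serializable;

    // Because StoragePath is now Serializable, it can be held by task or closure
    // objects that themselves get serialized and shipped to other processes.
    class PartitionPathHolder implements Serializable {
      private final StoragePath partitionPath;

      PartitionPathHolder(StoragePath partitionPath) {
        this.partitionPath = partitionPath;
      }

      String describe() {
        return partitionPath.getName() + " under " + partitionPath.getParent();
      }
    }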



(hudi) 22/28: [HUDI-7626] Propagate UserGroupInformation from the main thread to the new thread of timeline service threadpool (#11039)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 3cdb6faefcb56bef3e7dcf26a81eb822576aa8ed
Author: Jing Zhang 
AuthorDate: Wed Apr 17 16:40:29 2024 +0800

[HUDI-7626] Propagate UserGroupInformation from the main thread to the new 
thread of timeline service threadpool (#11039)
---
 .../hudi/timeline/service/RequestHandler.java  | 128 +++--
 1 file changed, 70 insertions(+), 58 deletions(-)

diff --git 
a/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java
 
b/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java
index 9385b4eca9e..12e11db403d 100644
--- 
a/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java
+++ 
b/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java
@@ -52,11 +52,13 @@ import io.javalin.http.Context;
 import io.javalin.http.Handler;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.security.UserGroupInformation;
 import org.jetbrains.annotations.NotNull;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
 import java.io.IOException;
+import java.security.PrivilegedExceptionAction;
 import java.util.Arrays;
 import java.util.List;
 import java.util.Map;
@@ -563,76 +565,86 @@ public class RequestHandler {
 
 private final Handler handler;
 private final boolean performRefreshCheck;
+private final UserGroupInformation ugi;
 
 ViewHandler(Handler handler, boolean performRefreshCheck) {
   this.handler = handler;
   this.performRefreshCheck = performRefreshCheck;
+  try {
+ugi = UserGroupInformation.getCurrentUser();
+  } catch (Exception e) {
+LOG.warn("Fail to get ugi", e);
+throw new HoodieException(e);
+  }
 }
 
 @Override
 public void handle(@NotNull Context context) throws Exception {
-  boolean success = true;
-  long beginTs = System.currentTimeMillis();
-  boolean synced = false;
-  boolean refreshCheck = performRefreshCheck && 
!isRefreshCheckDisabledInQuery(context);
-  long refreshCheckTimeTaken = 0;
-  long handleTimeTaken = 0;
-  long finalCheckTimeTaken = 0;
-  try {
-if (refreshCheck) {
-  long beginRefreshCheck = System.currentTimeMillis();
-  synced = syncIfLocalViewBehind(context);
-  long endRefreshCheck = System.currentTimeMillis();
-  refreshCheckTimeTaken = endRefreshCheck - beginRefreshCheck;
-}
+  ugi.doAs((PrivilegedExceptionAction) () -> {
+boolean success = true;
+long beginTs = System.currentTimeMillis();
+boolean synced = false;
+boolean refreshCheck = performRefreshCheck && 
!isRefreshCheckDisabledInQuery(context);
+long refreshCheckTimeTaken = 0;
+long handleTimeTaken = 0;
+long finalCheckTimeTaken = 0;
+try {
+  if (refreshCheck) {
+long beginRefreshCheck = System.currentTimeMillis();
+synced = syncIfLocalViewBehind(context);
+long endRefreshCheck = System.currentTimeMillis();
+refreshCheckTimeTaken = endRefreshCheck - beginRefreshCheck;
+  }
 
-long handleBeginMs = System.currentTimeMillis();
-handler.handle(context);
-long handleEndMs = System.currentTimeMillis();
-handleTimeTaken = handleEndMs - handleBeginMs;
-
-if (refreshCheck) {
-  long beginFinalCheck = System.currentTimeMillis();
-  if (isLocalViewBehind(context)) {
-String lastKnownInstantFromClient = 
context.queryParamAsClass(RemoteHoodieTableFileSystemView.LAST_INSTANT_TS, 
String.class).getOrDefault(HoodieTimeline.INVALID_INSTANT_TS);
-String timelineHashFromClient = 
context.queryParamAsClass(RemoteHoodieTableFileSystemView.TIMELINE_HASH, 
String.class).getOrDefault("");
-HoodieTimeline localTimeline =
-
viewManager.getFileSystemView(context.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM)).getTimeline();
-if (shouldThrowExceptionIfLocalViewBehind(localTimeline, 
timelineHashFromClient)) {
-  String errMsg =
-  "Last known instant from client was "
-  + lastKnownInstantFromClient
-  + " but server has the following timeline "
-  + localTimeline.getInstants();
-  throw new BadRequestResponse(errMsg);
+  long handleBeginMs = System.currentTimeMillis();
+  handler.handle(context);
+  long handleEndMs = System.currentTimeMillis();
+  handleTimeTaken = handleEndMs - handleBeginMs;
+
+  if (refreshCheck) {
+long beginFinalCheck = System.currentTimeMillis();
+if (isLocalVie
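
The change above captures the caller's UserGroupInformation when the handler is constructed and runs the request body inside ugi.doAs(...), so the worker thread from the timeline service threadpool uses the same identity as the thread that created the handler. A minimal standalone sketch of that propagation pattern with the plain Hadoop API (not the Hudi handler itself):

    import org.apache.hadoop.security.UserGroupInformation;

    import java.security.PrivilegedExceptionAction;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class UgiPropagationExample {
      public static void main(String[] args) throws Exception {
        // Captured on the submitting (main) thread.
        UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Callable<Void> task = () -> ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
          // Filesystem or metastore calls placed here run with the captured identity,
          // not with whatever identity the pool thread would otherwise have.
          System.out.println("running as " + UserGroupInformation.getCurrentUser().getUserName());
          return null;
        });
        pool.submit(task);
        pool.shutdown();
      }
    }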

(hudi) 11/28: [HUDI-7378] Fix Spark SQL DML with custom key generator (#10615)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 922e1eb87d012d626ea35de0f0bca49d17d0acb3
Author: Y Ethan Guo 
AuthorDate: Fri Apr 12 22:51:03 2024 -0700

[HUDI-7378] Fix Spark SQL DML with custom key generator (#10615)
---
 .../factory/HoodieSparkKeyGeneratorFactory.java|   4 +
 .../org/apache/hudi/util/SparkKeyGenUtils.scala|  16 +-
 .../scala/org/apache/hudi/HoodieWriterUtils.scala  |  20 +-
 .../spark/sql/hudi/ProvidesHoodieConfig.scala  |  60 ++-
 .../spark/sql/hudi/TestProvidesHoodieConfig.scala  |  79 +++
 .../hudi/command/MergeIntoHoodieTableCommand.scala |   5 +-
 .../TestSparkSqlWithCustomKeyGenerator.scala   | 571 +
 7 files changed, 742 insertions(+), 13 deletions(-)

diff --git 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/factory/HoodieSparkKeyGeneratorFactory.java
 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/factory/HoodieSparkKeyGeneratorFactory.java
index 1ea5adcd6b4..dcc2eaec9eb 100644
--- 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/factory/HoodieSparkKeyGeneratorFactory.java
+++ 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/factory/HoodieSparkKeyGeneratorFactory.java
@@ -79,6 +79,10 @@ public class HoodieSparkKeyGeneratorFactory {
 
   public static KeyGenerator createKeyGenerator(TypedProperties props) throws 
IOException {
 String keyGeneratorClass = getKeyGeneratorClassName(props);
+return createKeyGenerator(keyGeneratorClass, props);
+  }
+
+  public static KeyGenerator createKeyGenerator(String keyGeneratorClass, 
TypedProperties props) throws IOException {
 boolean autoRecordKeyGen = 
KeyGenUtils.isAutoGeneratedRecordKeysEnabled(props)
 //Need to prevent overwriting the keygen for spark sql merge into 
because we need to extract
 //the recordkey from the meta cols if it exists. Sql keygen will use 
pkless keygen if needed.
diff --git 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/util/SparkKeyGenUtils.scala
 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/util/SparkKeyGenUtils.scala
index 7b91ae5a728..bd094464096 100644
--- 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/util/SparkKeyGenUtils.scala
+++ 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/util/SparkKeyGenUtils.scala
@@ -21,8 +21,8 @@ import org.apache.hudi.common.config.TypedProperties
 import org.apache.hudi.common.util.StringUtils
 import org.apache.hudi.common.util.ValidationUtils.checkArgument
 import org.apache.hudi.keygen.constant.KeyGeneratorOptions
-import org.apache.hudi.keygen.{AutoRecordKeyGeneratorWrapper, 
AutoRecordGenWrapperKeyGenerator, CustomAvroKeyGenerator, CustomKeyGenerator, 
GlobalAvroDeleteKeyGenerator, GlobalDeleteKeyGenerator, KeyGenerator, 
NonpartitionedAvroKeyGenerator, NonpartitionedKeyGenerator}
 import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory
+import org.apache.hudi.keygen.{AutoRecordKeyGeneratorWrapper, 
CustomAvroKeyGenerator, CustomKeyGenerator, GlobalAvroDeleteKeyGenerator, 
GlobalDeleteKeyGenerator, KeyGenerator, NonpartitionedAvroKeyGenerator, 
NonpartitionedKeyGenerator}
 
 object SparkKeyGenUtils {
 
@@ -35,6 +35,20 @@ object SparkKeyGenUtils {
 getPartitionColumns(keyGenerator, props)
   }
 
+  /**
+   * @param KeyGenClassNameOption key generator class name if present.
+   * @param props config properties.
+   * @return partition column names only, concatenated by ","
+   */
+  def getPartitionColumns(KeyGenClassNameOption: Option[String], props: 
TypedProperties): String = {
+val keyGenerator = if (KeyGenClassNameOption.isEmpty) {
+  HoodieSparkKeyGeneratorFactory.createKeyGenerator(props)
+} else {
+  
HoodieSparkKeyGeneratorFactory.createKeyGenerator(KeyGenClassNameOption.get, 
props)
+}
+getPartitionColumns(keyGenerator, props)
+  }
+
   /**
* @param keyGen key generator class name
* @return partition columns
diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala
index 0a4ef7a3d63..fade5957210 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala
@@ -197,8 +197,26 @@ object HoodieWriterUtils {
   
diffConfigs.append(s"KeyGenerator:\t$datasourceKeyGen\t$tableConfigKeyGen\n")
 }
 
+// Please note that the validation of partition path fields needs the 
key generator class
+// for the table, since the custom key generator expects a different 
format of
+// the value of the write config 
"hoodie.datasource.write

(hudi) 13/28: [HUDI-7606] Unpersist RDDs after table services, mainly compaction and clustering (#11000)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 7e5871a4634fc7b0d195a4148bb02dcf3f57d8e1
Author: Rajesh Mahindra <76502047+rmahindra...@users.noreply.github.com>
AuthorDate: Sun Apr 14 14:38:55 2024 -0700

[HUDI-7606] Unpersist RDDs after table services, mainly compaction and 
clustering (#11000)

-

Co-authored-by: rmahindra123 
---
 .../hudi/client/BaseHoodieTableServiceClient.java  | 12 
 .../apache/hudi/client/BaseHoodieWriteClient.java  |  2 +-
 .../hudi/client/SparkRDDTableServiceClient.java|  6 ++
 .../apache/hudi/client/SparkRDDWriteClient.java| 21 +--
 .../hudi/client/utils/SparkReleaseResources.java   | 64 ++
 5 files changed, 85 insertions(+), 20 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
index e408dc7a779..d6ec07b89d0 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
@@ -331,6 +331,7 @@ public abstract class BaseHoodieTableServiceClient 
extends BaseHoodieCl
   CompactHelpers.getInstance().completeInflightCompaction(table, 
compactionCommitTime, metadata);
 } finally {
   this.txnManager.endTransaction(Option.of(compactionInstant));
+  releaseResources(compactionCommitTime);
 }
 WriteMarkersFactory.get(config.getMarkersType(), table, 
compactionCommitTime)
 .quietDeleteMarkerDir(context, config.getMarkersDeleteParallelism());
@@ -391,6 +392,7 @@ public abstract class BaseHoodieTableServiceClient 
extends BaseHoodieCl
   CompactHelpers.getInstance().completeInflightLogCompaction(table, 
logCompactionCommitTime, metadata);
 } finally {
   this.txnManager.endTransaction(Option.of(logCompactionInstant));
+  releaseResources(logCompactionCommitTime);
 }
 WriteMarkersFactory.get(config.getMarkersType(), table, 
logCompactionCommitTime)
 .quietDeleteMarkerDir(context, config.getMarkersDeleteParallelism());
@@ -520,6 +522,7 @@ public abstract class BaseHoodieTableServiceClient 
extends BaseHoodieCl
   throw new HoodieClusteringException("unable to transition clustering 
inflight to complete: " + clusteringCommitTime, e);
 } finally {
   this.txnManager.endTransaction(Option.of(clusteringInstant));
+  releaseResources(clusteringCommitTime);
 }
 WriteMarkersFactory.get(config.getMarkersType(), table, 
clusteringCommitTime)
 .quietDeleteMarkerDir(context, config.getMarkersDeleteParallelism());
@@ -759,6 +762,7 @@ public abstract class BaseHoodieTableServiceClient 
extends BaseHoodieCl
   + " Earliest Retained Instant :" + 
metadata.getEarliestCommitToRetain()
   + " cleanerElapsedMs" + durationMs);
 }
+releaseResources(cleanInstantTime);
 return metadata;
   }
 
@@ -1133,4 +1137,12 @@ public abstract class BaseHoodieTableServiceClient extends BaseHoodieCl
   }
 }
   }
+
+  /**
+   * Called after each commit of a compaction or clustering table service,
+   * to release any resources used.
+   */
+  protected void releaseResources(String instantTime) {
+// do nothing here
+  }
 }
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java
index d5d74e94673..fdc9eeca90d 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java
@@ -237,11 +237,11 @@ public abstract class BaseHoodieWriteClient 
extends BaseHoodieClient
   commit(table, commitActionType, instantTime, metadata, stats, 
writeStatuses);
   postCommit(table, metadata, instantTime, extraMetadata);
   LOG.info("Committed " + instantTime);
-  releaseResources(instantTime);
 } catch (IOException e) {
   throw new HoodieCommitException("Failed to complete commit " + 
config.getBasePath() + " at time " + instantTime, e);
 } finally {
   this.txnManager.endTransaction(Option.of(inflightInstant));
+  releaseResources(instantTime);
 }
 
 // trigger clean and archival.
diff --git 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDTableServiceClient.java
 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDTableServiceClient.java
index 54d91fae3cf..98914be7496 100644
--- 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDTableServiceClient.java
+++ 
b/hudi-
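
The releaseResources hook wired in above amounts to unpersisting RDDs that were cached for the finished table service so executor memory is returned promptly. A generic sketch of the idea with the plain Spark Java API (not the actual SparkReleaseResources implementation):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import java.util.Map;

    public class ReleaseCachedRddsExample {
      // Unpersist everything still cached; non-blocking so the commit path is not delayed.
      static void releaseAll(JavaSparkContext jsc) {
        Map<Integer, JavaRDD<?>> persisted = jsc.getPersistentRDDs();
        persisted.values().forEach(rdd -> rdd.unpersist(false));
      }
    }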

(hudi) 05/28: [HUDI-7600] Shutdown ExecutorService when HiveMetastoreBasedLockProvider is closed (#10993)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 86f6bdf991e57a8ac3606beab08e0aa2cf7b3bf5
Author: Zouxxyy 
AuthorDate: Thu Apr 11 13:03:14 2024 +0800

[HUDI-7600] Shutdown ExecutorService when HiveMetastoreBasedLockProvider is 
closed (#10993)
---
 .../hudi/hive/transaction/lock/HiveMetastoreBasedLockProvider.java   | 1 +
 1 file changed, 1 insertion(+)

diff --git 
a/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/transaction/lock/HiveMetastoreBasedLockProvider.java
 
b/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/transaction/lock/HiveMetastoreBasedLockProvider.java
index df848957492..0280621bb53 100644
--- 
a/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/transaction/lock/HiveMetastoreBasedLockProvider.java
+++ 
b/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/transaction/lock/HiveMetastoreBasedLockProvider.java
@@ -154,6 +154,7 @@ public class HiveMetastoreBasedLockProvider implements 
LockProvider
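
The one-line fix above follows a general rule: a lock provider, or any AutoCloseable, that owns an ExecutorService should shut it down in close(), otherwise its non-daemon worker threads can keep the JVM alive after the lock is released. A minimal sketch of the pattern (illustrative class, not the Hudi provider):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class OwnedExecutorResource implements AutoCloseable {
      private final ExecutorService executor = Executors.newSingleThreadExecutor();

      @Override
      public void close() {
        // Let any in-flight task finish, then stop the worker thread.
        executor.shutdown();
      }
    }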

(hudi) 04/28: [MINOR] Fix BUG: HoodieLogFormatWriter: unable to close output stream for log file HoodieLogFile{xxx} (#10989)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 705e5f59f28f72703a647487f9acbab65155f45b
Author: Silly Carbon 
AuthorDate: Wed Apr 10 18:21:57 2024 +0800

[MINOR] Fix BUG: HoodieLogFormatWriter: unable to close output stream for 
log file HoodieLogFile{xxx} (#10989)

* Cause: java.lang.IllegalStateException: Shutdown in progress. When
`org.apache.hudi.common.table.log.HoodieLogFormatWriter.close` tries to
`removeShutdownHook` while the shutdown hook itself is running, the JVM has
already cleared its hooks (hooks == null), so closing the writer fails.
---
 .../java/org/apache/hudi/common/table/log/HoodieLogFormatWriter.java| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatWriter.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatWriter.java
index 0b16d2ee2a6..d021cd2c499 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatWriter.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatWriter.java
@@ -294,7 +294,7 @@ public class HoodieLogFormatWriter implements 
HoodieLogFormat.Writer {
 try {
   LOG.warn("running logformatwriter hook");
   if (output != null) {
-close();
+closeStream();
   }
 } catch (Exception e) {
   LOG.warn("unable to close output stream for log file " + logFile, e);
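
The failure mode described in the commit message can be reproduced outside Hudi: once the JVM has begun shutting down, Runtime.removeShutdownHook throws IllegalStateException, so the hook must not go through a close() path that tries to deregister itself. A small self-contained sketch:

    public class ShutdownHookExample {
      public static void main(String[] args) {
        final Thread[] hook = new Thread[1];
        hook[0] = new Thread(() -> {
          try {
            // Fails: hook (de)registration is forbidden once shutdown has begun.
            Runtime.getRuntime().removeShutdownHook(hook[0]);
          } catch (IllegalStateException e) {
            System.err.println("expected: " + e.getMessage()); // "Shutdown in progress"
          }
        });
        Runtime.getRuntime().addShutdownHook(hook[0]);
        // A normal exit runs the hook and triggers the exception above, which is why
        // the fix calls closeStream() from the hook instead of the full close().
      }
    }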



(hudi) 08/28: [HUDI-7605] Allow merger strategy to be set in spark sql writer (#10999)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit e9cd05376a337a806cdfc1a6a1f0d78fa78ac0ec
Author: Jon Vexler 
AuthorDate: Thu Apr 11 21:20:07 2024 -0400

[HUDI-7605] Allow merger strategy to be set in spark sql writer (#10999)
---
 .../scala/org/apache/hudi/HoodieSparkSqlWriter.scala |  1 +
 .../apache/hudi/functional/TestMORDataSource.scala   | 20 
 2 files changed, 21 insertions(+)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
index 7020781faf0..ad19ec48c7a 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
@@ -302,6 +302,7 @@ class HoodieSparkSqlWriterInternal {
   .setPartitionMetafileUseBaseFormat(useBaseFormatMetaFile)
   
.setShouldDropPartitionColumns(hoodieConfig.getBooleanOrDefault(HoodieTableConfig.DROP_PARTITION_COLUMNS))
   .setCommitTimezone(timelineTimeZone)
+  
.setRecordMergerStrategy(hoodieConfig.getStringOrDefault(DataSourceWriteOptions.RECORD_MERGER_STRATEGY))
   .initTable(sparkContext.hadoopConfiguration, path)
   }
   val instantTime = HoodieActiveTimeline.createNewInstantTime()
diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
index 45bd3c645d4..b878eb76c40 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
@@ -1403,4 +1403,24 @@ class TestMORDataSource extends 
HoodieSparkClientTestBase with SparkDatasetMixin
   basePath
 }
   }
+
+  @Test
+  def testMergerStrategySet(): Unit = {
+val (writeOpts, _) = getWriterReaderOpts()
+val input = recordsToStrings(dataGen.generateInserts("000", 1)).asScala
+val inputDf= spark.read.json(spark.sparkContext.parallelize(input, 1))
+val mergerStrategyName = "example_merger_strategy"
+inputDf.write.format("hudi")
+  .options(writeOpts)
+  .option(DataSourceWriteOptions.TABLE_TYPE.key, "MERGE_ON_READ")
+  .option(DataSourceWriteOptions.OPERATION.key, 
DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+  .option(DataSourceWriteOptions.RECORD_MERGER_STRATEGY.key(), 
mergerStrategyName)
+  .mode(SaveMode.Overwrite)
+  .save(basePath)
+metaClient = HoodieTableMetaClient.builder()
+  .setBasePath(basePath)
+  .setConf(spark.sessionState.newHadoopConf)
+  .build()
+assertEquals(metaClient.getTableConfig.getRecordMergerStrategy, 
mergerStrategyName)
+  }
 }



(hudi) 12/28: [HUDI-7616] Avoid multiple cleaner plans and deprecate hoodie.clean.allow.multiple (#11013)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit f12cd043095c7a41f84a910bc588cb171a8bfa41
Author: Y Ethan Guo 
AuthorDate: Sat Apr 13 19:04:44 2024 -0700

[HUDI-7616] Avoid multiple cleaner plans and deprecate 
hoodie.clean.allow.multiple (#11013)
---
 .../src/main/java/org/apache/hudi/config/HoodieCleanConfig.java | 4 +++-
 .../src/test/java/org/apache/hudi/table/TestCleaner.java| 6 +++---
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java
index a4114152023..e023bee4274 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java
@@ -167,11 +167,13 @@ public class HoodieCleanConfig extends HoodieConfig {
   + "execution is slow due to limited parallelism, you can increase 
this to tune the "
   + "performance..");
 
+  @Deprecated
   public static final ConfigProperty ALLOW_MULTIPLE_CLEANS = 
ConfigProperty
   .key("hoodie.clean.allow.multiple")
-  .defaultValue(true)
+  .defaultValue(false)
   .markAdvanced()
   .sinceVersion("0.11.0")
+  .deprecatedAfter("1.0.0")
   .withDocumentation("Allows scheduling/executing multiple cleans by 
enabling this config. If users prefer to strictly ensure clean requests should 
be mutually exclusive, "
   + ".i.e. a 2nd clean will not be scheduled if another clean is not 
yet completed to avoid repeat cleaning of same files, they might want to 
disable this config.");
 
diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
index b18238f3392..6a8ce948373 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
@@ -593,13 +593,13 @@ public class TestCleaner extends HoodieCleanerTestBase {
   timeline = metaClient.reloadActiveTimeline();
 
   assertEquals(0, cleanStats.size(), "Must not clean any files");
-  assertEquals(1, timeline.getTimelineOfActions(
+  assertEquals(0, timeline.getTimelineOfActions(
   
CollectionUtils.createSet(HoodieTimeline.CLEAN_ACTION)).filterInflightsAndRequested().countInstants());
   assertEquals(0, timeline.getTimelineOfActions(
   
CollectionUtils.createSet(HoodieTimeline.CLEAN_ACTION)).filterInflights().countInstants());
-  assertEquals(--cleanCount, timeline.getTimelineOfActions(
+  assertEquals(cleanCount, timeline.getTimelineOfActions(
   
CollectionUtils.createSet(HoodieTimeline.CLEAN_ACTION)).filterCompletedInstants().countInstants());
-  assertTrue(timeline.getTimelineOfActions(
+  assertFalse(timeline.getTimelineOfActions(
   
CollectionUtils.createSet(HoodieTimeline.CLEAN_ACTION)).filterInflightsAndRequested().containsInstant(makeNewCommitTime(--instantClean,
 "%09d")));
 }
   }
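
With the default flipped to false, only one cleaner plan can be pending at a time; a pipeline that still relies on overlapping cleans has to opt back in explicitly. A hedged sketch of supplying the key as an ordinary writer property (the key names are the ones shown in the diff; everything else is illustrative):

    import java.util.Properties;

    public class CleanConfigExample {
      static Properties writerProps() {
        Properties props = new Properties();
        props.setProperty("hoodie.clean.automatic", "true");
        // Opt back into the previous behaviour of allowing multiple concurrent cleans.
        props.setProperty("hoodie.clean.allow.multiple", "true");
        return props;
      }
    }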



(hudi) 01/28: [HUDI-7556] Fixing false positive validation with MDT validator (#10986)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 86968a6b9e4655ae111d599ea7c55e9f16f16625
Author: Sivabalan Narayanan 
AuthorDate: Tue May 14 17:24:06 2024 -0700

[HUDI-7556] Fixing false positive validation with MDT validator (#10986)
---
 .../utilities/HoodieMetadataTableValidator.java|  96 +---
 .../TestHoodieMetadataTableValidator.java  | 125 -
 2 files changed, 181 insertions(+), 40 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
index bbe8610abe3..0e6630967b3 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
@@ -52,6 +52,7 @@ import org.apache.hudi.common.util.ConfigUtils;
 import org.apache.hudi.common.util.FileIOUtils;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.ParquetUtils;
+import org.apache.hudi.common.util.VisibleForTesting;
 import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieIOException;
@@ -514,7 +515,9 @@ public class HoodieMetadataTableValidator implements 
Serializable {
 }
 
 HoodieSparkEngineContext engineContext = new HoodieSparkEngineContext(jsc);
-List allPartitions = validatePartitions(engineContext, basePath);
+// compare partitions
+
+List allPartitions = validatePartitions(engineContext, basePath, 
metaClient);
 
 if (allPartitions.isEmpty()) {
   LOG.warn("The result of getting all partitions is null or empty, skip 
current validation. {}", taskLabels);
@@ -612,39 +615,14 @@ public class HoodieMetadataTableValidator implements 
Serializable {
   /**
* Compare the listing partitions result between metadata table and 
fileSystem.
*/
-  private List validatePartitions(HoodieSparkEngineContext 
engineContext, String basePath) {
+  @VisibleForTesting
+  List validatePartitions(HoodieSparkEngineContext engineContext, 
String basePath, HoodieTableMetaClient metaClient) {
 // compare partitions
-List allPartitionPathsFromFS = 
FSUtils.getAllPartitionPaths(engineContext, basePath, false, 
cfg.assumeDatePartitioning);
 HoodieTimeline completedTimeline = 
metaClient.getCommitsTimeline().filterCompletedInstants();
+List allPartitionPathsFromFS = 
getPartitionsFromFileSystem(engineContext, basePath, metaClient.getFs(),
+completedTimeline);
 
-// ignore partitions created by uncommitted ingestion.
-allPartitionPathsFromFS = 
allPartitionPathsFromFS.stream().parallel().filter(part -> {
-  HoodiePartitionMetadata hoodiePartitionMetadata =
-  new HoodiePartitionMetadata(metaClient.getFs(), 
FSUtils.getPartitionPath(basePath, part));
-
-  Option instantOption = 
hoodiePartitionMetadata.readPartitionCreatedCommitTime();
-  if (instantOption.isPresent()) {
-String instantTime = instantOption.get();
-// There are two cases where the created commit time is written to the 
partition metadata:
-// (1) Commit C1 creates the partition and C1 succeeds, the partition 
metadata has C1 as
-// the created commit time.
-// (2) Commit C1 creates the partition, the partition metadata is 
written, and C1 fails
-// during writing data files.  Next time, C2 adds new data to the same 
partition after C1
-// is rolled back. In this case, the partition metadata still has C1 
as the created commit
-// time, since Hudi does not rewrite the partition metadata in C2.
-if (!completedTimeline.containsOrBeforeTimelineStarts(instantTime)) {
-  Option lastInstant = completedTimeline.lastInstant();
-  return lastInstant.isPresent()
-  && HoodieTimeline.compareTimestamps(
-  instantTime, LESSER_THAN_OR_EQUALS, 
lastInstant.get().getTimestamp());
-}
-return true;
-  } else {
-return false;
-  }
-}).collect(Collectors.toList());
-
-List allPartitionPathsMeta = 
FSUtils.getAllPartitionPaths(engineContext, basePath, true, 
cfg.assumeDatePartitioning);
+List allPartitionPathsMeta = getPartitionsFromMDT(engineContext, 
basePath);
 
 Collections.sort(allPartitionPathsFromFS);
 Collections.sort(allPartitionPathsMeta);
@@ -652,26 +630,23 @@ public class HoodieMetadataTableValidator implements 
Serializable {
 if (allPartitionPathsFromFS.size() != allPartitionPathsMeta.size()
 || !allPartitionPathsFromFS.equals(allPartitionPathsMeta)) {
   List additionalFromFS = new ArrayList<>(allPartitionPathsFromFS);
-  additionalFromFS.remove(allPartitionPathsMeta);
+  addit

(hudi) branch branch-0.x updated (b71f03776ba -> 31d24d7600b)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git


 discard b71f03776ba [HUDI-7637] Make StoragePathInfo Comparable (#11050)
 discard ab3741cd528 [HUDI-7635] Add default block size and openSeekable APIs 
to HoodieStorage (#11048)
 discard 6ac417341eb [HUDI-7636] Make StoragePath Serializable (#11049)
 discard 6238a9dd9a3 [MINOR] Remove redundant TestStringUtils in hudi-common 
(#11046)
 discard 07120a64e81 [HUDI-7633] Use try with resources for AutoCloseable 
(#11045)
 discard 74f91c70474 [HUDI-4228] Clean up literal usage in Hudi CLI argument 
check (#11042)
 discard 6df621aec1f [HUDI-7626] Propagate UserGroupInformation from the main 
thread to the new thread of timeline service threadpool (#11039)
 discard 1c2e2b5341c [HUDI-7625] Avoid unnecessary rewrite for metadata table 
(#11038)
 discard 9bca6153adc [HUDI-7578] Avoid unnecessary rewriting to improve 
performance (#11028)
 discard 6b0c67a5c9f [MINOR] Rename location to path in `makeQualified` (#11037)
 discard 862575a546b [MINOR] Remove redundant lines in StreamSync and 
TestStreamSyncUnitTests (#11027)
 discard 21a1b7806db [HUDI-6762] Removed usages of 
MetadataRecordsGenerationParams (#10962)
 discard 38701e37637 [HUDI-7619] Removed code duplicates in 
HoodieTableMetadataUtil (#11022)
 discard 3a9ca9ecd0a [HUDI-7584] Always read log block lazily and remove 
readBlockLazily argument (#11015)
 discard ba521081ff6 [HUDI-7615] Mark a few write configs with the correct 
sinceVersion (#11012)
 discard a5745c6d550 [HUDI-7606] Unpersist RDDs after table services, mainly 
compaction and clustering (#11000)
 discard b5b04af9d1e [HUDI-7616] Avoid multiple cleaner plans and deprecate 
hoodie.clean.allow.multiple (#11013)
 discard 3e6bb0d77bc [HUDI-7378] Fix Spark SQL DML with custom key generator 
(#10615)
 discard 50caab4ee11 [HUDI-7601] Add heartbeat mechanism to refresh lock 
(#10994)
 discard 1013610ee8b [HUDI-7290]  Don't assume ReplaceCommits are always 
Clustering (#10479)
 discard ca6c4f96dcc [HUDI-7605] Allow merger strategy to be set in spark sql 
writer (#10999)
 discard 804b88d212c [HUDI-6441] Passing custom Headers with Hudi Callback URL 
(#10970)
 discard 0caed098c35 [HUDI-7391] HoodieMetadataMetrics should use Metrics 
instance for metrics registry (#10635)
 discard 604c8fb7c81 [HUDI-7600] Shutdown ExecutorService when 
HiveMetastoreBasedLockProvider is closed (#10993)
 discard 271a22712c2 [MINOR] Fix BUG: HoodieLogFormatWriter: unable to close 
output stream for log file HoodieLogFile{xxx} (#10989)
 discard a1d601c5866 [HUDI-7597] Add logs of Kafka offsets when the checkpoint 
is out of bound (#10987)
 discard 92ee83c0c7c [HUDI-7583] Read log block header only for the schema and 
instant time (#10984)
 discard 1ed67a94fd4 [HUDI-7556] Fixing false positive validation with MDT 
validator (#10986)
 new 86968a6b9e4 [HUDI-7556] Fixing false positive validation with MDT 
validator (#10986)
 new a9c7eebd0fc [HUDI-7583] Read log block header only for the schema and 
instant time (#10984)
 new ef96676f39e [HUDI-7597] Add logs of Kafka offsets when the checkpoint 
is out of bound (#10987)
 new 705e5f59f28 [MINOR] Fix BUG: HoodieLogFormatWriter: unable to close 
output stream for log file HoodieLogFile{xxx} (#10989)
 new 86f6bdf991e [HUDI-7600] Shutdown ExecutorService when 
HiveMetastoreBasedLockProvider is closed (#10993)
 new 858fde11fdd [HUDI-7391] HoodieMetadataMetrics should use Metrics 
instance for metrics registry (#10635)
 new db53c7af20c [HUDI-6441] Passing custom Headers with Hudi Callback URL 
(#10970)
 new e9cd05376a3 [HUDI-7605] Allow merger strategy to be set in spark sql 
writer (#10999)
 new f9ffb646462 [HUDI-7290]  Don't assume ReplaceCommits are always 
Clustering (#10479)
 new 4aa23beec97 [HUDI-7601] Add heartbeat mechanism to refresh lock 
(#10994)
 new 922e1eb87d0 [HUDI-7378] Fix Spark SQL DML with custom key generator 
(#10615)
 new f12cd043095 [HUDI-7616] Avoid multiple cleaner plans and deprecate 
hoodie.clean.allow.multiple (#11013)
 new 7e5871a4634 [HUDI-7606] Unpersist RDDs after table services, mainly 
compaction and clustering (#11000)
 new 89ff73de175 [HUDI-7615] Mark a few write configs with the correct 
sinceVersion (#11012)
 new e39fedfa5b6 [HUDI-7584] Always read log block lazily and remove 
readBlockLazily argument (#11015)
 new 59a1b28849f [HUDI-7619] Removed code duplicates in 
HoodieTableMetadataUtil (#11022)
 new 566039010cf [HUDI-6762] Removed usages of 
MetadataRecordsGenerationParams (#10962)
 new f0cf58d8c99 [MINOR] Remove redundant lines in StreamSync and 
TestStreamSyncUnitTests (#11027)
 new 51f7557070d [MINOR] Rename location to path in `makeQualified` (#11037)
 new ce9ff1cb21a [HUDI-7578] Avoid unnecessary rewriting to improve 
performance (#11028)
 new a5304f586f4 [HUDI-7625] Avoid unnecessary rewrite 

(hudi) branch master updated: [HUDI-7759] Remove Hadoop dependencies in hudi-common module (#11220)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 7e5ff1ec89f [HUDI-7759] Remove Hadoop dependencies in hudi-common 
module (#11220)
7e5ff1ec89f is described below

commit 7e5ff1ec89f298ab4fd16c1774917695ca56
Author: Y Ethan Guo 
AuthorDate: Tue May 14 17:25:45 2024 -0700

[HUDI-7759] Remove Hadoop dependencies in hudi-common module (#11220)

Co-authored-by: Jonathan Vexler <=>
---
 hudi-common/pom.xml| 18 --
 .../table/view/TestPriorityBasedFileSystemView.java|  2 +-
 2 files changed, 1 insertion(+), 19 deletions(-)

diff --git a/hudi-common/pom.xml b/hudi-common/pom.xml
index 1b1f2f15a2a..221590cd69e 100644
--- a/hudi-common/pom.xml
+++ b/hudi-common/pom.xml
@@ -189,24 +189,6 @@
   rocksdbjni
 
 
-
-
-  org.apache.hadoop
-  hadoop-client
-  
-
-  javax.servlet
-  *
-
-  
-  provided
-
-
-  org.apache.hadoop
-  hadoop-hdfs
-  provided
-
-
 
   org.apache.hudi
   hudi-io
diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/table/view/TestPriorityBasedFileSystemView.java
 
b/hudi-common/src/test/java/org/apache/hudi/common/table/view/TestPriorityBasedFileSystemView.java
index eb30e1387c1..3d9a142e055 100644
--- 
a/hudi-common/src/test/java/org/apache/hudi/common/table/view/TestPriorityBasedFileSystemView.java
+++ 
b/hudi-common/src/test/java/org/apache/hudi/common/table/view/TestPriorityBasedFileSystemView.java
@@ -767,7 +767,7 @@ public class TestPriorityBasedFileSystemView {
 
 @Override
 public void append(LogEvent event) {
-  log.add(event);
+  log.add(event.toImmutable());
 }
 
 public List getLog() {
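
The test change above (storing event.toImmutable()) matters because Log4j2 may reuse or mutate its LogEvent instance after append() returns, so keeping a reference to the original can later show a different message. A sketch of the capturing-appender pattern (assumes Log4j2 core 2.x on the classpath; the class name and the unstarted appender are illustrative):

    import org.apache.logging.log4j.core.LogEvent;
    import org.apache.logging.log4j.core.appender.AbstractAppender;
    import org.apache.logging.log4j.core.config.Property;

    import java.util.ArrayList;
    import java.util.List;

    class CapturingAppender extends AbstractAppender {
      private final List<LogEvent> events = new ArrayList<>();

      CapturingAppender(String name) {
        super(name, null, null, true, Property.EMPTY_ARRAY);
      }

      @Override
      public void append(LogEvent event) {
        // Copy before storing: the framework can recycle the mutable event object.
        events.add(event.toImmutable());
      }

      List<LogEvent> getEvents() {
        return events;
      }
    }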



Re: [PR] [HUDI-7759] Remove Hadoop dependencies in hudi-common module [hudi]

2024-05-14 Thread via GitHub


yihua merged PR #11220:
URL: https://github.com/apache/hudi/pull/11220


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) 25/28: [MINOR] Remove redundant TestStringUtils in hudi-common (#11046)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 6238a9dd9a33feb3e8c380ea80c6f37a0cad08af
Author: Y Ethan Guo 
AuthorDate: Wed Apr 17 21:34:06 2024 -0700

[MINOR] Remove redundant TestStringUtils in hudi-common (#11046)
---
 .../apache/hudi/common/util/TestStringUtils.java   | 124 -
 1 file changed, 124 deletions(-)

diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java 
b/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java
deleted file mode 100644
index 54985056bf0..000
--- a/hudi-common/src/test/java/org/apache/hudi/common/util/TestStringUtils.java
+++ /dev/null
@@ -1,124 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License.  You may obtain a copy of the License at
- *
- *  http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.hudi.common.util;
-
-import org.junit.jupiter.api.Test;
-
-import java.nio.ByteBuffer;
-import java.util.ArrayList;
-import java.util.Arrays;
-import java.util.Collections;
-
-import static org.apache.hudi.common.util.StringUtils.getUTF8Bytes;
-import static org.junit.jupiter.api.Assertions.assertEquals;
-import static org.junit.jupiter.api.Assertions.assertNotEquals;
-import static org.junit.jupiter.api.Assertions.assertNull;
-import static org.junit.jupiter.api.Assertions.assertTrue;
-
-/**
- * Tests {@link StringUtils}.
- */
-public class TestStringUtils {
-
-  private static final String[] STRINGS = {"This", "is", "a", "test"};
-
-  @Test
-  public void testStringJoinWithDelim() {
-String joinedString = StringUtils.joinUsingDelim("-", STRINGS);
-assertEquals(STRINGS.length, joinedString.split("-").length);
-  }
-
-  @Test
-  public void testStringJoin() {
-assertNotEquals(null, StringUtils.join(""));
-assertNotEquals(null, StringUtils.join(STRINGS));
-  }
-
-  @Test
-  public void testStringJoinWithJavaImpl() {
-assertNull(StringUtils.join(",", null));
-assertEquals("", String.join(",", Collections.singletonList("")));
-assertEquals(",", String.join(",", Arrays.asList("", "")));
-assertEquals("a,", String.join(",", Arrays.asList("a", "")));
-  }
-
-  @Test
-  public void testStringNullToEmpty() {
-String str = "This is a test";
-assertEquals(str, StringUtils.nullToEmpty(str));
-assertEquals("", StringUtils.nullToEmpty(null));
-  }
-
-  @Test
-  public void testStringObjToString() {
-assertNull(StringUtils.objToString(null));
-assertEquals("Test String", StringUtils.objToString("Test String"));
-
-// assert byte buffer
-ByteBuffer byteBuffer1 = ByteBuffer.wrap(getUTF8Bytes("1234"));
-ByteBuffer byteBuffer2 = ByteBuffer.wrap(getUTF8Bytes("5678"));
-// assert equal because ByteBuffer has overwritten the toString to return 
a summary string
-assertEquals(byteBuffer1.toString(), byteBuffer2.toString());
-// assert not equal
-assertNotEquals(StringUtils.objToString(byteBuffer1), 
StringUtils.objToString(byteBuffer2));
-  }
-
-  @Test
-  public void testStringEmptyToNull() {
-assertNull(StringUtils.emptyToNull(""));
-assertEquals("Test String", StringUtils.emptyToNull("Test String"));
-  }
-
-  @Test
-  public void testStringNullOrEmpty() {
-assertTrue(StringUtils.isNullOrEmpty(null));
-assertTrue(StringUtils.isNullOrEmpty(""));
-assertNotEquals(null, StringUtils.isNullOrEmpty("this is not empty"));
-assertTrue(StringUtils.isNullOrEmpty(""));
-  }
-
-  @Test
-  public void testSplit() {
-assertEquals(new ArrayList<>(), StringUtils.split(null, ","));
-assertEquals(new ArrayList<>(), StringUtils.split("", ","));
-assertEquals(Arrays.asList("a", "b", "c"), StringUtils.split("a,b, c", 
","));
-assertEquals(Arrays.asList("a", "b", "c"), StringUtils.split("a,b,, c ", 
","));
-  }
-
-  @Test
-  public void testHexString() {
-String str = "abcd";
-assertEquals(StringUtils.toHexString(getUTF8Bytes(str)), 
toHexString(getUTF8Bytes(str)));
-  }
-
-  private static String toHexString(byte[] bytes) {
-StringBuilder sb = new StringBuilder(bytes.length * 2);
-for (byte b : bytes) {
-  sb.append(String.format("%02x", b));
-}
-return sb.toString();
-  }
-
- 

(hudi) 06/28: [HUDI-7391] HoodieMetadataMetrics should use Metrics instance for metrics registry (#10635)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 0caed098c35f168df9d78f52ab426b3b3255be9d
Author: Lokesh Jain 
AuthorDate: Tue May 14 16:04:34 2024 -0700

[HUDI-7391] HoodieMetadataMetrics should use Metrics instance for metrics 
registry (#10635)

Currently HoodieMetadataMetrics stores metrics in memory, and these metrics
are not pushed by the metric reporters, which are configured within the
Metrics instance. List of changes in the PR:

Metrics-related classes have been moved from hudi-client-common to
hudi-common.
HoodieMetadataMetrics now uses the Metrics class so that all the reporters
can be supported with it.
Some gaps in the configs added in HoodieMetadataWriteUtils are filled.
Some metrics-related APIs and functionality have been moved to
HoodieMetricsConfig; the HoodieWriteConfig APIs now delegate to
HoodieMetricsConfig for that functionality.
---
 hudi-client/hudi-client-common/pom.xml |  46 -
 .../lock/metrics/HoodieLockMetrics.java|   2 +-
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  98 +-
 .../hudi/metadata/HoodieMetadataWriteUtils.java|   9 +-
 .../org/apache/hudi/metrics/HoodieMetrics.java |   2 +-
 .../cloudwatch/CloudWatchMetricsReporter.java  |  29 ++-
 .../table/action/index/RunIndexActionExecutor.java |   3 +-
 .../hudi/metrics/TestHoodieConsoleMetrics.java |  16 +-
 .../hudi/metrics/TestHoodieGraphiteMetrics.java|  22 ++-
 .../apache/hudi/metrics/TestHoodieJmxMetrics.java  |  19 +-
 .../org/apache/hudi/metrics/TestHoodieMetrics.java |  17 +-
 .../hudi/metrics/TestMetricsReporterFactory.java   |  20 +-
 .../cloudwatch/TestCloudWatchMetricsReporter.java  |  27 ++-
 .../datadog/TestDatadogMetricsReporter.java|  60 +++---
 .../org/apache/hudi/metrics/m3/TestM3Metrics.java  |  54 +++---
 .../metrics/prometheus/TestPrometheusReporter.java |  19 +-
 .../prometheus/TestPushGateWayReporter.java|  52 +++---
 .../FlinkHoodieBackedTableMetadataWriter.java  |   4 +-
 .../JavaHoodieBackedTableMetadataWriter.java   |   4 +-
 .../hudi/client/TestJavaHoodieBackedMetadata.java  |  21 ++-
 .../SparkHoodieBackedTableMetadataWriter.java  |   2 +-
 .../functional/TestHoodieBackedMetadata.java   |  18 +-
 hudi-common/pom.xml|  47 +
 .../hudi/common/config/HoodieCommonConfig.java |   8 +
 .../metrics/HoodieMetricsCloudWatchConfig.java |   0
 .../hudi/config/metrics/HoodieMetricsConfig.java   | 201 +
 .../config/metrics/HoodieMetricsDatadogConfig.java |   0
 .../metrics/HoodieMetricsGraphiteConfig.java   |   0
 .../config/metrics/HoodieMetricsJmxConfig.java |   0
 .../hudi/config/metrics/HoodieMetricsM3Config.java |   0
 .../metrics/HoodieMetricsPrometheusConfig.java |   0
 .../apache/hudi/metadata/BaseTableMetadata.java|   4 +-
 .../hudi/metadata/HoodieMetadataMetrics.java   |  21 ++-
 .../hudi/metrics/ConsoleMetricsReporter.java   |   0
 .../java/org/apache/hudi/metrics/HoodieGauge.java  |   0
 .../hudi/metrics/InMemoryMetricsReporter.java  |   0
 .../apache/hudi/metrics/JmxMetricsReporter.java|   4 +-
 .../org/apache/hudi/metrics/JmxReporterServer.java |   0
 .../java/org/apache/hudi/metrics/MetricUtils.java  |   0
 .../main/java/org/apache/hudi/metrics/Metrics.java |  43 +++--
 .../hudi/metrics/MetricsGraphiteReporter.java  |  16 +-
 .../org/apache/hudi/metrics/MetricsReporter.java   |   0
 .../hudi/metrics/MetricsReporterFactory.java   |  27 ++-
 .../apache/hudi/metrics/MetricsReporterType.java   |   0
 .../custom/CustomizableMetricsReporter.java|   0
 .../hudi/metrics/datadog/DatadogHttpClient.java|   0
 .../metrics/datadog/DatadogMetricsReporter.java|   4 +-
 .../hudi/metrics/datadog/DatadogReporter.java  |   0
 .../apache/hudi/metrics/m3/M3MetricsReporter.java  |  16 +-
 .../hudi/metrics/m3/M3ScopeReporterAdaptor.java|   0
 .../metrics/prometheus/PrometheusReporter.java |  10 +-
 .../prometheus/PushGatewayMetricsReporter.java |  18 +-
 .../metrics/prometheus/PushGatewayReporter.java|   0
 .../AbstractUserDefinedMetricsReporter.java|   0
 .../deltastreamer/HoodieDeltaStreamerMetrics.java  |   8 +-
 .../ingestion/HoodieIngestionMetrics.java  |   7 +-
 .../utilities/streamer/HoodieStreamerMetrics.java  |   5 +
 .../apache/hudi/utilities/streamer/StreamSync.java |   2 +-
 58 files changed, 650 insertions(+), 335 deletions(-)

diff --git a/hudi-client/hudi-client-common/pom.xml 
b/hudi-client/hudi-client-common/pom.xml
index 6caccd0b0a6..022f5d6faa0 100644
--- a/hudi-client/hudi-client-common/pom.xml
+++ b/hudi-client/hudi-client-common/pom.xml
@@ -85,52 +85,6 @@
   0.2.2
 
 
-
-
-  io.dropwizard.metrics
-  metrics-graphite
-  
-
-  com.rabbitmq
-
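
The gist of the commit above: gauges kept only in HoodieMetadataMetrics' in-memory map never reach the configured reporters, while metrics routed through the shared Metrics instance do. A rough sketch of that behaviour with plain Dropwizard metrics, the library these classes wrap (names are illustrative):

    import com.codahale.metrics.ConsoleReporter;
    import com.codahale.metrics.Gauge;
    import com.codahale.metrics.MetricRegistry;

    import java.util.concurrent.TimeUnit;

    public class SharedRegistryExample {
      public static void main(String[] args) throws InterruptedException {
        MetricRegistry registry = new MetricRegistry();
        ConsoleReporter reporter = ConsoleReporter.forRegistry(registry).build();
        reporter.start(1, TimeUnit.SECONDS);

        // A gauge registered on the shared registry is visible to every attached reporter.
        registry.register("metadata.basefile.count", (Gauge<Long>) () -> 42L);

        Thread.sleep(1500); // let the reporter flush once
        reporter.stop();
      }
    }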

(hudi) 20/28: [HUDI-7578] Avoid unnecessary rewriting to improve performance (#11028)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 9bca6153adc1658780f8de3491eaf92364d7a873
Author: Danny Chan 
AuthorDate: Wed Apr 17 11:31:17 2024 +0800

[HUDI-7578] Avoid unnecessary rewriting to improve performance (#11028)
---
 .../src/main/java/org/apache/hudi/io/HoodieMergeHandle.java | 13 +
 .../org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java  |  2 +-
 .../java/org/apache/hudi/io/HoodieSortedMergeHandle.java|  4 ++--
 .../hudi/io/FlinkMergeAndReplaceHandleWithChangeLog.java|  2 +-
 .../org/apache/hudi/io/FlinkMergeHandleWithChangeLog.java   |  2 +-
 .../src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java |  4 
 6 files changed, 14 insertions(+), 13 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
index e40a5585067..749b08c3e7e 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
@@ -103,7 +103,7 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
   protected Map> keyToNewRecords;
   protected Set writtenRecordKeys;
   protected HoodieFileWriter fileWriter;
-  private boolean preserveMetadata = false;
+  protected boolean preserveMetadata = false;
 
   protected Path newFilePath;
   protected Path oldFilePath;
@@ -111,7 +111,6 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
   protected long recordsDeleted = 0;
   protected long updatedRecordsWritten = 0;
   protected long insertRecordsWritten = 0;
-  protected boolean useWriterSchemaForCompaction;
   protected Option keyGeneratorOpt;
   private HoodieBaseFile baseFileToMerge;
 
@@ -142,7 +141,6 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
HoodieBaseFile dataFileToBeMerged, 
TaskContextSupplier taskContextSupplier, Option 
keyGeneratorOpt) {
 super(config, instantTime, partitionPath, fileId, hoodieTable, 
taskContextSupplier);
 this.keyToNewRecords = keyToNewRecords;
-this.useWriterSchemaForCompaction = true;
 this.preserveMetadata = true;
 init(fileId, this.partitionPath, dataFileToBeMerged);
 validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
@@ -279,7 +277,7 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
   }
 
   protected void writeInsertRecord(HoodieRecord newRecord) throws 
IOException {
-Schema schema = useWriterSchemaForCompaction ? writeSchemaWithMetaFields : 
writeSchema;
+Schema schema = preserveMetadata ? writeSchemaWithMetaFields : writeSchema;
 // just skip the ignored record
 if (newRecord.shouldIgnore(schema, config.getProps())) {
   return;
@@ -308,7 +306,7 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
 }
 try {
   if (combineRecord.isPresent() && !combineRecord.get().isDelete(schema, 
config.getProps()) && !isDelete) {
-writeToFile(newRecord.getKey(), combineRecord.get(), schema, prop, 
preserveMetadata && useWriterSchemaForCompaction);
+writeToFile(newRecord.getKey(), combineRecord.get(), schema, prop, 
preserveMetadata);
 recordsWritten++;
   } else {
 recordsDeleted++;
@@ -335,7 +333,7 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
*/
   public void write(HoodieRecord oldRecord) {
 Schema oldSchema = config.populateMetaFields() ? writeSchemaWithMetaFields 
: writeSchema;
-Schema newSchema = useWriterSchemaForCompaction ? 
writeSchemaWithMetaFields : writeSchema;
+Schema newSchema = preserveMetadata ? writeSchemaWithMetaFields : 
writeSchema;
 boolean copyOldRecord = true;
 String key = oldRecord.getRecordKey(oldSchema, keyGeneratorOpt);
 TypedProperties props = config.getPayloadConfig().getProps();
@@ -384,8 +382,7 @@ public class HoodieMergeHandle extends 
HoodieWriteHandle
 // NOTE: `FILENAME_METADATA_FIELD` has to be rewritten to correctly point 
to the
 //   file holding this record even in cases when overall metadata is 
preserved
 MetadataValues metadataValues = new 
MetadataValues().setFileName(newFilePath.getName());
-HoodieRecord populatedRecord =
-record.prependMetaFields(schema, writeSchemaWithMetaFields, 
metadataValues, prop);
+HoodieRecord populatedRecord = record.prependMetaFields(schema, 
writeSchemaWithMetaFields, metadataValues, prop);
 
 if (shouldPreserveRecordMetadata) {
   fileWriter.write(key.getRecordKey(), populatedRecord, 
writeSchemaWithMetaFields);
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChange

(hudi) 12/28: [HUDI-7616] Avoid multiple cleaner plans and deprecate hoodie.clean.allow.multiple (#11013)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit b5b04af9d1e99c60ebbf2f07a57a4fc0c70c92c7
Author: Y Ethan Guo 
AuthorDate: Sat Apr 13 19:04:44 2024 -0700

[HUDI-7616] Avoid multiple cleaner plans and deprecate 
hoodie.clean.allow.multiple (#11013)
---
 .../src/main/java/org/apache/hudi/config/HoodieCleanConfig.java | 4 +++-
 .../src/test/java/org/apache/hudi/table/TestCleaner.java| 6 +++---
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java
index a4114152023..e023bee4274 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java
@@ -167,11 +167,13 @@ public class HoodieCleanConfig extends HoodieConfig {
   + "execution is slow due to limited parallelism, you can increase 
this to tune the "
   + "performance..");
 
+  @Deprecated
   public static final ConfigProperty ALLOW_MULTIPLE_CLEANS = 
ConfigProperty
   .key("hoodie.clean.allow.multiple")
-  .defaultValue(true)
+  .defaultValue(false)
   .markAdvanced()
   .sinceVersion("0.11.0")
+  .deprecatedAfter("1.0.0")
   .withDocumentation("Allows scheduling/executing multiple cleans by 
enabling this config. If users prefer to strictly ensure clean requests should 
be mutually exclusive, "
   + ".i.e. a 2nd clean will not be scheduled if another clean is not 
yet completed to avoid repeat cleaning of same files, they might want to 
disable this config.");
 
diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
index b18238f3392..6a8ce948373 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
@@ -593,13 +593,13 @@ public class TestCleaner extends HoodieCleanerTestBase {
   timeline = metaClient.reloadActiveTimeline();
 
   assertEquals(0, cleanStats.size(), "Must not clean any files");
-  assertEquals(1, timeline.getTimelineOfActions(
+  assertEquals(0, timeline.getTimelineOfActions(
   
CollectionUtils.createSet(HoodieTimeline.CLEAN_ACTION)).filterInflightsAndRequested().countInstants());
   assertEquals(0, timeline.getTimelineOfActions(
   
CollectionUtils.createSet(HoodieTimeline.CLEAN_ACTION)).filterInflights().countInstants());
-  assertEquals(--cleanCount, timeline.getTimelineOfActions(
+  assertEquals(cleanCount, timeline.getTimelineOfActions(
   
CollectionUtils.createSet(HoodieTimeline.CLEAN_ACTION)).filterCompletedInstants().countInstants());
-  assertTrue(timeline.getTimelineOfActions(
+  assertFalse(timeline.getTimelineOfActions(
   
CollectionUtils.createSet(HoodieTimeline.CLEAN_ACTION)).filterInflightsAndRequested().containsInstant(makeNewCommitTime(--instantClean,
 "%09d")));
 }
   }
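
For context, with the default now flipped to false, a writer that still wants overlapping clean schedules has to opt in explicitly. A minimal sketch of doing that through the write config (the table path and name are placeholders, and the property is set via its raw key, which this commit leaves unchanged):

    import java.util.Properties;

    import org.apache.hudi.config.HoodieCleanConfig;
    import org.apache.hudi.config.HoodieWriteConfig;

    public class AllowMultipleCleansExample {
      public static void main(String[] args) {
        Properties props = new Properties();
        // Opt back in to multiple concurrent clean schedules despite the new default.
        props.setProperty(HoodieCleanConfig.ALLOW_MULTIPLE_CLEANS.key(), "true");

        HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder()
            .withPath("/tmp/hudi_trips")      // placeholder base path
            .forTable("trips")                // placeholder table name
            .withProps(props)
            .build();

        System.out.println(writeConfig.getProps().getProperty(HoodieCleanConfig.ALLOW_MULTIPLE_CLEANS.key()));
      }
    }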



(hudi) 03/28: [HUDI-7597] Add logs of Kafka offsets when the checkpoint is out of bound (#10987)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit a1d601c5866712ce0a97d8f0238b084eeb883c90
Author: Y Ethan Guo 
AuthorDate: Wed Apr 10 03:03:45 2024 -0700

[HUDI-7597] Add logs of Kafka offsets when the checkpoint is out of bound 
(#10987)

* [HUDI-7597] Add logs of Kafka offsets when the checkpoint is out of bound

* Adjust test
---
 .../utilities/sources/helpers/KafkaOffsetGen.java  | 29 +++---
 .../utilities/sources/BaseTestKafkaSource.java | 16 ++--
 2 files changed, 27 insertions(+), 18 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
index 442046cd948..71fe7a7629a 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
@@ -331,24 +331,35 @@ public class KafkaOffsetGen {
 
   /**
* Fetch checkpoint offsets for each partition.
-   * @param consumer instance of {@link KafkaConsumer} to fetch offsets from.
+   *
+   * @param consumer  instance of {@link KafkaConsumer} to fetch 
offsets from.
* @param lastCheckpointStr last checkpoint string.
-   * @param topicPartitions set of topic partitions.
+   * @param topicPartitions   set of topic partitions.
* @return a map of Topic partitions to offsets.
*/
   private Map<TopicPartition, Long> fetchValidOffsets(KafkaConsumer consumer,
-    Option<String> lastCheckpointStr, Set<TopicPartition> topicPartitions) {
+      Option<String> lastCheckpointStr, Set<TopicPartition> topicPartitions) {
     Map<TopicPartition, Long> earliestOffsets = consumer.beginningOffsets(topicPartitions);
     Map<TopicPartition, Long> checkpointOffsets = CheckpointUtils.strToOffsets(lastCheckpointStr.get());
-    boolean isCheckpointOutOfBounds = checkpointOffsets.entrySet().stream()
-        .anyMatch(offset -> offset.getValue() < earliestOffsets.get(offset.getKey()));
+    List<TopicPartition> outOfBoundPartitionList = checkpointOffsets.entrySet().stream()
+        .filter(offset -> offset.getValue() < earliestOffsets.get(offset.getKey()))
+        .map(Map.Entry::getKey)
+        .collect(Collectors.toList());
+    boolean isCheckpointOutOfBounds = !outOfBoundPartitionList.isEmpty();
+
 if (isCheckpointOutOfBounds) {
+  String outOfBoundOffsets = outOfBoundPartitionList.stream()
+  .map(p -> p.toString() + ":{checkpoint=" + checkpointOffsets.get(p)
+  + ",earliestOffset=" + earliestOffsets.get(p) + "}")
+  .collect(Collectors.joining(","));
+  String message = "Some data may have been lost because they are not 
available in Kafka any more;"
+  + " either the data was aged out by Kafka or the topic may have been 
deleted before all the data in the topic was processed. "
+  + "Kafka partitions that have out-of-bound checkpoints: " + 
outOfBoundOffsets + " .";
+
   if (getBooleanWithAltKeys(this.props, 
KafkaSourceConfig.ENABLE_FAIL_ON_DATA_LOSS)) {
-throw new HoodieStreamerException("Some data may have been lost 
because they are not available in Kafka any more;"
-+ " either the data was aged out by Kafka or the topic may have 
been deleted before all the data in the topic was processed.");
+throw new HoodieStreamerException(message);
   } else {
-LOG.warn("Some data may have been lost because they are not available 
in Kafka any more;"
-+ " either the data was aged out by Kafka or the topic may have 
been deleted before all the data in the topic was processed."
+LOG.warn(message
 + " If you want Hudi Streamer to fail on such cases, set \"" + 
KafkaSourceConfig.ENABLE_FAIL_ON_DATA_LOSS.key() + "\" to \"true\".");
   }
 }
diff --git 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java
 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java
index c5fc7bfaafa..e45d10e7a61 100644
--- 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java
+++ 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/BaseTestKafkaSource.java
@@ -53,6 +53,7 @@ import static 
org.apache.hudi.utilities.config.KafkaSourceConfig.ENABLE_KAFKA_CO
 import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.junit.jupiter.api.Assertions.assertNotNull;
 import static org.junit.jupiter.api.Assertions.assertThrows;
+import static org.junit.jupiter.api.Assertions.assertTrue;
 import static org.mockito.Mockito.mock;
 import static org.mockito.Mockito.when;
 
@@ -254,7 +255,7 @@ abstract class BaseTestKafkaSource extends 
SparkClientFunctionalTestHarness {
 final S
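
The substance of the change is replacing a single "any partition is behind" boolean with a per-partition report that names each out-of-bound partition and its offsets. A self-contained sketch of that comparison, with plain maps standing in for the consumer call and made-up partition names and offsets:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class OutOfBoundOffsetsExample {
      public static void main(String[] args) {
        // Earliest offsets Kafka still retains (normally from consumer.beginningOffsets).
        Map<String, Long> earliestOffsets = new HashMap<>();
        earliestOffsets.put("trips-0", 120L);
        earliestOffsets.put("trips-1", 0L);

        // Offsets recorded in the last Hudi checkpoint.
        Map<String, Long> checkpointOffsets = new HashMap<>();
        checkpointOffsets.put("trips-0", 80L);   // behind the earliest retained offset
        checkpointOffsets.put("trips-1", 42L);

        List<String> outOfBound = checkpointOffsets.entrySet().stream()
            .filter(e -> e.getValue() < earliestOffsets.get(e.getKey()))
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());

        String report = outOfBound.stream()
            .map(p -> p + ":{checkpoint=" + checkpointOffsets.get(p)
                + ",earliestOffset=" + earliestOffsets.get(p) + "}")
            .collect(Collectors.joining(","));
        // Prints: trips-0:{checkpoint=80,earliestOffset=120}
        System.out.println("Out-of-bound partitions: " + report);
      }
    }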

(hudi) 11/28: [HUDI-7378] Fix Spark SQL DML with custom key generator (#10615)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 3e6bb0d77bc5d8060a307c5e291f164754498b61
Author: Y Ethan Guo 
AuthorDate: Fri Apr 12 22:51:03 2024 -0700

[HUDI-7378] Fix Spark SQL DML with custom key generator (#10615)
---
 .../factory/HoodieSparkKeyGeneratorFactory.java|   4 +
 .../org/apache/hudi/util/SparkKeyGenUtils.scala|  16 +-
 .../scala/org/apache/hudi/HoodieWriterUtils.scala  |  20 +-
 .../spark/sql/hudi/ProvidesHoodieConfig.scala  |  60 ++-
 .../spark/sql/hudi/TestProvidesHoodieConfig.scala  |  79 +++
 .../hudi/command/MergeIntoHoodieTableCommand.scala |   5 +-
 .../TestSparkSqlWithCustomKeyGenerator.scala   | 571 +
 7 files changed, 742 insertions(+), 13 deletions(-)

diff --git 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/factory/HoodieSparkKeyGeneratorFactory.java
 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/factory/HoodieSparkKeyGeneratorFactory.java
index 1ea5adcd6b4..dcc2eaec9eb 100644
--- 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/factory/HoodieSparkKeyGeneratorFactory.java
+++ 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/factory/HoodieSparkKeyGeneratorFactory.java
@@ -79,6 +79,10 @@ public class HoodieSparkKeyGeneratorFactory {
 
   public static KeyGenerator createKeyGenerator(TypedProperties props) throws 
IOException {
 String keyGeneratorClass = getKeyGeneratorClassName(props);
+return createKeyGenerator(keyGeneratorClass, props);
+  }
+
+  public static KeyGenerator createKeyGenerator(String keyGeneratorClass, 
TypedProperties props) throws IOException {
 boolean autoRecordKeyGen = 
KeyGenUtils.isAutoGeneratedRecordKeysEnabled(props)
 //Need to prevent overwriting the keygen for spark sql merge into 
because we need to extract
 //the recordkey from the meta cols if it exists. Sql keygen will use 
pkless keygen if needed.
diff --git 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/util/SparkKeyGenUtils.scala
 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/util/SparkKeyGenUtils.scala
index 7b91ae5a728..bd094464096 100644
--- 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/util/SparkKeyGenUtils.scala
+++ 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/util/SparkKeyGenUtils.scala
@@ -21,8 +21,8 @@ import org.apache.hudi.common.config.TypedProperties
 import org.apache.hudi.common.util.StringUtils
 import org.apache.hudi.common.util.ValidationUtils.checkArgument
 import org.apache.hudi.keygen.constant.KeyGeneratorOptions
-import org.apache.hudi.keygen.{AutoRecordKeyGeneratorWrapper, 
AutoRecordGenWrapperKeyGenerator, CustomAvroKeyGenerator, CustomKeyGenerator, 
GlobalAvroDeleteKeyGenerator, GlobalDeleteKeyGenerator, KeyGenerator, 
NonpartitionedAvroKeyGenerator, NonpartitionedKeyGenerator}
 import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory
+import org.apache.hudi.keygen.{AutoRecordKeyGeneratorWrapper, 
CustomAvroKeyGenerator, CustomKeyGenerator, GlobalAvroDeleteKeyGenerator, 
GlobalDeleteKeyGenerator, KeyGenerator, NonpartitionedAvroKeyGenerator, 
NonpartitionedKeyGenerator}
 
 object SparkKeyGenUtils {
 
@@ -35,6 +35,20 @@ object SparkKeyGenUtils {
 getPartitionColumns(keyGenerator, props)
   }
 
+  /**
+   * @param KeyGenClassNameOption key generator class name if present.
+   * @param props config properties.
+   * @return partition column names only, concatenated by ","
+   */
+  def getPartitionColumns(KeyGenClassNameOption: Option[String], props: 
TypedProperties): String = {
+val keyGenerator = if (KeyGenClassNameOption.isEmpty) {
+  HoodieSparkKeyGeneratorFactory.createKeyGenerator(props)
+} else {
+  
HoodieSparkKeyGeneratorFactory.createKeyGenerator(KeyGenClassNameOption.get, 
props)
+}
+getPartitionColumns(keyGenerator, props)
+  }
+
   /**
* @param keyGen key generator class name
* @return partition columns
diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala
index 0a4ef7a3d63..fade5957210 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala
@@ -197,8 +197,26 @@ object HoodieWriterUtils {
   
diffConfigs.append(s"KeyGenerator:\t$datasourceKeyGen\t$tableConfigKeyGen\n")
 }
 
+// Please note that the validation of partition path fields needs the 
key generator class
+// for the table, since the custom key generator expects a different 
format of
+// the value of the write config 
"hoodie.datasource.write

(hudi) 22/28: [HUDI-7626] Propagate UserGroupInformation from the main thread to the new thread of timeline service threadpool (#11039)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 6df621aec1f1aa09a3bdf71b4703c508bf16ad47
Author: Jing Zhang 
AuthorDate: Wed Apr 17 16:40:29 2024 +0800

[HUDI-7626] Propagate UserGroupInformation from the main thread to the new 
thread of timeline service threadpool (#11039)
---
 .../hudi/timeline/service/RequestHandler.java  | 128 +++--
 1 file changed, 70 insertions(+), 58 deletions(-)

diff --git 
a/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java
 
b/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java
index 9385b4eca9e..12e11db403d 100644
--- 
a/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java
+++ 
b/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java
@@ -52,11 +52,13 @@ import io.javalin.http.Context;
 import io.javalin.http.Handler;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.security.UserGroupInformation;
 import org.jetbrains.annotations.NotNull;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
 import java.io.IOException;
+import java.security.PrivilegedExceptionAction;
 import java.util.Arrays;
 import java.util.List;
 import java.util.Map;
@@ -563,76 +565,86 @@ public class RequestHandler {
 
 private final Handler handler;
 private final boolean performRefreshCheck;
+private final UserGroupInformation ugi;
 
 ViewHandler(Handler handler, boolean performRefreshCheck) {
   this.handler = handler;
   this.performRefreshCheck = performRefreshCheck;
+  try {
+ugi = UserGroupInformation.getCurrentUser();
+  } catch (Exception e) {
+LOG.warn("Fail to get ugi", e);
+throw new HoodieException(e);
+  }
 }
 
 @Override
 public void handle(@NotNull Context context) throws Exception {
-  boolean success = true;
-  long beginTs = System.currentTimeMillis();
-  boolean synced = false;
-  boolean refreshCheck = performRefreshCheck && 
!isRefreshCheckDisabledInQuery(context);
-  long refreshCheckTimeTaken = 0;
-  long handleTimeTaken = 0;
-  long finalCheckTimeTaken = 0;
-  try {
-if (refreshCheck) {
-  long beginRefreshCheck = System.currentTimeMillis();
-  synced = syncIfLocalViewBehind(context);
-  long endRefreshCheck = System.currentTimeMillis();
-  refreshCheckTimeTaken = endRefreshCheck - beginRefreshCheck;
-}
+  ugi.doAs((PrivilegedExceptionAction) () -> {
+boolean success = true;
+long beginTs = System.currentTimeMillis();
+boolean synced = false;
+boolean refreshCheck = performRefreshCheck && 
!isRefreshCheckDisabledInQuery(context);
+long refreshCheckTimeTaken = 0;
+long handleTimeTaken = 0;
+long finalCheckTimeTaken = 0;
+try {
+  if (refreshCheck) {
+long beginRefreshCheck = System.currentTimeMillis();
+synced = syncIfLocalViewBehind(context);
+long endRefreshCheck = System.currentTimeMillis();
+refreshCheckTimeTaken = endRefreshCheck - beginRefreshCheck;
+  }
 
-long handleBeginMs = System.currentTimeMillis();
-handler.handle(context);
-long handleEndMs = System.currentTimeMillis();
-handleTimeTaken = handleEndMs - handleBeginMs;
-
-if (refreshCheck) {
-  long beginFinalCheck = System.currentTimeMillis();
-  if (isLocalViewBehind(context)) {
-String lastKnownInstantFromClient = 
context.queryParamAsClass(RemoteHoodieTableFileSystemView.LAST_INSTANT_TS, 
String.class).getOrDefault(HoodieTimeline.INVALID_INSTANT_TS);
-String timelineHashFromClient = 
context.queryParamAsClass(RemoteHoodieTableFileSystemView.TIMELINE_HASH, 
String.class).getOrDefault("");
-HoodieTimeline localTimeline =
-
viewManager.getFileSystemView(context.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM)).getTimeline();
-if (shouldThrowExceptionIfLocalViewBehind(localTimeline, 
timelineHashFromClient)) {
-  String errMsg =
-  "Last known instant from client was "
-  + lastKnownInstantFromClient
-  + " but server has the following timeline "
-  + localTimeline.getInstants();
-  throw new BadRequestResponse(errMsg);
+  long handleBeginMs = System.currentTimeMillis();
+  handler.handle(context);
+  long handleEndMs = System.currentTimeMillis();
+  handleTimeTaken = handleEndMs - handleBeginMs;
+
+  if (refreshCheck) {
+long beginFinalCheck = System.currentTimeMillis();
+if (isLocalVie
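
The pattern here is to capture the caller's UserGroupInformation once, on the thread that carries the right security context, and then re-enter it with doAs when the request is actually served on a pool thread. A stripped-down sketch of the same idea outside of Javalin (the executor stands in for the timeline service thread pool):

    import java.security.PrivilegedExceptionAction;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.hadoop.security.UserGroupInformation;

    public class UgiPropagationSketch {
      public static void main(String[] args) throws Exception {
        // Captured on the main thread, which has the Kerberos/proxy-user context.
        UserGroupInformation ugi = UserGroupInformation.getCurrentUser();

        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() ->
            // Without doAs, getCurrentUser() on this thread could resolve to the process user.
            ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
              System.out.println("Serving as " + UserGroupInformation.getCurrentUser().getShortUserName());
              return null;
            }));
        pool.shutdown();
      }
    }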

(hudi) 07/28: [HUDI-6441] Passing custom Headers with Hudi Callback URL (#10970)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 804b88d212c3a87b3d8d07dc5dc902bb5d6910b7
Author: Vova Kolmakov 
AuthorDate: Thu Apr 11 21:16:14 2024 +0700

[HUDI-6441] Passing custom Headers with Hudi Callback URL (#10970)
---
 .../http/HoodieWriteCommitHttpCallbackClient.java  |  46 -
 .../config/HoodieWriteCommitCallbackConfig.java|  15 ++
 .../client/http/TestCallbackHttpClient.java| 202 +
 .../hudi/callback/http/TestCallbackHttpClient.java | 143 ---
 4 files changed, 260 insertions(+), 146 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/callback/client/http/HoodieWriteCommitHttpCallbackClient.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/callback/client/http/HoodieWriteCommitHttpCallbackClient.java
index d9248ed20f1..037e84b3d00 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/callback/client/http/HoodieWriteCommitHttpCallbackClient.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/callback/client/http/HoodieWriteCommitHttpCallbackClient.java
@@ -18,6 +18,8 @@
 
 package org.apache.hudi.callback.client.http;
 
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.VisibleForTesting;
 import org.apache.hudi.config.HoodieWriteCommitCallbackConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
 
@@ -34,6 +36,9 @@ import org.slf4j.LoggerFactory;
 
 import java.io.Closeable;
 import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.StringTokenizer;
 
 /**
  * Write commit callback http client.
@@ -43,36 +48,42 @@ public class HoodieWriteCommitHttpCallbackClient implements 
Closeable {
   private static final Logger LOG = 
LoggerFactory.getLogger(HoodieWriteCommitHttpCallbackClient.class);
 
   public static final String HEADER_KEY_API_KEY = "HUDI-CALLBACK-KEY";
+  static final String HEADERS_DELIMITER = ";";
+  static final String HEADERS_KV_DELIMITER = ":";
 
   private final String apiKey;
   private final String url;
   private final CloseableHttpClient client;
   private HoodieWriteConfig writeConfig;
+  private final Map<String, String> customHeaders;
 
   public HoodieWriteCommitHttpCallbackClient(HoodieWriteConfig config) {
 this.writeConfig = config;
 this.apiKey = getApiKey();
 this.url = getUrl();
 this.client = getClient();
+this.customHeaders = parseCustomHeaders();
   }
 
-  public HoodieWriteCommitHttpCallbackClient(String apiKey, String url, CloseableHttpClient client) {
+  public HoodieWriteCommitHttpCallbackClient(String apiKey, String url, CloseableHttpClient client, Map<String, String> customHeaders) {
     this.apiKey = apiKey;
     this.url = url;
     this.client = client;
+    this.customHeaders = customHeaders != null ? customHeaders : new HashMap<>();
   }
 
   public void send(String callbackMsg) {
 HttpPost request = new HttpPost(url);
 request.setHeader(HEADER_KEY_API_KEY, apiKey);
 request.setHeader(HttpHeaders.CONTENT_TYPE, 
ContentType.APPLICATION_JSON.toString());
+customHeaders.forEach(request::setHeader);
 request.setEntity(new StringEntity(callbackMsg, 
ContentType.APPLICATION_JSON));
 try (CloseableHttpResponse response = client.execute(request)) {
   int statusCode = response.getStatusLine().getStatusCode();
   if (statusCode >= 300) {
-LOG.warn(String.format("Failed to send callback message. Response was 
%s", response));
+LOG.warn("Failed to send callback message. Response was {}", response);
   } else {
-LOG.info(String.format("Sent Callback data to %s successfully !", 
url));
+LOG.info("Sent Callback data with {} custom headers to {} successfully 
!", customHeaders.size(), url);
   }
 } catch (IOException e) {
   LOG.warn("Failed to send callback.", e);
@@ -101,8 +112,37 @@ public class HoodieWriteCommitHttpCallbackClient 
implements Closeable {
 return 
writeConfig.getInt(HoodieWriteCommitCallbackConfig.CALLBACK_HTTP_TIMEOUT_IN_SECONDS);
   }
 
+  private Map<String, String> parseCustomHeaders() {
+    Map<String, String> headers = new HashMap<>();
+    String headersString = writeConfig.getString(HoodieWriteCommitCallbackConfig.CALLBACK_HTTP_CUSTOM_HEADERS);
+if (!StringUtils.isNullOrEmpty(headersString)) {
+  StringTokenizer tokenizer = new StringTokenizer(headersString, 
HEADERS_DELIMITER);
+  while (tokenizer.hasMoreTokens()) {
+String token = tokenizer.nextToken();
+if (!StringUtils.isNullOrEmpty(token)) {
+  String[] keyValue = token.split(HEADERS_KV_DELIMITER);
+  if (keyValue.length == 2) {
+String trimKey = keyValue[0].trim();
+String trimValue = keyValue[1].trim();
+if (trimKey.length() > 0 && trimValue.length() > 0) {
+  headers.put(trimKey, trimValue);
+
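
With the delimiters introduced here (";" between headers, ":" between a key and its value), the new HoodieWriteCommitCallbackConfig.CALLBACK_HTTP_CUSTOM_HEADERS option carries a value such as "Authorization:Bearer abc123;X-Tenant:analytics". A small sketch that parses such a value the same way parseCustomHeaders does (the header names and values are made up):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.StringTokenizer;

    public class CallbackHeadersSketch {
      public static void main(String[] args) {
        String headersString = "Authorization:Bearer abc123;X-Tenant:analytics";

        Map<String, String> headers = new HashMap<>();
        StringTokenizer tokenizer = new StringTokenizer(headersString, ";");
        while (tokenizer.hasMoreTokens()) {
          // Entries that do not split into exactly one key and one value are skipped.
          String[] kv = tokenizer.nextToken().split(":");
          if (kv.length == 2 && !kv[0].trim().isEmpty() && !kv[1].trim().isEmpty()) {
            headers.put(kv[0].trim(), kv[1].trim());
          }
        }
        // Prints both headers (map iteration order is not guaranteed).
        System.out.println(headers);
      }
    }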

(hudi) 27/28: [HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage (#11048)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit ab3741cd528595d4e89b6018f12a9038e8d61e63
Author: Y Ethan Guo 
AuthorDate: Tue May 14 17:02:25 2024 -0700

[HUDI-7635] Add default block size and openSeekable APIs to HoodieStorage 
(#11048)

This PR adds `getDefaultBlockSize` and `openSeekable` APIs to
`HoodieStorage` and implements these APIs in `HoodieHadoopStorage`.
The implementation follows the same logic of creating seekable input
stream for log file reading, and `openSeekable` will be used by the log
reading logic.

A few util methods are moved from `FSUtils` and
 `HoodieLogFileReader` classes to the `HadoopFSUtils` class.
---
 .../java/org/apache/hudi/common/fs/FSUtils.java| 18 -
 .../hudi/common/table/log/HoodieLogFileReader.java | 75 +-
 .../org/apache/hudi/hadoop/fs/HadoopFSUtils.java   | 90 ++
 .../hudi/storage/hadoop/HoodieHadoopStorage.java   | 13 
 .../org/apache/hudi/storage/HoodieStorage.java | 30 
 .../hudi/io/storage/TestHoodieStorageBase.java | 43 +++
 6 files changed, 179 insertions(+), 90 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java 
b/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
index 292c2b41946..1b51fd78bfa 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
@@ -667,24 +667,6 @@ public class FSUtils {
 return fs.getUri() + fullPartitionPath.toUri().getRawPath();
   }
 
-  /**
-   * This is due to HUDI-140 GCS has a different behavior for detecting EOF 
during seek().
-   *
-   * @param fs fileSystem instance.
-   * @return true if the inputstream or the wrapped one is of type 
GoogleHadoopFSInputStream
-   */
-  public static boolean isGCSFileSystem(FileSystem fs) {
-return fs.getScheme().equals(StorageSchemes.GCS.getScheme());
-  }
-
-  /**
-   * Chdfs will throw {@code IOException} instead of {@code EOFException}. It 
will cause error in isBlockCorrupted().
-   * Wrapped by {@code BoundedFsDataInputStream}, to check whether the desired 
offset is out of the file size in advance.
-   */
-  public static boolean isCHDFileSystem(FileSystem fs) {
-return StorageSchemes.CHDFS.getScheme().equals(fs.getScheme());
-  }
-
   public static Configuration registerFileSystem(Path file, Configuration 
conf) {
 Configuration returnConf = new Configuration(conf);
 String scheme = HadoopFSUtils.getFs(file.toString(), conf).getScheme();
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
index c1daf5e32d1..062e3639073 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
@@ -37,20 +37,15 @@ import org.apache.hudi.common.util.Option;
 import org.apache.hudi.exception.CorruptedLogFileException;
 import org.apache.hudi.exception.HoodieIOException;
 import org.apache.hudi.exception.HoodieNotSupportedException;
-import org.apache.hudi.hadoop.fs.BoundedFsDataInputStream;
 import org.apache.hudi.hadoop.fs.HadoopSeekableDataInputStream;
-import org.apache.hudi.hadoop.fs.SchemeAwareFSDataInputStream;
-import org.apache.hudi.hadoop.fs.TimedFSDataInputStream;
 import org.apache.hudi.internal.schema.InternalSchema;
 import org.apache.hudi.io.SeekableDataInputStream;
 import org.apache.hudi.io.util.IOUtils;
+import org.apache.hudi.storage.StoragePath;
 import org.apache.hudi.storage.StorageSchemes;
 
 import org.apache.avro.Schema;
 import org.apache.hadoop.conf.Configuration;
-import org.apache.hadoop.fs.BufferedFSInputStream;
-import org.apache.hadoop.fs.FSDataInputStream;
-import org.apache.hadoop.fs.FSInputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.slf4j.Logger;
@@ -67,6 +62,7 @@ import java.util.Objects;
 
 import static org.apache.hudi.common.util.ValidationUtils.checkArgument;
 import static org.apache.hudi.common.util.ValidationUtils.checkState;
+import static org.apache.hudi.hadoop.fs.HadoopFSUtils.getFSDataInputStream;
 
 /**
  * Scans a log file and provides block level iterator on the log file Loads 
the entire block contents in memory Can emit
@@ -479,71 +475,6 @@ public class HoodieLogFileReader implements 
HoodieLogFormat.Reader {
   private static SeekableDataInputStream getDataInputStream(FileSystem fs,
 HoodieLogFile 
logFile,
 int bufferSize) {
-return new HadoopSeekableDataInputStream(getFSDataInputStream(fs, logFile, 
bufferSize));
-  }
-
-  

(hudi) 23/28: [HUDI-4228] Clean up literal usage in Hudi CLI argument check (#11042)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 74f91c704741439f131e9c516119b4927ed99822
Author: Vova Kolmakov 
AuthorDate: Thu Apr 18 09:14:32 2024 +0700

[HUDI-4228] Clean up literal usage in Hudi CLI argument check (#11042)
---
 .../org/apache/hudi/cli/commands/SparkMain.java| 201 +++--
 .../org/apache/hudi/cli/ArchiveExecutorUtils.java  |   2 +-
 2 files changed, 69 insertions(+), 134 deletions(-)

diff --git a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
index 742540d0ff5..c312deaf6c3 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
@@ -19,14 +19,13 @@
 package org.apache.hudi.cli.commands;
 
 import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.cli.ArchiveExecutorUtils;
 import org.apache.hudi.cli.utils.SparkUtil;
 import org.apache.hudi.client.HoodieTimelineArchiver;
 import org.apache.hudi.client.SparkRDDWriteClient;
 import org.apache.hudi.client.common.HoodieSparkEngineContext;
-import org.apache.hudi.common.config.HoodieMetadataConfig;
 import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.engine.HoodieEngineContext;
-import org.apache.hudi.common.model.HoodieAvroPayload;
 import org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy;
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.model.WriteOperationType;
@@ -37,7 +36,6 @@ import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.PartitionPathEncodeUtils;
 import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.common.util.ValidationUtils;
-import org.apache.hudi.config.HoodieArchivalConfig;
 import org.apache.hudi.config.HoodieBootstrapConfig;
 import org.apache.hudi.config.HoodieCleanConfig;
 import org.apache.hudi.config.HoodieIndexConfig;
@@ -99,16 +97,45 @@ public class SparkMain {
* Commands.
*/
   enum SparkCommand {
-BOOTSTRAP, ROLLBACK, DEDUPLICATE, ROLLBACK_TO_SAVEPOINT, SAVEPOINT, 
IMPORT, UPSERT, COMPACT_SCHEDULE, COMPACT_RUN, COMPACT_SCHEDULE_AND_EXECUTE,
-COMPACT_UNSCHEDULE_PLAN, COMPACT_UNSCHEDULE_FILE, COMPACT_VALIDATE, 
COMPACT_REPAIR, CLUSTERING_SCHEDULE,
-CLUSTERING_RUN, CLUSTERING_SCHEDULE_AND_EXECUTE, CLEAN, DELETE_MARKER, 
DELETE_SAVEPOINT, UPGRADE, DOWNGRADE,
-REPAIR_DEPRECATED_PARTITION, RENAME_PARTITION, ARCHIVE
+BOOTSTRAP(18), ROLLBACK(6), DEDUPLICATE(8), ROLLBACK_TO_SAVEPOINT(6), 
SAVEPOINT(7),
+IMPORT(13), UPSERT(13), COMPACT_SCHEDULE(7), COMPACT_RUN(10), 
COMPACT_SCHEDULE_AND_EXECUTE(9),
+COMPACT_UNSCHEDULE_PLAN(9), COMPACT_UNSCHEDULE_FILE(10), 
COMPACT_VALIDATE(7), COMPACT_REPAIR(8),
+CLUSTERING_SCHEDULE(7), CLUSTERING_RUN(9), 
CLUSTERING_SCHEDULE_AND_EXECUTE(8), CLEAN(5),
+DELETE_MARKER(5), DELETE_SAVEPOINT(5), UPGRADE(5), DOWNGRADE(5),
+REPAIR_DEPRECATED_PARTITION(4), RENAME_PARTITION(6), ARCHIVE(8);
+
+private final int minArgsCount;
+
+SparkCommand(int minArgsCount) {
+  this.minArgsCount = minArgsCount;
+}
+
+void assertEq(int factArgsCount) {
+  ValidationUtils.checkArgument(factArgsCount == minArgsCount);
+}
+
+void assertGtEq(int factArgsCount) {
+  ValidationUtils.checkArgument(factArgsCount >= minArgsCount);
+}
+
+List<String> makeConfigs(String[] args) {
+  List<String> configs = new ArrayList<>();
+  if (args.length > minArgsCount) {
+configs.addAll(Arrays.asList(args).subList(minArgsCount, args.length));
+  }
+  return configs;
+}
+
+String getPropsFilePath(String[] args) {
+  return (args.length >= minArgsCount && 
!StringUtils.isNullOrEmpty(args[minArgsCount - 1]))
+  ? args[minArgsCount - 1] : null;
+}
   }
 
-  public static void main(String[] args) throws Exception {
+  public static void main(String[] args) {
 ValidationUtils.checkArgument(args.length >= 4);
 final String commandString = args[0];
-LOG.info("Invoking SparkMain: " + commandString);
+LOG.info("Invoking SparkMain: {}", commandString);
 final SparkCommand cmd = SparkCommand.valueOf(commandString);
 
 JavaSparkContext jsc = SparkUtil.initJavaSparkContext("hoodie-cli-" + 
commandString,
@@ -116,193 +143,112 @@ public class SparkMain {
 
 int returnCode = 0;
 try {
+  cmd.assertGtEq(args.length);
+  List configs = cmd.makeConfigs(args);
+  String propsFilePath = cmd.getPropsFilePath(args);
   switch (cmd) {
 case ROLLBACK:
-  assert (args.length == 6);
+  cmd.assertEq(args.length);
   returnCode = rollback(jsc, args[3], args[4], 
Boolean.parseBoolean(args[5]));
   break;
 case DEDUPLICATE:
-  assert (args.length == 8);
+  cmd.assertEq(a
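
The enum now owns the minimum argument count for each command, so validation and the slicing of trailing configs happen uniformly instead of through scattered literals and asserts. A stand-alone sketch of the same pattern with a made-up two-command enum:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class MinArgsEnumSketch {
      enum Command {
        // Counts are illustrative; SparkMain's enum lists every CLI command with its own count.
        ROLLBACK(6), CLEAN(5);

        private final int minArgsCount;

        Command(int minArgsCount) {
          this.minArgsCount = minArgsCount;
        }

        void assertGtEq(int actual) {
          if (actual < minArgsCount) {
            throw new IllegalArgumentException(name() + " needs at least " + minArgsCount + " args, got " + actual);
          }
        }

        List<String> makeConfigs(String[] args) {
          return args.length > minArgsCount
              ? new ArrayList<>(Arrays.asList(args).subList(minArgsCount, args.length))
              : new ArrayList<>();
        }
      }

      public static void main(String[] args) {
        String[] cli = {"CLEAN", "yarn", "4G", "/tmp/hudi_trips", "props.file", "extra.config=1", "extra.config=2"};
        Command cmd = Command.valueOf(cli[0]);
        cmd.assertGtEq(cli.length);
        System.out.println(cmd.makeConfigs(cli)); // [extra.config=1, extra.config=2]
      }
    }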

(hudi) 28/28: [HUDI-7637] Make StoragePathInfo Comparable (#11050)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit b71f03776ba9922fc0b71ba3b7c9d79ff694fef5
Author: Y Ethan Guo 
AuthorDate: Thu Apr 18 05:51:23 2024 -0700

[HUDI-7637] Make StoragePathInfo Comparable (#11050)
---
 .../main/java/org/apache/hudi/storage/StoragePathInfo.java  |  7 ++-
 .../org/apache/hudi/io/storage/TestStoragePathInfo.java | 13 +
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/hudi-io/src/main/java/org/apache/hudi/storage/StoragePathInfo.java 
b/hudi-io/src/main/java/org/apache/hudi/storage/StoragePathInfo.java
index e4711bf72dd..1c1ebc32a2f 100644
--- a/hudi-io/src/main/java/org/apache/hudi/storage/StoragePathInfo.java
+++ b/hudi-io/src/main/java/org/apache/hudi/storage/StoragePathInfo.java
@@ -31,7 +31,7 @@ import java.io.Serializable;
  * with simplification based on what Hudi needs.
  */
 @PublicAPIClass(maturity = ApiMaturityLevel.EVOLVING)
-public class StoragePathInfo implements Serializable {
+public class StoragePathInfo implements Serializable, Comparable<StoragePathInfo> {
   private final StoragePath path;
   private final long length;
   private final boolean isDirectory;
@@ -109,6 +109,11 @@ public class StoragePathInfo implements Serializable {
 return modificationTime;
   }
 
+  @Override
+  public int compareTo(StoragePathInfo o) {
+return this.getPath().compareTo(o.getPath());
+  }
+
   @Override
   public boolean equals(Object o) {
 if (this == o) {
diff --git 
a/hudi-io/src/test/java/org/apache/hudi/io/storage/TestStoragePathInfo.java 
b/hudi-io/src/test/java/org/apache/hudi/io/storage/TestStoragePathInfo.java
index 72640c5e3df..95cf4d798a4 100644
--- a/hudi-io/src/test/java/org/apache/hudi/io/storage/TestStoragePathInfo.java
+++ b/hudi-io/src/test/java/org/apache/hudi/io/storage/TestStoragePathInfo.java
@@ -71,6 +71,19 @@ public class TestStoragePathInfo {
 }
   }
 
+  @Test
+  public void testCompareTo() {
+StoragePathInfo pathInfo1 = new StoragePathInfo(
+new StoragePath(PATH1), LENGTH, false, BLOCK_REPLICATION, BLOCK_SIZE, 
MODIFICATION_TIME);
+StoragePathInfo pathInfo2 = new StoragePathInfo(
+new StoragePath(PATH1), LENGTH + 2, false, BLOCK_REPLICATION, 
BLOCK_SIZE, MODIFICATION_TIME + 2L);
+StoragePathInfo pathInfo3 = new StoragePathInfo(
+new StoragePath(PATH2), LENGTH, false, BLOCK_REPLICATION, BLOCK_SIZE, 
MODIFICATION_TIME);
+
+assertEquals(0, pathInfo1.compareTo(pathInfo2));
+assertEquals(-1, pathInfo1.compareTo(pathInfo3));
+  }
+
   @Test
   public void testEquals() {
 StoragePathInfo pathInfo1 = new StoragePathInfo(
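
Because compareTo delegates to the path, listings of StoragePathInfo can now be sorted directly. A brief sketch of the ordering this enables (constructor arguments mirror the test above: path, length, isDirectory, block replication, block size, modification time, with arbitrary values):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.apache.hudi.storage.StoragePath;
    import org.apache.hudi.storage.StoragePathInfo;

    public class SortPathInfoSketch {
      public static void main(String[] args) {
        List<StoragePathInfo> listing = new ArrayList<>();
        listing.add(new StoragePathInfo(new StoragePath("/table/p2/file-b"), 10L, false, (short) 1, 128L, 2000L));
        listing.add(new StoragePathInfo(new StoragePath("/table/p1/file-a"), 20L, false, (short) 1, 128L, 1000L));

        Collections.sort(listing); // ordered by path now that StoragePathInfo is Comparable
        listing.forEach(info -> System.out.println(info.getPath()));
      }
    }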



(hudi) 08/28: [HUDI-7605] Allow merger strategy to be set in spark sql writer (#10999)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit ca6c4f96dcc7442ea61b84cb7d852998ef1a11b4
Author: Jon Vexler 
AuthorDate: Thu Apr 11 21:20:07 2024 -0400

[HUDI-7605] Allow merger strategy to be set in spark sql writer (#10999)
---
 .../scala/org/apache/hudi/HoodieSparkSqlWriter.scala |  1 +
 .../apache/hudi/functional/TestMORDataSource.scala   | 20 
 2 files changed, 21 insertions(+)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
index 7020781faf0..ad19ec48c7a 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
@@ -302,6 +302,7 @@ class HoodieSparkSqlWriterInternal {
   .setPartitionMetafileUseBaseFormat(useBaseFormatMetaFile)
   
.setShouldDropPartitionColumns(hoodieConfig.getBooleanOrDefault(HoodieTableConfig.DROP_PARTITION_COLUMNS))
   .setCommitTimezone(timelineTimeZone)
+  
.setRecordMergerStrategy(hoodieConfig.getStringOrDefault(DataSourceWriteOptions.RECORD_MERGER_STRATEGY))
   .initTable(sparkContext.hadoopConfiguration, path)
   }
   val instantTime = HoodieActiveTimeline.createNewInstantTime()
diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
index 45bd3c645d4..b878eb76c40 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
@@ -1403,4 +1403,24 @@ class TestMORDataSource extends 
HoodieSparkClientTestBase with SparkDatasetMixin
   basePath
 }
   }
+
+  @Test
+  def testMergerStrategySet(): Unit = {
+val (writeOpts, _) = getWriterReaderOpts()
+val input = recordsToStrings(dataGen.generateInserts("000", 1)).asScala
+val inputDf= spark.read.json(spark.sparkContext.parallelize(input, 1))
+val mergerStrategyName = "example_merger_strategy"
+inputDf.write.format("hudi")
+  .options(writeOpts)
+  .option(DataSourceWriteOptions.TABLE_TYPE.key, "MERGE_ON_READ")
+  .option(DataSourceWriteOptions.OPERATION.key, 
DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
+  .option(DataSourceWriteOptions.RECORD_MERGER_STRATEGY.key(), 
mergerStrategyName)
+  .mode(SaveMode.Overwrite)
+  .save(basePath)
+metaClient = HoodieTableMetaClient.builder()
+  .setBasePath(basePath)
+  .setConf(spark.sessionState.newHadoopConf)
+  .build()
+assertEquals(metaClient.getTableConfig.getRecordMergerStrategy, 
mergerStrategyName)
+  }
 }
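
On the write path this means the merger strategy can be supplied like any other datasource option and lands in the table config, which the new test asserts. A minimal sketch of the writer side (the option keys are written out here as an assumption in place of the DataSourceWriteOptions constants, and the table name, strategy id, and input Dataset are placeholders):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;

    public class MergerStrategyWriteSketch {
      // df is assumed to already hold the records to insert.
      public static void write(Dataset<Row> df, String basePath) {
        df.write().format("hudi")
            .option("hoodie.table.name", "trips")
            .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
            .option("hoodie.datasource.write.operation", "insert")
            // Assumed key of DataSourceWriteOptions.RECORD_MERGER_STRATEGY.
            .option("hoodie.datasource.write.record.merger.strategy", "example_merger_strategy")
            .mode(SaveMode.Overwrite)
            .save(basePath);
      }
    }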



(hudi) 19/28: [MINOR] Rename location to path in `makeQualified` (#11037)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 6b0c67a5c9f129be285894da0d1bd26926b9571e
Author: Y Ethan Guo 
AuthorDate: Tue Apr 16 18:30:11 2024 -0700

[MINOR] Rename location to path in `makeQualified` (#11037)
---
 .../src/main/java/org/apache/hudi/common/fs/FSUtils.java | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java 
b/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
index 68cc5c131db..292c2b41946 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
@@ -123,14 +123,14 @@ public class FSUtils {
   }
 
   /**
-   * Makes location qualified with {@link HoodieStorage}'s URI.
+   * Makes path qualified with {@link HoodieStorage}'s URI.
*
-   * @param storage  instance of {@link HoodieStorage}.
-   * @param location to be qualified.
-   * @return qualified location, prefixed with the URI of the target 
HoodieStorage object provided.
+   * @param storage instance of {@link HoodieStorage}.
+   * @param pathto be qualified.
+   * @return qualified path, prefixed with the URI of the target HoodieStorage 
object provided.
*/
-  public static StoragePath makeQualified(HoodieStorage storage, StoragePath 
location) {
-return location.makeQualified(storage.getUri());
+  public static StoragePath makeQualified(HoodieStorage storage, StoragePath 
path) {
+return path.makeQualified(storage.getUri());
   }
 
   /**
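
The helper's behavior is unchanged by the rename: it still just prefixes the storage URI onto the given path. For instance (the path literal is a placeholder):

    import org.apache.hudi.common.fs.FSUtils;
    import org.apache.hudi.storage.HoodieStorage;
    import org.apache.hudi.storage.StoragePath;

    public class MakeQualifiedSketch {
      // For a storage whose URI is, say, hdfs://namenode:8020, this would return
      // hdfs://namenode:8020/tmp/hudi_table.
      public static StoragePath qualify(HoodieStorage storage) {
        return FSUtils.makeQualified(storage, new StoragePath("/tmp/hudi_table"));
      }
    }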



(hudi) 18/28: [MINOR] Remove redundant lines in StreamSync and TestStreamSyncUnitTests (#11027)

2024-05-14 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch branch-0.x
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 862575a546b0e02d2be312e64fb12a07f08aba09
Author: Y Ethan Guo 
AuthorDate: Mon Apr 15 21:41:41 2024 -0700

[MINOR] Remove redundant lines in StreamSync and TestStreamSyncUnitTests 
(#11027)
---
 .../apache/hudi/utilities/streamer/StreamSync.java   |  4 
 .../utilities/streamer/TestStreamSyncUnitTests.java  | 20 
 2 files changed, 24 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
index 2b0d94da74a..7e0b97ef570 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
@@ -278,7 +278,6 @@ public class StreamSync implements Serializable, Closeable {
 this.formatAdapter = formatAdapter;
 this.transformer = transformer;
 this.useRowWriter = useRowWriter;
-
   }
 
   @Deprecated
@@ -500,7 +499,6 @@ public class StreamSync implements Serializable, Closeable {
* @return Pair Input data read from upstream 
source, and boolean is true if empty.
* @throws Exception in case of any Exception
*/
-
   public InputBatch readFromSource(String instantTime, HoodieTableMetaClient 
metaClient) throws IOException {
 // Retrieve the previous round checkpoints, if any
 Option resumeCheckpointStr = Option.empty();
@@ -563,7 +561,6 @@ public class StreamSync implements Serializable, Closeable {
 // handle empty batch with change in checkpoint
 hoodieSparkContext.setJobStatus(this.getClass().getSimpleName(), "Checking 
if input is empty: " + cfg.targetTableName);
 
-
 if (useRowWriter) { // no additional processing required for row writer.
   return inputBatch;
 } else {
@@ -1297,5 +1294,4 @@ public class StreamSync implements Serializable, 
Closeable {
   return writeStatusRDD;
 }
   }
-
 }
diff --git 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/TestStreamSyncUnitTests.java
 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/TestStreamSyncUnitTests.java
index 99148eb4b07..c0169ae64b8 100644
--- 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/TestStreamSyncUnitTests.java
+++ 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/streamer/TestStreamSyncUnitTests.java
@@ -17,25 +17,6 @@
  * under the License.
  */
 
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License.  You may obtain a copy of the License at
- *
- *   http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied.  See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
 package org.apache.hudi.utilities.streamer;
 
 import org.apache.hudi.DataSourceWriteOptions;
@@ -75,7 +56,6 @@ import static org.mockito.Mockito.verify;
 import static org.mockito.Mockito.when;
 
 public class TestStreamSyncUnitTests {
-
   @ParameterizedTest
   @MethodSource("testCasesFetchNextBatchFromSource")
   void testFetchNextBatchFromSource(Boolean useRowWriter, Boolean 
hasTransformer, Boolean hasSchemaProvider,


