[GitHub] [hudi] hudi-bot commented on pull request #9362: [HUDI-6644] Flink append mode use auto key generator

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9362:
URL: https://github.com/apache/hudi/pull/9362#issuecomment-1665035180

   
   ## CI report:
   
   * be8d22a885fedffd0baa991470d0e04862b3c380 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19060)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9359: [HUDI-6639] Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9359:
URL: https://github.com/apache/hudi/pull/9359#issuecomment-1665035137

   
   ## CI report:
   
   * 7308946c4344ca04736b2c83b505d3a159146541 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19050)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9347: [HUDI-6638] Upgrade AWS Java SDK to V2

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9347:
URL: https://github.com/apache/hudi/pull/9347#issuecomment-1665035080

   
   ## CI report:
   
   * 5c1391571fbed3ce391399b7848f33b629455941 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19041)
 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19057)
 
   * c24825896bd7893fe303f240b523bd20ae1e7567 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19059)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9199: [HUDI-6534]Support consistent hashing row writer

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9199:
URL: https://github.com/apache/hudi/pull/9199#issuecomment-1665034674

   
   ## CI report:
   
   * 884a71af797b71a7f5818472884e45f39f758328 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19049)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9362: [HUDI-6644] Flink append mode use auto key generator

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9362:
URL: https://github.com/apache/hudi/pull/9362#issuecomment-1664960288

   
   ## CI report:
   
   * be8d22a885fedffd0baa991470d0e04862b3c380 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9360: [MINOR] Upgrade thrift's version to 0.13.0

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9360:
URL: https://github.com/apache/hudi/pull/9360#issuecomment-1664960255

   
   ## CI report:
   
   * 1f04f2b8ec4d73b4cb96229dd5381650152ee1dd Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19056)
 
   * c275941c84095c11d2290d3d9f24f33c91938844 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19058)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9347: [HUDI-6638] Upgrade AWS Java SDK to V2

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9347:
URL: https://github.com/apache/hudi/pull/9347#issuecomment-1664960199

   
   ## CI report:
   
   * 5c1391571fbed3ce391399b7848f33b629455941 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19041)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19057)
 
   * c24825896bd7893fe303f240b523bd20ae1e7567 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9360: [MINOR] Upgrade thrift's version to 0.13.0

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9360:
URL: https://github.com/apache/hudi/pull/9360#issuecomment-1664955794

   
   ## CI report:
   
   * 1f04f2b8ec4d73b4cb96229dd5381650152ee1dd Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19056)
 
   * c275941c84095c11d2290d3d9f24f33c91938844 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9347: [HUDI-6638] Upgrade AWS Java SDK to V2

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9347:
URL: https://github.com/apache/hudi/pull/9347#issuecomment-1664955745

   
   ## CI report:
   
   * 5c1391571fbed3ce391399b7848f33b629455941 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19041)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19057)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6644) Flink append mode use auto key generator

2023-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6644:
-
Labels: pull-request-available  (was: )

> Flink append mode use auto key generator
> 
>
> Key: HUDI-6644
> URL: https://issues.apache.org/jira/browse/HUDI-6644
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: HBG
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hbgstc123 opened a new pull request, #9362: [HUDI-6644] Flink append mode use auto key generator

2023-08-03 Thread via GitHub


hbgstc123 opened a new pull request, #9362:
URL: https://github.com/apache/hudi/pull/9362

   ### Change Logs
   
   Support using the auto key generator in Flink append mode when the user doesn't provide a primary key, to align with Spark.
   
   Add a new class, AutoRowDataKeyGen, used in AppendWriteFunction to support the pk-less case.
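   For illustration, a minimal sketch (table name, schema, path, and the exact option set are assumptions, not part of this PR) of the pk-less append-mode write this change targets, using Flink's Table API from Java:

```
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PkLessAppendSketch {
  public static void main(String[] args) throws Exception {
    TableEnvironment tEnv =
        TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());
    // No PRIMARY KEY is declared; with this change the auto key generator should
    // supply record keys, matching Spark's pk-less behavior.
    tEnv.executeSql(
        "CREATE TABLE hudi_sink ("
        + "  ts TIMESTAMP(3),"
        + "  uuid STRING,"
        + "  rider STRING"
        + ") WITH ("
        + "  'connector' = 'hudi',"
        + "  'path' = 'file:///tmp/hudi_sink',"
        + "  'table.type' = 'COPY_ON_WRITE',"
        + "  'write.operation' = 'insert'"   // COW + insert == append mode
        + ")");
    tEnv.executeSql(
        "INSERT INTO hudi_sink VALUES (TIMESTAMP '2023-08-03 00:00:00', 'id1', 'rider-A')")
        .await();
  }
}
```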
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] kepplertreet opened a new issue, #9361: [SUPPORT] Hudi Merge On Read Tables don't write Delta Log Files

2023-08-03 Thread via GitHub


kepplertreet opened a new issue, #9361:
URL: https://github.com/apache/hudi/issues/9361

   Hi. 
   
   I'm using a Spark Structured Streaming application running on EMR-6.11.0 to write into a Hudi MOR table.
   
   Hudi Version : 0.13.0
   Spark Version : 3.3.2
   ``` 
   'hoodie.table.name': ,
   'hoodie.datasource.write.recordkey.field':  ,
   'hoodie.datasource.write.keygenerator.class': 
'org.apache.hudi.keygen.SimpleKeyGenerator',
   'hoodie.datasource.write.table.type': "MERGE_ON_READ",
   'hoodie.datasource.write.partitionpath.field': ,
   'hoodie.datasource.write.table.name': ,
   'hoodie.datasource.write.precombine.field': ,
   "hoodie.table.version": 5,
   "hoodie.datasource.write.commitmeta.key.prefix": "_",
   "hoodie.datasource.write.hive_style_partitioning": 'true',
   "hoodie.datasource.meta.sync.enable": 'false',
   "hoodie.datasource.hive_sync.enable": 'true',
   "hoodie.datasource.hive_sync.auto_create_database": 'true',
   "hoodie.datasource.hive_sync.skip_ro_suffix": 'true',
   "hoodie.datasource.hive_sync.partition_extractor_class": 
"org.apache.hudi.hive.MultiPartKeysValueExtractor",
   "hoodie.parquet.small.file.limit": 125217728,
   "hoodie.parquet.max.file.size": 134217728,
   
   # Compaction Configs 
   "hoodie.compact.inline" : "false", 
   "hoodie.compact.schedule.inline" : "false", 
   "hoodie.datasource.compaction.async.enable": "true",
   "hoodie.compact.inline.trigger.strategy": "NUM_COMMITS",
   "hoodie.compact.inline.max.delta.commits": 3,
   
   # --- Cleaner Configs  
   "hoodie.clean.automatic": 'true',
   "hoodie.clean.async": 'true',
   "hoodie.cleaner.policy.failed.writes": "LAZY",
   "hoodie.clean.trigger.strategy" : "NUM_COMMITS", 
   "hoodie.clean.max.commits" : 7, 
   "hoodie.cleaner.commits.retained" : 3, 
   "hoodie.cleaner.fileversions.retained": 1, 
   "hoodie.cleaner.hours.retained": 1, 
   "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
   
   "hoodie.parquet.compression.codec": "snappy",
   "hoodie.embed.timeline.server": 'true',
   "hoodie.embed.timeline.server.async": 'false',
   "hoodie.write.concurrency.mode": "OPTIMISTIC_CONCURRENCY_CONTROL",
   "hoodie.write.lock.provider": 
"org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider",
   "hoodie.index.type": "BLOOM",
   "hoodie.datasource.write.streaming.checkpoint.identifier" : 
,
   
   # Metadata Configs 
   "hoodie.metadata.enable": 'true',
   "hoodie.bloom.index.use.metadata": 'true',
   "hoodie.metadata.index.async": 'false',
   "hoodie.metadata.clean.async": 'true',
   "hoodie.metadata.index.bloom.filter.enable": 'true',
   "hoodie.metadata.index.column.stats.enable" : 'true', 
   "hoodie.metadata.index.bloom.filter.column.list": , 
   "hoodie.metadata.index.column.stats.column.list" : ,
   "hoodie.metadata.metrics.enable": 'true', 
   
   "hoodie.keep.max.commits": 50,
   "hoodie.archive.async": 'true',
   "hoodie.archive.merge.enable": 'false',
   "hoodie.archive.beyond.savepoint": 'true',
   "hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",
   "hoodie.cleaner.hours.retained": 1
   ```
   Issues faced: As the configs show, we have OCC and the metadata table enabled for the table.
   My only concern for now is that I never see log files being written into the main table, and hence compaction is never scheduled nor triggered for the main table, i.e. all incoming data is written directly into parquet files. Meanwhile, the metadata table's timeline shows scheduling and execution of compaction, and hence a **commit** is reflected in that timeline.
   
   Is this normal, expected behaviour? Is Hudi internally weighing the cost of writing log files and then compacting them versus writing the data directly to parquet, and choosing whichever turns out less expensive? Is there some defined threshold for ingress batch size beyond which Hudi writes data into log files?
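   One quick way to confirm whether any delta log files were actually written is to scan the table path for Hudi log files. A minimal sketch (the table path is a placeholder) using the Hadoop FileSystem API:

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class LogFileCheck {
  public static void main(String[] args) throws Exception {
    Path tablePath = new Path("/tmp/hudi_mor_table"); // placeholder: your table base path
    FileSystem fs = tablePath.getFileSystem(new Configuration());
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(tablePath, true);
    long logFileCount = 0;
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      // Hudi delta log files carry ".log." in their file names.
      if (status.getPath().getName().contains(".log.")) {
        logFileCount++;
      }
    }
    System.out.println("Delta log files found: " + logFileCount);
  }
}
```

   As a general note, with a BLOOM index on MOR, brand-new records are typically written as base parquet files and log files only appear once existing file groups receive updates, so an insert-heavy stream can legitimately produce no log files.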
   
   
   Thanks  
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] mansipp commented on pull request #9347: [HUDI-6638] Upgrade AWS Java SDK to V2

2023-08-03 Thread via GitHub


mansipp commented on PR #9347:
URL: https://github.com/apache/hudi/pull/9347#issuecomment-1664944883

   @hudi-bot run azure
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9360: [MINOR] Upgrade thrift's version to 0.13.0

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9360:
URL: https://github.com/apache/hudi/pull/9360#issuecomment-1664931647

   
   ## CI report:
   
   * 1f04f2b8ec4d73b4cb96229dd5381650152ee1dd Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19056)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9337: [HUDI-6628] Rely on methods in HoodieBaseFile and HoodieLogFile instead of FSUtils when possible

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9337:
URL: https://github.com/apache/hudi/pull/9337#issuecomment-1664931559

   
   ## CI report:
   
   * 306b6c94e2f4793f91ae9b6ffa3f102c8bc2a18e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19037)
 
   * dbd715a0c9d98e2bbb69823025a6e314338bc17a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19055)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9324: [HUDI-6619] Fix hudi-integ-test-bundle dependency on jackson jsk310 package

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9324:
URL: https://github.com/apache/hudi/pull/9324#issuecomment-1664931493

   
   ## CI report:
   
   * 98e49fad21b4c7b1151e96c7a72b18caf5014a7f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18933)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18949)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18965)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18983)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19014)
 
   * 5f4efeb0992b1fcce899787ddd1182850d51698d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19054)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] stream2000 commented on a diff in pull request #9199: [HUDI-6534]Support consistent hashing row writer

2023-08-03 Thread via GitHub


stream2000 commented on code in PR #9199:
URL: https://github.com/apache/hudi/pull/9199#discussion_r1283899831


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BucketBulkInsertDataInternalWriterHelper.java:
##
@@ -65,7 +71,6 @@ public void write(InternalRow row) throws IOException {
   int bucketId = BucketIdentifier.getBucketId(String.valueOf(recordKey), 
indexKeyFields, bucketNum);
   Pair fileId = Pair.of(partitionPath, bucketId);
   if (lastFileId == null || !lastFileId.equals(fileId)) {
-LOG.info("Creating new file for partition path " + partitionPath);

Review Comment:
   We will not always create a new file handle if we have already created one for this partition, so I deleted this line. I will add it back to `getBucketRowCreateHandle` when we do need to create a new handle.
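   A rough, self-contained sketch of the pattern described above (types and names are simplified stand-ins, not the actual Hudi classes): the handle is cached per partition/bucket, so the "creating new file" log only fires when a handle is really created.

```
import java.util.HashMap;
import java.util.Map;

public class HandleCacheSketch {
  // Illustrative cache of write handles keyed by "partitionPath-bucketId".
  static final Map<String, String> HANDLES = new HashMap<>();

  static String getBucketRowCreateHandle(String partitionPath, int bucketId) {
    return HANDLES.computeIfAbsent(partitionPath + "-" + bucketId, key -> {
      // The relocated log statement: emitted only when a new handle is built.
      System.out.println("Creating new file for partition path " + partitionPath);
      return "handle-for-" + key;
    });
  }

  public static void main(String[] args) {
    getBucketRowCreateHandle("2023/08/03", 1); // logs once
    getBucketRowCreateHandle("2023/08/03", 1); // cache hit, no log
  }
}
```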



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/SparkConsistentBucketClusteringExecutionStrategy.java:
##
@@ -79,15 +94,19 @@ public HoodieData 
performClusteringWithRecordsRDD(HoodieData partitioner = new 
RDDConsistentBucketBulkInsertPartitioner<>(getHoodieTable(), strategyParams, 
preserveHoodieMetadata);
+addHashingChildNodes(partitioner, extraMetadata);
+
+return (HoodieData) SparkBulkInsertHelper.newInstance()
+.bulkInsert(inputRecords, instantTime, getHoodieTable(), newConfig, 
false, partitioner, true, numOutputGroups);
+  }
+
+  private void addHashingChildNodes(ConsistentHashingBucketInsertPartitioner 
partitioner, Map extraMetadata) {
 try {
   List nodes = 
ConsistentHashingNode.fromJsonString(extraMetadata.get(BaseConsistentHashingBucketClusteringPlanStrategy.METADATA_CHILD_NODE_KEY));

Review Comment:
   Will check that the nodes list is non-empty in `addHashingChildrenNodes`.
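   A small sketch (simplified types; the metadata key and parser are stand-ins, not the Hudi API) of the non-empty guard mentioned above:

```
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class AddHashingChildNodesSketch {
  static final List<String> CHILD_NODES = new ArrayList<>();

  // Stand-in for addHashingChildrenNodes: only apply parsed nodes when the list is non-empty.
  static void addHashingChildNodes(Map<String, String> extraMetadata) {
    String json = extraMetadata.get("child.node.key"); // placeholder metadata key
    List<String> nodes = (json == null)
        ? Collections.emptyList()
        : Collections.singletonList(json);             // stand-in for fromJsonString(...)
    if (!nodes.isEmpty()) {
      CHILD_NODES.addAll(nodes);
    }
  }
}
```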



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/ConsistentBucketIndexBulkInsertPartitionerWithRows.java:
##
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.ConsistentHashingNode;
+import org.apache.hudi.common.model.HoodieConsistentHashingMetadata;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.index.bucket.ConsistentBucketIdentifier;
+import org.apache.hudi.index.bucket.ConsistentBucketIndexUtils;
+import org.apache.hudi.index.bucket.HoodieSparkConsistentBucketIndex;
+import org.apache.hudi.keygen.BuiltinKeyGenerator;
+import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.hudi.table.ConsistentHashingBucketInsertPartitioner;
+import org.apache.hudi.table.HoodieTable;
+
+import org.apache.spark.Partitioner;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import scala.Tuple2;
+
+/**
+ * Bulk_insert partitioner of Spark row using consistent hashing bucket index.
+ */
+public class ConsistentBucketIndexBulkInsertPartitionerWithRows
+implements BulkInsertPartitioner>, 
ConsistentHashingBucketInsertPartitioner {
+
+  private final HoodieTable table;
+
+  private final String indexKeyFields;
+
+  private final List fileIdPfxList = new ArrayList<>();
+  private final Map> hashingChildrenNodes;
+
+  private Map partitionToIdentifier;
+
+  private final Option keyGeneratorOpt;
+
+  private Map> partitionToFileIdPfxIdxMap;
+
+  private final RowRecordKeyExtractor extractor;
+
+  public ConsistentBucketIndexBulkInsertPartitionerWithRows(HoodieTable table, 
boolean populateMetaFields) {
+this.indexKeyFields = table.getConfig().getBucketIndexHashField();
+this.table = table;
+this.hashingChildrenNodes = new HashMap<>();
+if (!populateMetaFields) {
+  this.keyGeneratorOpt = 

[GitHub] [hudi] hudi-bot commented on pull request #9360: [MINOR] Upgrade thrift's version to 0.13.0

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9360:
URL: https://github.com/apache/hudi/pull/9360#issuecomment-1664927344

   
   ## CI report:
   
   * 1f04f2b8ec4d73b4cb96229dd5381650152ee1dd UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9324: [HUDI-6619] Fix hudi-integ-test-bundle dependency on jackson jsk310 package

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9324:
URL: https://github.com/apache/hudi/pull/9324#issuecomment-1664927190

   
   ## CI report:
   
   * 98e49fad21b4c7b1151e96c7a72b18caf5014a7f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18933)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18949)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18965)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18983)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19014)
 
   * 5f4efeb0992b1fcce899787ddd1182850d51698d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9349: [MINOR] JSR dependency not used in spark3.3 version

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9349:
URL: https://github.com/apache/hudi/pull/9349#issuecomment-1664927305

   
   ## CI report:
   
   * 7c3142bdb0e1b1c677e61495e42c81e44916e1a0 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19021)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9337: [HUDI-6628] Rely on methods in HoodieBaseFile and HoodieLogFile instead of FSUtils when possible

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9337:
URL: https://github.com/apache/hudi/pull/9337#issuecomment-1664927277

   
   ## CI report:
   
   * 306b6c94e2f4793f91ae9b6ffa3f102c8bc2a18e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19037)
 
   * dbd715a0c9d98e2bbb69823025a6e314338bc17a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on a diff in pull request #9278: [HUDI-6312] Rename enum values of `HollowCommitHandling`

2023-08-03 Thread via GitHub


xushiyan commented on code in PR #9278:
URL: https://github.com/apache/hudi/pull/9278#discussion_r1283941770


##
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##
@@ -78,18 +78,18 @@ public class HoodieCommonConfig extends HoodieConfig {
 
   public static final ConfigProperty 
INCREMENTAL_READ_HANDLE_HOLLOW_COMMIT = ConfigProperty
   .key("hoodie.datasource.read.handle.hollow.commit")
-  .defaultValue(HollowCommitHandling.EXCEPTION.name())
+  .defaultValue(HollowCommitHandling.FAIL.name())

Review Comment:
   @nsivabalan is there a consensus on the name? I'll update it quickly, then we can merge this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9261: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9261:
URL: https://github.com/apache/hudi/pull/9261#issuecomment-1664922934

   
   ## CI report:
   
   * 7e7efc78003b0e8ef5c2d809276796b7b987a35b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19047)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on a diff in pull request #9278: [HUDI-6312] Rename enum values of `HollowCommitHandling`

2023-08-03 Thread via GitHub


xushiyan commented on code in PR #9278:
URL: https://github.com/apache/hudi/pull/9278#discussion_r1283941125


##
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##
@@ -78,18 +78,18 @@ public class HoodieCommonConfig extends HoodieConfig {
 
   public static final ConfigProperty 
INCREMENTAL_READ_HANDLE_HOLLOW_COMMIT = ConfigProperty
   .key("hoodie.datasource.read.handle.hollow.commit")
-  .defaultValue(HollowCommitHandling.EXCEPTION.name())
+  .defaultValue(HollowCommitHandling.FAIL.name())
   .sinceVersion("0.14.0")
   .markAdvanced()
   .withValidValues(enumNames(HollowCommitHandling.class))
   .withDocumentation("When doing incremental queries, there could be 
hollow commits (requested or inflight commits that are not the latest)"

Review Comment:
   The time travel query related change is in a follow-up PR; this one is meant only for the renaming.
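   For reference, a minimal sketch (paths, instant time, and the surrounding options are placeholders) of a Spark incremental read that sets the renamed value through the config key shown in this diff:

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HollowCommitReadSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hollow-commit-demo")
        .master("local[*]")
        .getOrCreate();
    Dataset<Row> df = spark.read()
        .format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "20230801000000000") // placeholder
        // FAIL is the renamed default proposed here; valid values come from HollowCommitHandling.
        .option("hoodie.datasource.read.handle.hollow.commit", "FAIL")
        .load("file:///tmp/hudi_table"); // placeholder table path
    df.show();
  }
}
```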



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6644) Flink append mode use auto key generator

2023-08-03 Thread HBG (Jira)
HBG created HUDI-6644:
-

 Summary: Flink append mode use auto key generator
 Key: HUDI-6644
 URL: https://issues.apache.org/jira/browse/HUDI-6644
 Project: Apache Hudi
  Issue Type: Task
Reporter: HBG






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] xuzifu666 commented on pull request #9349: [MINOR] JSR dependency not used in spark3.3 version

2023-08-03 Thread via GitHub


xuzifu666 commented on PR #9349:
URL: https://github.com/apache/hudi/pull/9349#issuecomment-1664917538

   > @xuzifu666 why close this?
   
   Sorry, I wanted to re-run CI @xushiyan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on pull request #9349: [MINOR] JSR dependency not used in spark3.3 version

2023-08-03 Thread via GitHub


xushiyan commented on PR #9349:
URL: https://github.com/apache/hudi/pull/9349#issuecomment-1664916791

   @xuzifu666 why close this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wecharyu commented on a diff in pull request #9343: Revert "[HUDI-6476] Improve the performance of getAllPartitionPaths (#9121)"

2023-08-03 Thread via GitHub


wecharyu commented on code in PR #9343:
URL: https://github.com/apache/hudi/pull/9343#discussion_r1283930796


##
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java:
##
@@ -168,57 +167,66 @@ private List 
getPartitionPathWithPathPrefixUsingFilterExpression(String
   // TODO: Get the parallelism from HoodieWriteConfig
   int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, 
pathsToList.size());
 
-  // List all directories in parallel:
-  // if current dictionary contains PartitionMetadata, add it to result
-  // if current dictionary does not contain PartitionMetadata, add its 
subdirectory to queue to be processed.
+  // List all directories in parallel
   engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all 
partitions with prefix " + relativePathPrefix);
-  // result below holds a list of pair. first entry in the pair optionally 
holds the deduced list of partitions.
-  // and second entry holds optionally a directory path to be processed 
further.
-  List, Option>> result = 
engineContext.flatMap(pathsToList, path -> {
+  List dirToFileListing = engineContext.flatMap(pathsToList, 
path -> {
 FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) {
-  return 
Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(),
 path)), Option.empty()));
-}
-return Arrays.stream(fileSystem.listStatus(path, p -> {
-  try {
-return fileSystem.isDirectory(p) && 
!p.getName().equals(HoodieTableMetaClient.METAFOLDER_NAME);
-  } catch (IOException e) {
-// noop
-  }
-  return false;
-})).map(status -> Pair.of(Option.empty(), 
Option.of(status.getPath(;
+return Arrays.stream(fileSystem.listStatus(path));
   }, listingParallelism);
   pathsToList.clear();
 
-  partitionPaths.addAll(result.stream().filter(entry -> 
entry.getKey().isPresent())
-  .map(entry -> entry.getKey().get())
-  .filter(relativePartitionPath -> fullBoundExpr instanceof 
Predicates.TrueExpression
-  || (Boolean) fullBoundExpr.eval(
-  extractPartitionValues(partitionFields, 
relativePartitionPath, urlEncodePartitioningEnabled)))
-  .collect(Collectors.toList()));
-
-  Expression partialBoundExpr;
-  // If partitionPaths is nonEmpty, we're already at the last path level, 
and all paths
-  // are filtered already.
-  if (needPushDownExpressions && partitionPaths.isEmpty()) {
-// Here we assume the path level matches the number of partition 
columns, so we'll rebuild
-// new schema based on current path level.
-// e.g. partition columns are , if we're listing the 
second level, then
-// currentSchema would be 
-// `PartialBindVisitor` will bind reference if it can be found from 
`currentSchema`, otherwise
-// will change the expression to `alwaysTrue`. Can see 
`PartialBindVisitor` for details.
-Types.RecordType currentSchema = 
Types.RecordType.get(partitionFields.fields().subList(0, 
++currentPartitionLevel));
-PartialBindVisitor partialBindVisitor = new 
PartialBindVisitor(currentSchema, caseSensitive);
-partialBoundExpr = pushedExpr.accept(partialBindVisitor);
-  } else {
-partialBoundExpr = Predicates.alwaysTrue();
-  }
+  // if current dictionary contains PartitionMetadata, add it to result
+  // if current dictionary does not contain PartitionMetadata, add it to 
queue to be processed.
+  int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, 
dirToFileListing.size());
+  if (!dirToFileListing.isEmpty()) {
+// result below holds a list of pair. first entry in the pair 
optionally holds the deduced list of partitions.
+// and second entry holds optionally a directory path to be processed 
further.
+engineContext.setJobStatus(this.getClass().getSimpleName(), 
"Processing listed partitions");
+List, Option>> result = 
engineContext.map(dirToFileListing, fileStatus -> {
+  FileSystem fileSystem = 
fileStatus.getPath().getFileSystem(hadoopConf.get());
+  if (fileStatus.isDirectory()) {
+if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, 
fileStatus.getPath())) {

Review Comment:
   Got it. Let me investigate why the latency happened.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] eric9204 opened a new pull request, #9360: [MINOR] Upgrade thrift's version to 0.13.0

2023-08-03 Thread via GitHub


eric9204 opened a new pull request, #9360:
URL: https://github.com/apache/hudi/pull/9360

   ### Change Logs
   
   Upgrade thrift's version from 0.12.0 to 0.13.0.
   
   There is an issue in thrift-0.12.0:
   https://issues.apache.org/jira/browse/THRIFT-4805 
   
   It will cause the following problem:
   
   
![error](https://github.com/apache/hudi/assets/90449228/3e6fa891-4bc5-4385-b5a7-d4fed4a26131)
   
   This error message has nothing to do with the program, but if there is a program connected to the Hoodie metaserver, this error log message will be continuously output.
   
   ### Impact
   
   None
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   None
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9357: [HUDI-6588] Fix duplicate fileId on TM failover and recovery

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9357:
URL: https://github.com/apache/hudi/pull/9357#issuecomment-1664898540

   
   ## CI report:
   
   * d8e159b823f516f584802bd3dacdaa782f185854 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19038)
 
   * c5e0ade2fbd9a9f5d0d7336c138de99b697b07cb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19053)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9357: [HUDI-6588] Fix duplicate fileId on TM failover and recovery

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9357:
URL: https://github.com/apache/hudi/pull/9357#issuecomment-1664893935

   
   ## CI report:
   
   * d8e159b823f516f584802bd3dacdaa782f185854 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19038)
 
   * c5e0ade2fbd9a9f5d0d7336c138de99b697b07cb UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9330: [HUDI-6622] Reuse the table config from HoodieTableMetaClient in the …

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9330:
URL: https://github.com/apache/hudi/pull/9330#issuecomment-1664893863

   
   ## CI report:
   
   * 38aec912160b7531914cd4c07ea8317606f34616 UNKNOWN
   * d6d32a693c455830a31b883915e9940fa309c77f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19026)
 
   * df00f708b0143c2a3ce5dec86ebcd69c26b91a1b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19051)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #9308: [HUDI-6606] Use record level index with SQL equality queries

2023-08-03 Thread via GitHub


nsivabalan commented on code in PR #9308:
URL: https://github.com/apache/hudi/pull/9308#discussion_r1283913819


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestRecordLevelIndexWithSQL.scala:
##
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.functional
+
+import org.apache.hudi.{DataSourceReadOptions, DataSourceWriteOptions}
+import org.apache.spark.sql.SaveMode
+import org.junit.jupiter.api.Tag
+import org.junit.jupiter.params.ParameterizedTest
+import org.junit.jupiter.params.provider.ValueSource
+
+@Tag("functional")
+class TestRecordLevelIndexWithSQL extends RecordLevelIndexTestBase {
+  val sqlTempTable = "tbl"
+
+  @ParameterizedTest
+  @ValueSource(strings = Array("COPY_ON_WRITE"))
+  def testRLIWithSQL(tableType: String): Unit = {
+var hudiOpts = commonOpts
+hudiOpts = hudiOpts + (
+  DataSourceWriteOptions.TABLE_TYPE.key -> tableType,
+  DataSourceReadOptions.ENABLE_DATA_SKIPPING.key -> "true")
+
+doWriteAndValidateDataAndRecordIndex(hudiOpts,
+  operation = DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL,
+  saveMode = SaveMode.Overwrite,
+  validate = false)
+doWriteAndValidateDataAndRecordIndex(hudiOpts,
+  operation = DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
+  saveMode = SaveMode.Append,
+  validate = false)
+
+createTempTable(hudiOpts)
+val reckey = 
mergedDfList.last.limit(1).collect()(0).getAs("_row_key").toString
+spark.sql("select * from " + sqlTempTable + " where '" + reckey + "' = 
_row_key").show(false)

Review Comment:
   This test does not catch anything.
   Can you write tests directly against the RLI support?
   Let's also think about how we can write an end-to-end test and validate that RLI is invoked.
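   A rough sketch (in Java for illustration; the real test is in Scala, and the exact assertion is only one possibility) of making the lookup actually assert something:

```
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.apache.spark.sql.SparkSession;

public class RliEqualityAssertionSketch {
  // The SQL equality lookup on the record key should return exactly the one matching row.
  static void assertEqualityLookup(SparkSession spark, String tempTable, String recKey) {
    long count = spark
        .sql("select * from " + tempTable + " where _row_key = '" + recKey + "'")
        .count();
    assertEquals(1, count);
  }
}
```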
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9330: [HUDI-6622] Reuse the table config from HoodieTableMetaClient in the …

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9330:
URL: https://github.com/apache/hudi/pull/9330#issuecomment-1664888349

   
   ## CI report:
   
   * 38aec912160b7531914cd4c07ea8317606f34616 UNKNOWN
   * d6d32a693c455830a31b883915e9940fa309c77f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19026)
 
   * df00f708b0143c2a3ce5dec86ebcd69c26b91a1b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #9308: [HUDI-6606] Use record level index with SQL equality queries

2023-08-03 Thread via GitHub


nsivabalan commented on code in PR #9308:
URL: https://github.com/apache/hudi/pull/9308#discussion_r1283907863


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -223,10 +225,13 @@ case class HoodieFileIndex(spark: SparkSession,
 //  nothing CSI in particular could be applied for)
 lazy val queryReferencedColumns = collectReferencedColumns(spark, 
queryFilters, schema)
 
-if (!isMetadataTableEnabled || !isDataSkippingEnabled || 
!columnStatsIndex.isIndexAvailable) {
+if (!isMetadataTableEnabled || !isDataSkippingEnabled) {
   validateConfig()
   Option.empty
-} else if (queryFilters.isEmpty || queryReferencedColumns.isEmpty) {
+} else if (recordLevelIndex.isIndexApplicable(queryFilters)) {
+  Option.apply(recordLevelIndex.getCandidateFiles(allFiles, queryFilters))
+} else if (!columnStatsIndex.isIndexAvailable || queryFilters.isEmpty || 
queryReferencedColumns.isEmpty) {
+  validateConfig()

Review Comment:
   Let's at least fix the filtering such that for the record key alone we look up the RLI, but if there are other columns, we also route to col stats, and combine both together.
   Currently, this seems very limited.
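   A rough sketch (helper types and names are hypothetical, not the Hudi API) of the routing being asked for: record-key predicates are looked up via the record level index, remaining predicates go through column stats, and the two candidate-file sets are intersected.

```
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CandidateFilePruningSketch {
  // Hypothetical predicate abstraction: true if the predicate only references the record key.
  interface Filter {
    boolean isRecordKeyOnly();
  }

  static Set<String> prune(List<Filter> queryFilters,
                           Set<String> rliCandidates,        // files surviving record level index lookup
                           Set<String> colStatsCandidates) { // files surviving column stats pruning
    boolean hasRecordKeyFilter = queryFilters.stream().anyMatch(Filter::isRecordKeyOnly);
    boolean hasOtherFilters = queryFilters.stream().anyMatch(f -> !f.isRecordKeyOnly());
    if (hasRecordKeyFilter && hasOtherFilters) {
      Set<String> combined = new HashSet<>(rliCandidates);
      combined.retainAll(colStatsCandidates); // a file must survive both prunings
      return combined;
    }
    return hasRecordKeyFilter ? rliCandidates : colStatsCandidates;
  }
}
```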
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-6640) Non-blocking concurrency control

2023-08-03 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen reassigned HUDI-6640:


Assignee: Jing Zhang

> Non-blocking concurrency control
> 
>
> Key: HUDI-6640
> URL: https://issues.apache.org/jira/browse/HUDI-6640
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: core
>Reporter: Danny Chen
>Assignee: Jing Zhang
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6643) Make the compaction non-serial (plan schedule and execution)

2023-08-03 Thread Danny Chen (Jira)
Danny Chen created HUDI-6643:


 Summary: Make the compaction non-serial (plan schedule and 
execution)
 Key: HUDI-6643
 URL: https://issues.apache.org/jira/browse/HUDI-6643
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: core
Reporter: Danny Chen
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6480) Flink lockless multi-writer

2023-08-03 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-6480:
-
Parent: HUDI-6640
Issue Type: Sub-task  (was: New Feature)

> Flink lockless multi-writer
> ---
>
> Key: HUDI-6480
> URL: https://issues.apache.org/jira/browse/HUDI-6480
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6642) Use completion time for file slicing

2023-08-03 Thread Danny Chen (Jira)
Danny Chen created HUDI-6642:


 Summary: Use completion time for file slicing
 Key: HUDI-6642
 URL: https://issues.apache.org/jira/browse/HUDI-6642
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: core
Reporter: Danny Chen
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6641) Remove the log append and always uses the current instant time in file name

2023-08-03 Thread Danny Chen (Jira)
Danny Chen created HUDI-6641:


 Summary: Remove the log append and always uses the current instant 
time in file name
 Key: HUDI-6641
 URL: https://issues.apache.org/jira/browse/HUDI-6641
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: core
Reporter: Danny Chen
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6640) Non-blocking concurrency control

2023-08-03 Thread Danny Chen (Jira)
Danny Chen created HUDI-6640:


 Summary: Non-blocking concurrency control
 Key: HUDI-6640
 URL: https://issues.apache.org/jira/browse/HUDI-6640
 Project: Apache Hudi
  Issue Type: Epic
  Components: core
Reporter: Danny Chen
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] voonhous commented on a diff in pull request #9357: [HUDI-6588] Fix duplicate fileId on TM failover and recovery

2023-08-03 Thread via GitHub


voonhous commented on code in PR #9357:
URL: https://github.com/apache/hudi/pull/9357#discussion_r1283900588


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteFunction.java:
##
@@ -448,7 +448,8 @@ private boolean flushBucket(DataBucket bucket) {
   }
 
   @SuppressWarnings("unchecked, rawtypes")
-  private void flushRemaining(boolean endInput) {
+  @VisibleForTesting
+  protected void flushRemaining(boolean endInput) {

Review Comment:
   Yeap, so we can override this function in `BucketStreamWriteTestFunction`.
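   A generic sketch of the pattern (deliberately not the real StreamWriteFunction/BucketStreamWriteFunction classes, whose constructors are omitted here): widening the method from private to protected lets a test subclass override it to inject failover behaviour.

```
public class FlushOverrideSketch {
  static class WriteFunction {
    // Was private; widened to protected so a test subclass can override it.
    protected void flushRemaining(boolean endInput) {
      System.out.println("production flush, endInput=" + endInput);
    }
  }

  static class TestWriteFunction extends WriteFunction {
    @Override
    protected void flushRemaining(boolean endInput) {
      System.out.println("test hook: simulate failure/recovery before delegating");
      super.flushRemaining(endInput);
    }
  }

  public static void main(String[] args) {
    new TestWriteFunction().flushRemaining(true);
  }
}
```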



##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/bucket/BucketStreamWriteFunctionWithFailOverTest.java:
##
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.sink.bucket;
+
+import org.apache.hudi.client.common.HoodieFlinkEngineContext;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.configuration.FlinkOptions;
+import org.apache.hudi.configuration.OptionsInference;
+import org.apache.hudi.configuration.OptionsResolver;
+import org.apache.hudi.index.bucket.BucketIdentifier;
+import org.apache.hudi.sink.partitioner.BucketIndexPartitioner;
+import org.apache.hudi.sink.transform.Transformer;
+import org.apache.hudi.sink.utils.Pipelines;
+import org.apache.hudi.table.HoodieFlinkTable;
+import org.apache.hudi.util.AvroSchemaConverter;
+import org.apache.hudi.util.JsonDeserializationFunction;
+import org.apache.hudi.util.StreamerUtil;
+import org.apache.hudi.utils.FlinkMiniCluster;
+import org.apache.hudi.utils.TestConfigurations;
+import org.apache.hudi.utils.source.ContinuousFileSource;
+
+import org.apache.flink.api.common.JobStatus;
+import org.apache.flink.api.common.io.FilePathFilter;
+import org.apache.flink.api.common.restartstrategy.RestartStrategies;
+import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.api.java.io.TextInputFormat;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.core.execution.JobClient;
+import org.apache.flink.core.fs.Path;
+import org.apache.flink.streaming.api.CheckpointingMode;
+import org.apache.flink.streaming.api.datastream.DataStream;
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
+import org.apache.flink.table.data.RowData;
+import org.apache.flink.table.types.logical.RowType;
+import org.apache.flink.util.TestLogger;
+import org.apache.hadoop.fs.FileStatus;
+import org.junit.jupiter.api.extension.ExtendWith;
+import org.junit.jupiter.api.io.TempDir;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.ValueSource;
+
+import java.io.File;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.stream.Collectors;
+
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Integration test for Hudi on Flink with failover when performing writes 
using the bucket index.
+ */
+@ExtendWith(FlinkMiniCluster.class)
+public class BucketStreamWriteFunctionWithFailOverTest extends TestLogger {
+
+  @TempDir

Review Comment:
   I read through the tests and I feel they are covering different scenarios.
   
   The IT will fail on the first write instead of after the first write, and will test bucket index instead of bucket index.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Reopened] (HUDI-2141) Integration flink metric in flink stream

2023-08-03 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen reopened HUDI-2141:
--

> Integration flink metric in flink stream
> 
>
> Key: HUDI-2141
> URL: https://issues.apache.org/jira/browse/HUDI-2141
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink, metrics
>Reporter: Zhaojing Yu
>Assignee: Zhaojing Yu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Now hoodie metrics can't work in a Flink stream because they were designed for 
> batch processing; integrate Flink metrics into the Flink stream.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-2141) Integration flink metric in flink stream

2023-08-03 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-2141.

Resolution: Fixed

Fixed via master branch: ade9d0bcb9d7ad7adabfaeb5ff2f42bc0585fdb1

> Integration flink metric in flink stream
> 
>
> Key: HUDI-2141
> URL: https://issues.apache.org/jira/browse/HUDI-2141
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink, metrics
>Reporter: Zhaojing Yu
>Assignee: Zhaojing Yu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Now hoodie metrics can't work in a Flink stream because they were designed for 
> batch processing; integrate Flink metrics into the Flink stream.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated (bc583b41586 -> ade9d0bcb9d)

2023-08-03 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from bc583b41586 [HUDI-6609] Reverting multi writer checkpointing with 
HoodieStreamer (#9312)
 add ade9d0bcb9d [HUDI-2141] Support flink read metrics (#9350)

No new revisions were added by this update.

Summary of changes:
 hudi-flink-datasource/hudi-flink/pom.xml   |  6 ++
 .../hudi/metrics/FlinkStreamReadMetrics.java   | 96 ++
 .../apache/hudi/metrics/HoodieFlinkMetrics.java}   | 29 +++
 .../hudi/source/StreamReadMonitoringFunction.java  | 13 +++
 .../org/apache/hudi/source/StreamReadOperator.java | 14 
 packaging/hudi-flink-bundle/pom.xml|  6 ++
 6 files changed, 146 insertions(+), 18 deletions(-)
 create mode 100644 
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/metrics/FlinkStreamReadMetrics.java
 copy 
hudi-flink-datasource/hudi-flink/src/{test/java/org/apache/hudi/sink/utils/ScalaCollector.java
 => main/java/org/apache/hudi/metrics/HoodieFlinkMetrics.java} (62%)



[GitHub] [hudi] danny0405 merged pull request #9350: [HUDI-2141] Support flink read metrics

2023-08-03 Thread via GitHub


danny0405 merged PR #9350:
URL: https://github.com/apache/hudi/pull/9350


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9357: [HUDI-6588] Fix duplicate fileId on TM failover and recovery

2023-08-03 Thread via GitHub


danny0405 commented on code in PR #9357:
URL: https://github.com/apache/hudi/pull/9357#discussion_r1283890740


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/bucket/BucketStreamWriteFunctionWithFailOverTest.java:
##
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.sink.bucket;
+
+import org.apache.hudi.client.common.HoodieFlinkEngineContext;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.configuration.FlinkOptions;
+import org.apache.hudi.configuration.OptionsInference;
+import org.apache.hudi.configuration.OptionsResolver;
+import org.apache.hudi.index.bucket.BucketIdentifier;
+import org.apache.hudi.sink.partitioner.BucketIndexPartitioner;
+import org.apache.hudi.sink.transform.Transformer;
+import org.apache.hudi.sink.utils.Pipelines;
+import org.apache.hudi.table.HoodieFlinkTable;
+import org.apache.hudi.util.AvroSchemaConverter;
+import org.apache.hudi.util.JsonDeserializationFunction;
+import org.apache.hudi.util.StreamerUtil;
+import org.apache.hudi.utils.FlinkMiniCluster;
+import org.apache.hudi.utils.TestConfigurations;
+import org.apache.hudi.utils.source.ContinuousFileSource;
+
+import org.apache.flink.api.common.JobStatus;
+import org.apache.flink.api.common.io.FilePathFilter;
+import org.apache.flink.api.common.restartstrategy.RestartStrategies;
+import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.api.java.io.TextInputFormat;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.core.execution.JobClient;
+import org.apache.flink.core.fs.Path;
+import org.apache.flink.streaming.api.CheckpointingMode;
+import org.apache.flink.streaming.api.datastream.DataStream;
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
+import org.apache.flink.table.data.RowData;
+import org.apache.flink.table.types.logical.RowType;
+import org.apache.flink.util.TestLogger;
+import org.apache.hadoop.fs.FileStatus;
+import org.junit.jupiter.api.extension.ExtendWith;
+import org.junit.jupiter.api.io.TempDir;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.ValueSource;
+
+import java.io.File;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.stream.Collectors;
+
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Integration test for Hudi on Flink with failover when performing writes 
using the bucket index.
+ */
+@ExtendWith(FlinkMiniCluster.class)
+public class BucketStreamWriteFunctionWithFailOverTest extends TestLogger {
+
+  @TempDir

Review Comment:
   There is no need to add the IT I think, the 
`TestWriteCopyOnWrite#testSubtaskFails` can cover this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9357: [HUDI-6588] Fix duplicate fileId on TM failover and recovery

2023-08-03 Thread via GitHub


danny0405 commented on code in PR #9357:
URL: https://github.com/apache/hudi/pull/9357#discussion_r1283889835


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteFunction.java:
##
@@ -448,7 +448,8 @@ private boolean flushBucket(DataBucket bucket) {
   }
 
   @SuppressWarnings("unchecked, rawtypes")
-  private void flushRemaining(boolean endInput) {
+  @VisibleForTesting
+  protected void flushRemaining(boolean endInput) {

Review Comment:
   Do we need this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9343: Revert "[HUDI-6476] Improve the performance of getAllPartitionPaths (#9121)"

2023-08-03 Thread via GitHub


danny0405 commented on code in PR #9343:
URL: https://github.com/apache/hudi/pull/9343#discussion_r1283888081


##
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java:
##
@@ -168,57 +167,66 @@ private List 
getPartitionPathWithPathPrefixUsingFilterExpression(String
   // TODO: Get the parallelism from HoodieWriteConfig
   int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, 
pathsToList.size());
 
-  // List all directories in parallel:
-  // if current dictionary contains PartitionMetadata, add it to result
-  // if current dictionary does not contain PartitionMetadata, add its 
subdirectory to queue to be processed.
+  // List all directories in parallel
   engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all 
partitions with prefix " + relativePathPrefix);
-  // result below holds a list of pair. first entry in the pair optionally 
holds the deduced list of partitions.
-  // and second entry holds optionally a directory path to be processed 
further.
-  List, Option>> result = 
engineContext.flatMap(pathsToList, path -> {
+  List dirToFileListing = engineContext.flatMap(pathsToList, 
path -> {
 FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) {
-  return 
Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(),
 path)), Option.empty()));
-}
-return Arrays.stream(fileSystem.listStatus(path, p -> {
-  try {
-return fileSystem.isDirectory(p) && 
!p.getName().equals(HoodieTableMetaClient.METAFOLDER_NAME);
-  } catch (IOException e) {
-// noop
-  }
-  return false;
-})).map(status -> Pair.of(Option.empty(), 
Option.of(status.getPath(;
+return Arrays.stream(fileSystem.listStatus(path));
   }, listingParallelism);
   pathsToList.clear();
 
-  partitionPaths.addAll(result.stream().filter(entry -> 
entry.getKey().isPresent())
-  .map(entry -> entry.getKey().get())
-  .filter(relativePartitionPath -> fullBoundExpr instanceof 
Predicates.TrueExpression
-  || (Boolean) fullBoundExpr.eval(
-  extractPartitionValues(partitionFields, 
relativePartitionPath, urlEncodePartitioningEnabled)))
-  .collect(Collectors.toList()));
-
-  Expression partialBoundExpr;
-  // If partitionPaths is nonEmpty, we're already at the last path level, 
and all paths
-  // are filtered already.
-  if (needPushDownExpressions && partitionPaths.isEmpty()) {
-// Here we assume the path level matches the number of partition 
columns, so we'll rebuild
-// new schema based on current path level.
-// e.g. partition columns are , if we're listing the 
second level, then
-// currentSchema would be 
-// `PartialBindVisitor` will bind reference if it can be found from 
`currentSchema`, otherwise
-// will change the expression to `alwaysTrue`. Can see 
`PartialBindVisitor` for details.
-Types.RecordType currentSchema = 
Types.RecordType.get(partitionFields.fields().subList(0, 
++currentPartitionLevel));
-PartialBindVisitor partialBindVisitor = new 
PartialBindVisitor(currentSchema, caseSensitive);
-partialBoundExpr = pushedExpr.accept(partialBindVisitor);
-  } else {
-partialBoundExpr = Predicates.alwaysTrue();
-  }
+  // if current dictionary contains PartitionMetadata, add it to result
+  // if current dictionary does not contain PartitionMetadata, add it to 
queue to be processed.
+  int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, 
dirToFileListing.size());
+  if (!dirToFileListing.isEmpty()) {
+// result below holds a list of pair. first entry in the pair 
optionally holds the deduced list of partitions.
+// and second entry holds optionally a directory path to be processed 
further.
+engineContext.setJobStatus(this.getClass().getSimpleName(), 
"Processing listed partitions");
+List, Option>> result = 
engineContext.map(dirToFileListing, fileStatus -> {
+  FileSystem fileSystem = 
fileStatus.getPath().getFileSystem(hadoopConf.get());
+  if (fileStatus.isDirectory()) {
+if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, 
fileStatus.getPath())) {

Review Comment:
   The revert makes sense because the code freeze is very near, but we still 
need to give a clarification for it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #9327: [HUDI-6617] make HoodieRecordDelegate implement KryoSerializable

2023-08-03 Thread via GitHub


danny0405 commented on PR #9327:
URL: https://github.com/apache/hudi/pull/9327#issuecomment-1664854084

   Tests have passed: 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=19024=results


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9359: [HUDI-6639] Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9359:
URL: https://github.com/apache/hudi/pull/9359#issuecomment-1664852578

   
   ## CI report:
   
   * 3dd8c31863ff6b5dc918a23ff0d13041cbc60bd3 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19048)
 
   * 7308946c4344ca04736b2c83b505d3a159146541 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19050)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9359: [HUDI-6639] Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9359:
URL: https://github.com/apache/hudi/pull/9359#issuecomment-1664847379

   
   ## CI report:
   
   * e8163de397fcdbf1b3194bcf696435be9d28c171 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19045)
 
   * 3dd8c31863ff6b5dc918a23ff0d13041cbc60bd3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19048)
 
   * 7308946c4344ca04736b2c83b505d3a159146541 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9276: [HUDI-6635] Hudi Spark Integration Redesign MOR and Bootstrap reading

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9276:
URL: https://github.com/apache/hudi/pull/9276#issuecomment-1664847144

   
   ## CI report:
   
   * 662f3b320ab6ea06462bad9a4448add1ec2f380a UNKNOWN
   * ef8eaadd4f817aa08253b938e19ab3fa61d27b5c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19046)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8697: [HUDI-5514] Improving usability/performance with out of box default for append only use-cases

2023-08-03 Thread via GitHub


hudi-bot commented on PR #8697:
URL: https://github.com/apache/hudi/pull/8697#issuecomment-1664846409

   
   ## CI report:
   
   * 214bc9b6f9c0d522f4196cca12daf4756dc96439 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19044)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] mansipp commented on a diff in pull request #9347: [HUDI-6638] Upgrade AWS Java SDK to V2

2023-08-03 Thread via GitHub


mansipp commented on code in PR #9347:
URL: https://github.com/apache/hudi/pull/9347#discussion_r1283879950


##
hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java:
##
@@ -143,31 +142,32 @@ public void addPartitionsToTable(String tableName, 
List partitionsToAdd)
 LOG.info("Adding " + partitionsToAdd.size() + " partition(s) in table " + 
tableId(databaseName, tableName));
 try {
   Table table = getTable(awsGlue, databaseName, tableName);
-  StorageDescriptor sd = table.getStorageDescriptor();
+  StorageDescriptor sd = table.storageDescriptor();
   List partitionInputs = 
partitionsToAdd.stream().map(partition -> {
-StorageDescriptor partitionSd = sd.clone();
 String fullPartitionPath = FSUtils.getPartitionPath(getBasePath(), 
partition).toString();
 List partitionValues = 
partitionValueExtractor.extractPartitionValuesInPath(partition);
-partitionSd.setLocation(fullPartitionPath);
-return new 
PartitionInput().withValues(partitionValues).withStorageDescriptor(partitionSd);
+StorageDescriptor partitionSD = sd.copy(copySd -> 
copySd.location(fullPartitionPath));
+return 
PartitionInput.builder().values(partitionValues).storageDescriptor(partitionSD).build();
   }).collect(Collectors.toList());
 
-  List> futures = new ArrayList<>();
+  List> futures = new 
ArrayList<>();
 
   for (List batch : 
CollectionUtils.batches(partitionInputs, MAX_PARTITIONS_PER_REQUEST)) {
-BatchCreatePartitionRequest request = new 
BatchCreatePartitionRequest();
-
request.withDatabaseName(databaseName).withTableName(tableName).withPartitionInputList(batch);
-futures.add(awsGlue.batchCreatePartitionAsync(request));
+BatchCreatePartitionRequest request = 
BatchCreatePartitionRequest.builder()
+
.databaseName(databaseName).tableName(tableName).partitionInputList(batch).build();
+futures.add(awsGlue.batchCreatePartition(request));
   }
 
-  for (Future future : futures) {
-BatchCreatePartitionResult result = future.get();
-if (CollectionUtils.nonEmpty(result.getErrors())) {
-  if (result.getErrors().stream().allMatch((error) -> 
"AlreadyExistsException".equals(error.getErrorDetail().getErrorCode( {
-LOG.warn("Partitions already exist in glue: " + 
result.getErrors());
+  for (CompletableFuture future : futures) {
+BatchCreatePartitionResponse response = future.get();
+if (CollectionUtils.nonEmpty(response.errors())) {
+  if (response.errors().stream()
+  .allMatch(
+  (error) -> 
"AlreadyExistsException".equals(error.errorDetail().errorCode( {

Review Comment:
   We haven't changed anything here other than making it compatible with SDK v2. 
We haven't run any test that specifically exercises this error case.
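   
   As a rough sketch (illustration only, not the PR's test code), the "already exists" 
branch could be exercised directly against the SDK v2 Glue model builders:
   
   ```java
   import software.amazon.awssdk.services.glue.model.ErrorDetail;
   import software.amazon.awssdk.services.glue.model.PartitionError;
   
   import java.util.Arrays;
   import java.util.List;
   
   public class AlreadyExistsCheckSketch {
     public static void main(String[] args) {
       // A PartitionError carrying the "AlreadyExistsException" error code,
       // mirroring what Glue returns when a partition is already registered.
       List<PartitionError> errors = Arrays.asList(
           PartitionError.builder()
               .errorDetail(ErrorDetail.builder().errorCode("AlreadyExistsException").build())
               .build());
       // Same predicate as the sync client uses: warn (instead of fail) only when
       // every returned error is an "already exists" error.
       boolean onlyAlreadyExists = errors.stream()
           .allMatch(e -> "AlreadyExistsException".equals(e.errorDetail().errorCode()));
       System.out.println(onlyAlreadyExists); // prints true
     }
   }
   ```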



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] mansipp commented on a diff in pull request #9347: [HUDI-6638] Upgrade AWS Java SDK to V2

2023-08-03 Thread via GitHub


mansipp commented on code in PR #9347:
URL: https://github.com/apache/hudi/pull/9347#discussion_r1283877492


##
hudi-aws/src/main/java/org/apache/hudi/aws/transaction/lock/DynamoDBBasedLockProvider.java:
##
@@ -153,45 +153,53 @@ public LockItem getLock() {
 return lock;
   }
 
-  private AmazonDynamoDB getDynamoDBClient() {
+  private DynamoDbClient getDynamoDBClient() {
 String region = 
this.dynamoDBLockConfiguration.getString(DynamoDbBasedLockConfig.DYNAMODB_LOCK_REGION);
 String endpointURL = 
this.dynamoDBLockConfiguration.contains(DynamoDbBasedLockConfig.DYNAMODB_ENDPOINT_URL.key())
-  ? 
this.dynamoDBLockConfiguration.getString(DynamoDbBasedLockConfig.DYNAMODB_ENDPOINT_URL)
-  : 
RegionUtils.getRegion(region).getServiceEndpoint(AmazonDynamoDB.ENDPOINT_PREFIX);
-AwsClientBuilder.EndpointConfiguration dynamodbEndpoint =
-new AwsClientBuilder.EndpointConfiguration(endpointURL, region);
-return AmazonDynamoDBClientBuilder.standard()
-.withEndpointConfiguration(dynamodbEndpoint)
-
.withCredentials(HoodieAWSCredentialsProviderFactory.getAwsCredentialsProvider(dynamoDBLockConfiguration.getProps()))
+? 
this.dynamoDBLockConfiguration.getString(DynamoDbBasedLockConfig.DYNAMODB_ENDPOINT_URL)
+: 
DynamoDbClient.serviceMetadata().endpointFor(Region.of(region)).toString();
+
+if (!endpointURL.startsWith("https://") || !endpointURL.startsWith("http://")) {

Review Comment:
   We get this error if we don't specify http:// or https:// in SDK v2. Based on this 
doc, DynamoDB supports both http and https, so we added a check for both: 
https://docs.aws.amazon.com/general/latest/gr/ddb.html
   ```
   Caused by: java.lang.NullPointerException: The URI scheme of 
endpointOverride must not be null.
   at 
org.apache.hudi.software.amazon.awssdk.utils.Validate.paramNotNull(Validate.java:156)
   at 
org.apache.hudi.software.amazon.awssdk.core.client.builder.SdkDefaultClientBuilder.endpointOverride(SdkDefaultClientBuilder.java:445)
   at 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.getDynamoDBClient(DynamoDBBasedLockProvider.java:163)
   at 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.(DynamoDBBasedLockProvider.java:87)
   at 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider.(DynamoDBBasedLockProvider.java:77)
   ... 65 more
   ``` 
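   
   As a rough sketch (the helper name and the https default are assumptions, not part 
of the PR), the endpoint override could be normalized before it reaches the SDK v2 
builder:
   
   ```java
   import java.net.URI;
   
   public final class EndpointUtilSketch {
   
     private EndpointUtilSketch() {
     }
   
     // Prepends "https://" when the configured endpoint carries no scheme, so that
     // DynamoDbClient.builder().endpointOverride(URI) does not fail with
     // "The URI scheme of endpointOverride must not be null".
     public static URI normalizeEndpoint(String endpointUrl) {
       String url = endpointUrl.trim();
       if (!url.startsWith("http://") && !url.startsWith("https://")) {
         url = "https://" + url;
       }
       return URI.create(url);
     }
   }
   ```
   
   For example, normalizeEndpoint("dynamodb.us-east-1.amazonaws.com") yields 
https://dynamodb.us-east-1.amazonaws.com, while URLs that already carry a scheme are 
passed through unchanged.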



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on pull request #9347: [HUDI-6638] Upgrade AWS Java SDK to V2

2023-08-03 Thread via GitHub


yihua commented on PR #9347:
URL: https://github.com/apache/hudi/pull/9347#issuecomment-1664836797

   @mansipp could you also check the Azure CI failure?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #9347: [HUDI-6638] Upgrade AWS Java SDK to V2

2023-08-03 Thread via GitHub


yihua commented on code in PR #9347:
URL: https://github.com/apache/hudi/pull/9347#discussion_r1283849039


##
pom.xml:
##
@@ -130,6 +130,8 @@
 1.5.6
 0.16
 0.8.0
+4.5.13
+4.4.13

Review Comment:
   Any reason we pick different versions here?



##
hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java:
##
@@ -143,31 +142,32 @@ public void addPartitionsToTable(String tableName, 
List partitionsToAdd)
 LOG.info("Adding " + partitionsToAdd.size() + " partition(s) in table " + 
tableId(databaseName, tableName));
 try {
   Table table = getTable(awsGlue, databaseName, tableName);
-  StorageDescriptor sd = table.getStorageDescriptor();
+  StorageDescriptor sd = table.storageDescriptor();
   List partitionInputs = 
partitionsToAdd.stream().map(partition -> {
-StorageDescriptor partitionSd = sd.clone();
 String fullPartitionPath = FSUtils.getPartitionPath(getBasePath(), 
partition).toString();
 List partitionValues = 
partitionValueExtractor.extractPartitionValuesInPath(partition);
-partitionSd.setLocation(fullPartitionPath);
-return new 
PartitionInput().withValues(partitionValues).withStorageDescriptor(partitionSd);
+StorageDescriptor partitionSD = sd.copy(copySd -> 
copySd.location(fullPartitionPath));
+return 
PartitionInput.builder().values(partitionValues).storageDescriptor(partitionSD).build();
   }).collect(Collectors.toList());
 
-  List> futures = new ArrayList<>();
+  List> futures = new 
ArrayList<>();
 
   for (List batch : 
CollectionUtils.batches(partitionInputs, MAX_PARTITIONS_PER_REQUEST)) {
-BatchCreatePartitionRequest request = new 
BatchCreatePartitionRequest();
-
request.withDatabaseName(databaseName).withTableName(tableName).withPartitionInputList(batch);
-futures.add(awsGlue.batchCreatePartitionAsync(request));
+BatchCreatePartitionRequest request = 
BatchCreatePartitionRequest.builder()
+
.databaseName(databaseName).tableName(tableName).partitionInputList(batch).build();
+futures.add(awsGlue.batchCreatePartition(request));
   }
 
-  for (Future future : futures) {
-BatchCreatePartitionResult result = future.get();
-if (CollectionUtils.nonEmpty(result.getErrors())) {
-  if (result.getErrors().stream().allMatch((error) -> 
"AlreadyExistsException".equals(error.getErrorDetail().getErrorCode( {
-LOG.warn("Partitions already exist in glue: " + 
result.getErrors());
+  for (CompletableFuture future : futures) {
+BatchCreatePartitionResponse response = future.get();
+if (CollectionUtils.nonEmpty(response.errors())) {
+  if (response.errors().stream()
+  .allMatch(
+  (error) -> 
"AlreadyExistsException".equals(error.errorDetail().errorCode( {

Review Comment:
   I assume the error name is the same here and this should still work.  Is the 
error case tested?



##
hudi-utilities/pom.xml:
##
@@ -450,6 +450,14 @@
   test
 
 
+
+
+
+  software.amazon.awssdk
+  sqs
+  ${aws.sdk.version}
+

Review Comment:
   Same question as before: 
https://github.com/apache/hudi/pull/8441/files#r1164644285
   Should this be moved to `hudi-aws` module (`hudi-utilities` module has 
already relied on `hudi-aws` module)?



##
hudi-aws/src/main/java/org/apache/hudi/aws/transaction/lock/DynamoDBBasedLockProvider.java:
##
@@ -153,45 +153,53 @@ public LockItem getLock() {
 return lock;
   }
 
-  private AmazonDynamoDB getDynamoDBClient() {
+  private DynamoDbClient getDynamoDBClient() {
 String region = 
this.dynamoDBLockConfiguration.getString(DynamoDbBasedLockConfig.DYNAMODB_LOCK_REGION);
 String endpointURL = 
this.dynamoDBLockConfiguration.contains(DynamoDbBasedLockConfig.DYNAMODB_ENDPOINT_URL.key())
-  ? 
this.dynamoDBLockConfiguration.getString(DynamoDbBasedLockConfig.DYNAMODB_ENDPOINT_URL)
-  : 
RegionUtils.getRegion(region).getServiceEndpoint(AmazonDynamoDB.ENDPOINT_PREFIX);
-AwsClientBuilder.EndpointConfiguration dynamodbEndpoint =
-new AwsClientBuilder.EndpointConfiguration(endpointURL, region);
-return AmazonDynamoDBClientBuilder.standard()
-.withEndpointConfiguration(dynamodbEndpoint)
-
.withCredentials(HoodieAWSCredentialsProviderFactory.getAwsCredentialsProvider(dynamoDBLockConfiguration.getProps()))
+? 
this.dynamoDBLockConfiguration.getString(DynamoDbBasedLockConfig.DYNAMODB_ENDPOINT_URL)
+: 
DynamoDbClient.serviceMetadata().endpointFor(Region.of(region)).toString();
+
+if (!endpointURL.startsWith("https://") || !endpointURL.startsWith("http://")) {

Review Comment:
   could the endpoint URL start without the HTTP prefix?



-- 
This is an automated message from the Apache Git 

[GitHub] [hudi] leesf commented on a diff in pull request #9199: [HUDI-6534]Support consistent hashing row writer

2023-08-03 Thread via GitHub


leesf commented on code in PR #9199:
URL: https://github.com/apache/hudi/pull/9199#discussion_r1283870818


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/ConsistentBucketBulkInsertDataInternalWriterHelper.java:
##
@@ -0,0 +1,120 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.ConsistentHashingNode;
+import org.apache.hudi.common.model.HoodieConsistentHashingMetadata;
+import org.apache.hudi.common.model.HoodieFileGroupId;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.index.bucket.ConsistentBucketIdentifier;
+import org.apache.hudi.index.bucket.ConsistentBucketIndexUtils;
+import org.apache.hudi.io.storage.row.HoodieRowCreateHandle;
+import org.apache.hudi.table.HoodieTable;
+import 
org.apache.hudi.table.action.cluster.util.ConsistentHashingUpdateStrategyUtils;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.StructType;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.Collections;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * Helper class for native row writer for bulk_insert with consistent hashing 
bucket index.
+ */
+public class ConsistentBucketBulkInsertDataInternalWriterHelper extends 
BucketBulkInsertDataInternalWriterHelper {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(ConsistentBucketBulkInsertDataInternalWriterHelper.class);
+
+  public ConsistentBucketBulkInsertDataInternalWriterHelper(HoodieTable 
hoodieTable, HoodieWriteConfig writeConfig,

Review Comment:
   The constructor has too many args; can we wrap them in a class?
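   
   A minimal sketch of the parameter-object idea (all names below are illustrative 
assumptions, not existing Hudi classes):
   
   ```java
   // Groups the write-context arguments so the helper constructors can take one
   // object instead of a long positional argument list. Field names are examples.
   public final class BulkInsertWriterContext {
     private final String instantTime;
     private final int taskPartitionId;
     private final long taskId;
     private final long taskEpochId;
     private final boolean populateMetaFields;
     private final boolean arePartitionRecordsSorted;
   
     public BulkInsertWriterContext(String instantTime, int taskPartitionId, long taskId,
                                    long taskEpochId, boolean populateMetaFields,
                                    boolean arePartitionRecordsSorted) {
       this.instantTime = instantTime;
       this.taskPartitionId = taskPartitionId;
       this.taskId = taskId;
       this.taskEpochId = taskEpochId;
       this.populateMetaFields = populateMetaFields;
       this.arePartitionRecordsSorted = arePartitionRecordsSorted;
     }
   
     public String getInstantTime() {
       return instantTime;
     }
     // remaining getters omitted for brevity
   }
   ```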



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] leesf commented on a diff in pull request #9199: [HUDI-6534]Support consistent hashing row writer

2023-08-03 Thread via GitHub


leesf commented on code in PR #9199:
URL: https://github.com/apache/hudi/pull/9199#discussion_r1283870066


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BucketBulkInsertDataInternalWriterHelper.java:
##
@@ -65,7 +71,6 @@ public void write(InternalRow row) throws IOException {
   int bucketId = BucketIdentifier.getBucketId(String.valueOf(recordKey), 
indexKeyFields, bucketNum);
   Pair fileId = Pair.of(partitionPath, bucketId);
   if (lastFileId == null || !lastFileId.equals(fileId)) {
-LOG.info("Creating new file for partition path " + partitionPath);

Review Comment:
   this log is useless then?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] leesf commented on a diff in pull request #9199: [HUDI-6534]Support consistent hashing row writer

2023-08-03 Thread via GitHub


leesf commented on code in PR #9199:
URL: https://github.com/apache/hudi/pull/9199#discussion_r1283867267


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/ConsistentBucketIndexBulkInsertPartitionerWithRows.java:
##
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.ConsistentHashingNode;
+import org.apache.hudi.common.model.HoodieConsistentHashingMetadata;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.index.bucket.ConsistentBucketIdentifier;
+import org.apache.hudi.index.bucket.ConsistentBucketIndexUtils;
+import org.apache.hudi.index.bucket.HoodieSparkConsistentBucketIndex;
+import org.apache.hudi.keygen.BuiltinKeyGenerator;
+import org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.hudi.table.ConsistentHashingBucketInsertPartitioner;
+import org.apache.hudi.table.HoodieTable;
+
+import org.apache.spark.Partitioner;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import scala.Tuple2;
+
+/**
+ * Bulk_insert partitioner of Spark row using consistent hashing bucket index.
+ */
+public class ConsistentBucketIndexBulkInsertPartitionerWithRows
+implements BulkInsertPartitioner>, 
ConsistentHashingBucketInsertPartitioner {
+
+  private final HoodieTable table;
+
+  private final String indexKeyFields;
+
+  private final List fileIdPfxList = new ArrayList<>();
+  private final Map> hashingChildrenNodes;
+
+  private Map partitionToIdentifier;
+
+  private final Option keyGeneratorOpt;
+
+  private Map> partitionToFileIdPfxIdxMap;
+
+  private final RowRecordKeyExtractor extractor;
+
+  public ConsistentBucketIndexBulkInsertPartitionerWithRows(HoodieTable table, 
boolean populateMetaFields) {
+this.indexKeyFields = table.getConfig().getBucketIndexHashField();
+this.table = table;
+this.hashingChildrenNodes = new HashMap<>();
+if (!populateMetaFields) {
+  this.keyGeneratorOpt = 
HoodieSparkKeyGeneratorFactory.getKeyGenerator(table.getConfig().getProps());
+} else {
+  this.keyGeneratorOpt = Option.empty();
+}
+this.extractor = 
RowRecordKeyExtractor.getRowRecordKeyExtractor(populateMetaFields, 
keyGeneratorOpt);
+
ValidationUtils.checkArgument(table.getMetaClient().getTableType().equals(HoodieTableType.MERGE_ON_READ),
+"Consistent hash bucket index doesn't support CoW table");
+  }
+
+  private ConsistentBucketIdentifier getBucketIdentifier(String partition) {
+HoodieSparkConsistentBucketIndex index = 
(HoodieSparkConsistentBucketIndex) table.getIndex();
+HoodieConsistentHashingMetadata metadata = 
ConsistentBucketIndexUtils.loadOrCreateMetadata(this.table, partition, 
index.getNumBuckets());
+if (hashingChildrenNodes.containsKey(partition)) {
+  metadata.setChildrenNodes(hashingChildrenNodes.get(partition));
+}
+return new ConsistentBucketIdentifier(metadata);
+  }
+
+  @Override
+  public Dataset repartitionRecords(Dataset rows, int 
outputPartitions) {
+JavaRDD rowJavaRDD = rows.toJavaRDD();
+prepareRepartition(rowJavaRDD);
+Partitioner partitioner = new Partitioner() {
+  @Override
+  public int getPartition(Object key) {
+return (int) key;
+  }
+
+  @Override
+  public int numPartitions() {
+return fileIdPfxList.size();
+  }
+};
+
+return rows.sparkSession().createDataFrame(rowJavaRDD
+.mapToPair(row -> new Tuple2<>(getBucketId(row), row))
+.partitionBy(partitioner)
+.values(), rows.schema());
+  }
+
+  /**
+   * Prepare consistent hashing metadata for repartition
+   *
+   * @param rows input records
+   */
+  private void prepareRepartition(JavaRDD rows) {
+this.partitionToIdentifier = 

[GitHub] [hudi] leesf commented on a diff in pull request #9199: [HUDI-6534]Support consistent hashing row writer

2023-08-03 Thread via GitHub


leesf commented on code in PR #9199:
URL: https://github.com/apache/hudi/pull/9199#discussion_r1283865112


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/SparkConsistentBucketClusteringExecutionStrategy.java:
##
@@ -79,15 +94,19 @@ public HoodieData 
performClusteringWithRecordsRDD(HoodieData partitioner = new 
RDDConsistentBucketBulkInsertPartitioner<>(getHoodieTable(), strategyParams, 
preserveHoodieMetadata);
+addHashingChildNodes(partitioner, extraMetadata);
+
+return (HoodieData) SparkBulkInsertHelper.newInstance()
+.bulkInsert(inputRecords, instantTime, getHoodieTable(), newConfig, 
false, partitioner, true, numOutputGroups);
+  }
+
+  private void addHashingChildNodes(ConsistentHashingBucketInsertPartitioner 
partitioner, Map extraMetadata) {
 try {
   List nodes = 
ConsistentHashingNode.fromJsonString(extraMetadata.get(BaseConsistentHashingBucketClusteringPlanStrategy.METADATA_CHILD_NODE_KEY));

Review Comment:
   Will the nodes ever be null here, and do we need a null check in addHashingChildrenNodes?
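   
   A rough sketch of the guard being asked about, using only JDK types (the real code 
would parse with ConsistentHashingNode.fromJsonString and pass the result to the 
partitioner; the class below is illustration only):
   
   ```java
   import java.util.Collections;
   import java.util.List;
   import java.util.Map;
   
   public final class ChildNodeGuardSketch {
   
     // Returns the child-node payload, or an empty list when the clustering plan's
     // extra metadata carries no entry for the given key. Parsing is stubbed out here.
     static List<String> childNodesOrEmpty(Map<String, String> extraMetadata, String key) {
       if (extraMetadata == null) {
         return Collections.emptyList();
       }
       String json = extraMetadata.get(key);
       if (json == null || json.isEmpty()) {
         return Collections.emptyList();
       }
       return Collections.singletonList(json); // placeholder for the parsed nodes
     }
   
     public static void main(String[] args) {
       System.out.println(childNodesOrEmpty(null, "child.node.key"));      // []
       System.out.println(childNodesOrEmpty(Collections.emptyMap(), "k")); // []
     }
   }
   ```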



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] amrishlal commented on a diff in pull request #9359: [HUDI-6639] Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation

2023-08-03 Thread via GitHub


amrishlal commented on code in PR #9359:
URL: https://github.com/apache/hudi/pull/9359#discussion_r1283857986


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala:
##
@@ -194,9 +194,9 @@ trait ProvidesHoodieConfig extends Logging {
 val insertMode = 
InsertMode.of(combinedOpts.getOrElse(DataSourceWriteOptions.SQL_INSERT_MODE.key,
   DataSourceWriteOptions.SQL_INSERT_MODE.defaultValue()))
 val insertModeSet = combinedOpts.contains(SQL_INSERT_MODE.key)
-val sqlWriteOperationOpt = combinedOpts.get(SQL_WRITE_OPERATION.key())
+val sqlWriteOperationOpt = 
combinedOpts.get(SPARK_SQL_INSERT_INTO_OPERATION.key())
 val sqlWriteOperationSet = sqlWriteOperationOpt.nonEmpty
-val sqlWriteOperation = 
sqlWriteOperationOpt.getOrElse(SQL_WRITE_OPERATION.defaultValue())
+val sqlWriteOperation = 
sqlWriteOperationOpt.getOrElse(SPARK_SQL_INSERT_INTO_OPERATION.defaultValue())

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9199: [HUDI-6534]Support consistent hashing row writer

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9199:
URL: https://github.com/apache/hudi/pull/9199#issuecomment-1664818603

   
   ## CI report:
   
   * abf8721378220e0d669aee49599a231ceff34e19 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19030)
 
   * 884a71af797b71a7f5818472884e45f39f758328 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19049)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #9327: [HUDI-6617] make HoodieRecordDelegate implement KryoSerializable

2023-08-03 Thread via GitHub


nsivabalan commented on PR #9327:
URL: https://github.com/apache/hudi/pull/9327#issuecomment-1664806952

   @prashantwason can you review 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9359: [HUDI-6639] Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9359:
URL: https://github.com/apache/hudi/pull/9359#issuecomment-1664806676

   
   ## CI report:
   
   * e8163de397fcdbf1b3194bcf696435be9d28c171 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19045)
 
   * 3dd8c31863ff6b5dc918a23ff0d13041cbc60bd3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19048)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9199: [HUDI-6534]Support consistent hashing row writer

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9199:
URL: https://github.com/apache/hudi/pull/9199#issuecomment-1664805693

   
   ## CI report:
   
   * abf8721378220e0d669aee49599a231ceff34e19 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19030)
 
   * 884a71af797b71a7f5818472884e45f39f758328 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #8441: Upgrade aws java sdk to v2

2023-08-03 Thread via GitHub


yihua commented on code in PR #8441:
URL: https://github.com/apache/hudi/pull/8441#discussion_r1283850979


##
hudi-aws/src/main/java/org/apache/hudi/aws/credentials/HoodieAWSCredentialsProviderFactory.java:
##
@@ -30,16 +30,23 @@
  * Factory class for Hoodie AWSCredentialsProvider.
  */
 public class HoodieAWSCredentialsProviderFactory {
-  public static AWSCredentialsProvider getAwsCredentialsProvider(Properties 
props) {
+  public static AwsCredentialsProvider getAwsCredentialsProvider(Properties 
props) {
 return getAwsCredentialsProviderChain(props);
   }
 
-  private static AWSCredentialsProvider 
getAwsCredentialsProviderChain(Properties props) {
-List providers = new ArrayList<>();
-providers.add(new HoodieConfigAWSCredentialsProvider(props));
-providers.add(new DefaultAWSCredentialsProviderChain());
-AWSCredentialsProviderChain providerChain = new 
AWSCredentialsProviderChain(providers);
-providerChain.setReuseLastProvider(true);
+  private static AwsCredentialsProvider 
getAwsCredentialsProviderChain(Properties props) {
+List providers = new ArrayList<>();
+HoodieConfigAWSCredentialsProvider hoodieConfigAWSCredentialsProvider = 
new HoodieConfigAWSCredentialsProvider(props);
+if (hoodieConfigAWSCredentialsProvider.resolveCredentials() != null) {
+  providers.add(hoodieConfigAWSCredentialsProvider);

Review Comment:
   Got it.  Then we can keep it this way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9359: [HUDI-6639] Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9359:
URL: https://github.com/apache/hudi/pull/9359#issuecomment-1664798077

   
   ## CI report:
   
   * e8163de397fcdbf1b3194bcf696435be9d28c171 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19045)
 
   * 3dd8c31863ff6b5dc918a23ff0d13041cbc60bd3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9308: [HUDI-6606] Use record level index with SQL equality queries

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9308:
URL: https://github.com/apache/hudi/pull/9308#issuecomment-1664797937

   
   ## CI report:
   
   * ae0002b81c71f77e0c19aeb3e5872b0eb399ddb6 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19043)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #9359: [HUDI-6639] Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation

2023-08-03 Thread via GitHub


nsivabalan commented on code in PR #9359:
URL: https://github.com/apache/hudi/pull/9359#discussion_r1283833724


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala:
##
@@ -194,9 +194,9 @@ trait ProvidesHoodieConfig extends Logging {
 val insertMode = 
InsertMode.of(combinedOpts.getOrElse(DataSourceWriteOptions.SQL_INSERT_MODE.key,
   DataSourceWriteOptions.SQL_INSERT_MODE.defaultValue()))
 val insertModeSet = combinedOpts.contains(SQL_INSERT_MODE.key)
-val sqlWriteOperationOpt = combinedOpts.get(SQL_WRITE_OPERATION.key())
+val sqlWriteOperationOpt = 
combinedOpts.get(SPARK_SQL_INSERT_INTO_OPERATION.key())
 val sqlWriteOperationSet = sqlWriteOperationOpt.nonEmpty
-val sqlWriteOperation = 
sqlWriteOperationOpt.getOrElse(SQL_WRITE_OPERATION.defaultValue())
+val sqlWriteOperation = 
sqlWriteOperationOpt.getOrElse(SPARK_SQL_INSERT_INTO_OPERATION.defaultValue())

Review Comment:
   Minor: can we rename the method deduceSqlWriteOperation to 
deduceSparkSqlInsertIntoWriteOperation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] amrishlal commented on a diff in pull request #9324: [HUDI-6619] [WIP] Fix hudi-integ-test-bundle dependency on jackson jsk310 package.

2023-08-03 Thread via GitHub


amrishlal commented on code in PR #9324:
URL: https://github.com/apache/hudi/pull/9324#discussion_r1283805598


##
pom.xml:
##
@@ -98,8 +98,6 @@
 
${fasterxml.spark3.version}
 
${fasterxml.spark3.version}
 
${fasterxml.spark3.version}
-
-

Review Comment:
   @xushiyan Fixed. Also verified that the correct version of the JSR package 
(matching what is included in the Spark 3.x runtime) is being included.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9261: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9261:
URL: https://github.com/apache/hudi/pull/9261#issuecomment-1664733692

   
   ## CI report:
   
   * 7d85ec69d154b5b02c81737212f415fc1aeded91 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19034)
 
   * 7e7efc78003b0e8ef5c2d809276796b7b987a35b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19047)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9261: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9261:
URL: https://github.com/apache/hudi/pull/9261#issuecomment-1664728409

   
   ## CI report:
   
   * 7d85ec69d154b5b02c81737212f415fc1aeded91 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19034)
 
   * 7e7efc78003b0e8ef5c2d809276796b7b987a35b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9347: [HUDI-6638] Upgrade AWS Java SDK to v2

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9347:
URL: https://github.com/apache/hudi/pull/9347#issuecomment-1664723220

   
   ## CI report:
   
   * 5c1391571fbed3ce391399b7848f33b629455941 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19041)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9276: [HUDI-6635] Hudi Spark Integration Redesign MOR and Bootstrap reading

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9276:
URL: https://github.com/apache/hudi/pull/9276#issuecomment-1664686919

   
   ## CI report:
   
   * 662f3b320ab6ea06462bad9a4448add1ec2f380a UNKNOWN
   * 1875a19fd05f413373eb1f2400f390706d62725e Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19042)
 
   * ef8eaadd4f817aa08253b938e19ab3fa61d27b5c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19046)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9276: [HUDI-6635] Hudi Spark Integration Redesign MOR and Bootstrap reading

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9276:
URL: https://github.com/apache/hudi/pull/9276#issuecomment-1664679688

   
   ## CI report:
   
   * 662f3b320ab6ea06462bad9a4448add1ec2f380a UNKNOWN
   * af768285ff85e68a361482806a5de1e0b9c272a9 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19035)
 
   * 1875a19fd05f413373eb1f2400f390706d62725e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19042)
 
   * ef8eaadd4f817aa08253b938e19ab3fa61d27b5c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] mansipp commented on pull request #9347: [HUDI-6638] Upgrade AWS Java SDK to v2

2023-08-03 Thread via GitHub


mansipp commented on PR #9347:
URL: https://github.com/apache/hudi/pull/9347#issuecomment-1664649658

   Manually tested the s3a path using an EMR cluster.
   
   ```scala
   spark-shell \
   --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
   --conf 
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog"
 \
   --conf 
"spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" \
   --conf "spark.sql.hive.convertMetastoreParquet=false" \
   --jars 
/usr/lib/hudi/hudi-aws-bundle-0.13.1-amzn-1-SNAPSHOT.jar,/usr/lib/hudi/hudi-spark3-bundle_2.12-0.13.1-amzn-1-SNAPSHOT.jar
   ```
   
   ```scala
   import org.apache.spark.sql.SaveMode
   import org.apache.spark.sql.functions._
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.DataSourceReadOptions
   import org.apache.hudi.config.HoodieWriteConfig
   import org.apache.hudi.hive.MultiPartKeysValueExtractor
   import org.apache.hudi.hive.HiveSyncConfig
   import org.apache.hudi.sync.common.HoodieSyncConfig
   
   // Create a DataFrame
   var tableName = "mansi_s3a_hudi_test"
   var tablePath = "s3a:///tables/" + tableName
   val inputDF = Seq(
("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
("105", "2015-01-02", "2015-01-01T13:51:42.248818Z")
).toDF("id", "creation_date", "last_update_time")
   
   //Specify common DataSourceWriteOptions in the single hudiOptions variable 
   val hudiOptions = Map[String,String](
 HoodieWriteConfig.TABLE_NAME -> tableName,
 DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE", 
 DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
 DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "creation_date",
 DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "last_update_time",
 DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
 DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> tableName,
 DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "creation_date",
 DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> 
classOf[MultiPartKeysValueExtractor].getName
   )
   
   // Write the DataFrame as a Hudi dataset
   (inputDF.write
   .format("org.apache.hudi")
   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
   .options(hudiOptions)
   .mode(SaveMode.Overwrite)
   .save(tablePath))
   
   ``` 
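   
   As a quick sanity check, one could read the table back afterwards (a minimal sketch, assuming the same Spark session and the `tablePath` defined above):
   
   ```scala
   // Read the freshly written Hudi table back from s3a and confirm the rows landed.
   val readDF = spark.read
     .format("org.apache.hudi")
     .load(tablePath)
   
   // Expect the same six ids that were inserted above.
   readDF.select("id", "creation_date", "last_update_time").orderBy("id").show(false)
   ```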
   
![Image](https://github.com/apache/hudi/assets/107222979/672d5057-cbe5-43f7-b927-838510b1220c)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9359: [HUDI-6639] Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9359:
URL: https://github.com/apache/hudi/pull/9359#issuecomment-1664628275

   
   ## CI report:
   
   * e8163de397fcdbf1b3194bcf696435be9d28c171 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19045)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #9276: [HUDI-6635] Hudi Spark Integration Redesign MOR and Bootstrap reading

2023-08-03 Thread via GitHub


jonvex commented on code in PR #9276:
URL: https://github.com/apache/hudi/pull/9276#discussion_r1283702970


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBootstrapMORRelation.scala:
##
@@ -58,10 +59,13 @@ case class HoodieBootstrapMORRelation(override val 
sqlContext: SQLContext,
   override val optParams: Map[String, 
String],
   private val prunedDataSchema: 
Option[StructType] = None)
   extends BaseHoodieBootstrapRelation(sqlContext, userSchema, globPaths, 
metaClient,
-optParams, prunedDataSchema) {
+optParams, prunedDataSchema) with SparkAdapterSupport {
 
   override type Relation = HoodieBootstrapMORRelation
 
+  protected val mergeType: String = 
optParams.getOrElse(DataSourceReadOptions.REALTIME_MERGE.key,
+DataSourceReadOptions.REALTIME_MERGE.defaultValue)
+

Review Comment:
   Nope. Removed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on a diff in pull request #9324: [HUDI-6619] [WIP] Fix hudi-integ-test-bundle dependency on jackson jsk310 package.

2023-08-03 Thread via GitHub


xushiyan commented on code in PR #9324:
URL: https://github.com/apache/hudi/pull/9324#discussion_r1283702886


##
pom.xml:
##
@@ -98,8 +98,6 @@
 
${fasterxml.spark3.version}
 
${fasterxml.spark3.version}
 
${fasterxml.spark3.version}
-
-

Review Comment:
   As this variable is removed, can you do a global search and make sure all refs 
are updated?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9359: [HUDI-6639] Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9359:
URL: https://github.com/apache/hudi/pull/9359#issuecomment-1664617802

   
   ## CI report:
   
   * e8163de397fcdbf1b3194bcf696435be9d28c171 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9357: [HUDI-6588] Fix duplicate fileId on TM failover and recovery

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9357:
URL: https://github.com/apache/hudi/pull/9357#issuecomment-1664608150

   
   ## CI report:
   
   * d8e159b823f516f584802bd3dacdaa782f185854 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19038)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #9276: [HUDI-6635] Hudi Spark Integration Redesign MOR and Bootstrap reading

2023-08-03 Thread via GitHub


jonvex commented on code in PR #9276:
URL: https://github.com/apache/hudi/pull/9276#discussion_r1283684686


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##
@@ -247,11 +247,23 @@ object DefaultSource {
   Option(schema)
 }
 
+
+
+
 if (metaClient.getCommitsTimeline.filterCompletedInstants.countInstants() 
== 0) {
   new EmptyRelation(sqlContext, resolveSchema(metaClient, parameters, 
Some(schema)))
 } else if (isCdcQuery) {
   CDCRelation.getCDCRelation(sqlContext, metaClient, parameters)
 } else {
+  lazy val newHudiFileFormatUtils = if 
(!parameters.getOrElse(LEGACY_HUDI_PARQUET_FILE_FORMAT.key,
+LEGACY_HUDI_PARQUET_FILE_FORMAT.defaultValue).toBoolean && (globPaths 
== null || globPaths.isEmpty)
+&& parameters.getOrElse(REALTIME_MERGE.key(), 
REALTIME_MERGE.defaultValue())
+.equalsIgnoreCase(REALTIME_PAYLOAD_COMBINE_OPT_VAL)) {

Review Comment:
   Yes. I wasn't able to get it to work correctly before the code freeze



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6639) Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation

2023-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6639:
-
Labels: pull-request-available  (was: )

> Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation
> ---
>
> Key: HUDI-6639
> URL: https://issues.apache.org/jira/browse/HUDI-6639
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Amrish Lal
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] amrishlal opened a new pull request, #9359: [HUDI-6639] Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation

2023-08-03 Thread via GitHub


amrishlal opened a new pull request, #9359:
URL: https://github.com/apache/hudi/pull/9359

   Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation
   
   ### Change Logs
   
   Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation
   
   ### Impact
   
   Rename config to better describe what it is for.
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [X] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [X] Change Logs and Impact were stated clearly
   - [X] Adequate tests were added if applicable
   - [X] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6639) Rename hoodie.sql.write.operation to hoodie.spark.sql.insert.into.operation

2023-08-03 Thread Amrish Lal (Jira)
Amrish Lal created HUDI-6639:


 Summary: Rename hoodie.sql.write.operation to 
hoodie.spark.sql.insert.into.operation
 Key: HUDI-6639
 URL: https://issues.apache.org/jira/browse/HUDI-6639
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Amrish Lal






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs

2023-08-03 Thread Krishen Bhan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750879#comment-17750879
 ] 

Krishen Bhan edited comment on HUDI-6596 at 8/3/23 8:15 PM:


Thanks for the reply!
> The table lock could become a bottleneck, potentially leading to performance 
> issues as other operations might be blocked too. It might be useful to 
> consider how frequently you expect concurrent rollbacks to occur and whether 
> this might create a performance problem.

Based on our use case at least, I would expect attempted concurrent rollbacks 
to be uncommon / an edge case for now, and more rare as my organization 
internally makes more fixes/changes to our orchestration process. 
But you bring up a good concern: because this locks around the scheduleRollback 
call, it will block other Hudi writers from progressing while it generates the 
rollback plan, which might be relatively time-consuming 
(since in Spark, HUDI 0.10 can end up launching a new Spark stage) at least 
compared to other places where HUDI holds the table lock. If this becomes a 
concern for other HUDI users (and we want HUDI OCC locking to avoid holding 
table lock while reading/creating an arbitrary/unbounded number of 
instants/data files in parallel), one solution I can think of would be to 
refactor HoodieTable scheduleRollback call such that we can first create the 
rollback plan before acquiring the table lock (in my proposed approach), and 
then later if we actually call HoodieTable scheduleRollback we just pass in 
this existing rollback plan. This way the actual "work" needed to generate 
rollback plan is done beforehand before locking, and we only check and update 
instant file(s) while under lock (and since in my proposed approach we anyway 
only schedule rollback if none has ever been scheduled before, I think that we 
shouldn't have to worry about the rollback plan we created becoming 
invalid/stale). Though I'm not sure how feasible this is, since it may 
affect public APIs.

> Can we ensure rollbacks are idempotent in case of repeated failures or 
> retries?

Yes, making sure rollbacks are idempotent (in the sense that a pending rollback 
can keep on being retried until success) is a must. Both the current/original 
implementation and the proposed approach should address this. Both 
approaches/implementations handle the case where rollback is pending but the 
instant to rollback is gone (where we need to re-materialize the instant info 
and tell the rollback execution to not actually delete instants). Also, in the 
proposed approach here in step 3 we are directly using the pending rollback 
instant that we observe in the timeline, if one exists. Unfortunately the logic 
and my phrasing for step 3 is a bit awkward, since because the caller can pass 
in a 
pendingRollbackInfo
that it expects to be executed, I decided that we couldn't just ignore it, but 
rather we needed to make validate that this pendingRollbackInfo is the same as 
the pending rollback instant we just saw in the timeline, and abort the 
rollback if not. 

> Worth considering edge cases where heartbeats could become stale or be missed 
> (e.g., if a job crashes without properly closing its heartbeat). Handling 
> these scenarios gracefully will help ensure that rollbacks can still proceed 
> when needed. 

Yeah, if a failed rollback job doesn't clean up the heartbeat after failure, 
any rollback attempt right after (within the interval of  `[heartbeat timeout * 
(allowed heartbeat misses + 1)]` I think) will fail. The alternative (that I 
can think of) would be to simply allow for the chance of 2+ rollback jobs to 
work on the same rollback instant. The issue though is that even if the 
HoodieTable executeRollback implementation prevents the dataset from being 
corrupted and will just produce a retry-able failure, I thought that it would 
be noisy/tricky for a user to understand/debug. So I decided to add Step 4, 
since from my perspective/understanding it was a tradeoff between "always 
failing with an easy to understand retry-able exception" versus "rarely failing 
with a hard to understand retry-able exception". 


was (Author: JIRAUSER301521):
Thanks for the reply!
> The table lock could become a bottleneck, potentially leading to performance 
> issues as other operations might be blocked too. It might be useful to 
> consider how frequently you expect concurrent rollbacks to occur and whether 
> this might create a performance problem.

Based on our use case at least, I would expect attempted concurrent rollbacks 
to be uncommon / an edge case for now, and more rare as my organization 
internally makes more fixes/changes to our orchestration process. 
But you bring up a good concern that because this locks around the 
scheduledRollback call , it will block other hudi writers  from progressing 
while it generates the rollback plan, 

[jira] [Commented] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs

2023-08-03 Thread Krishen Bhan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750879#comment-17750879
 ] 

Krishen Bhan commented on HUDI-6596:


Thanks for the reply!
> The table lock could become a bottleneck, potentially leading to performance 
> issues as other operations might be blocked too. It might be useful to 
> consider how frequently you expect concurrent rollbacks to occur and whether 
> this might create a performance problem.

Based on our use case at least, I would expect attempted concurrent rollbacks 
to be uncommon / an edge case for now, and more rare as my organization 
internally makes more fixes/changes to our orchestration process. 
But you bring up a good concern: because this locks around the scheduleRollback 
call, it will block other Hudi writers from progressing while it generates the 
rollback plan, which might be relatively time-consuming 
(since in Spark, HUDI 0.10 can end up launching a new Spark stage) at least 
compared to other places where HUDI holds the table lock. If this becomes a 
concern for other HUDI users (and we want HUDI OCC locking to avoid holding 
table lock while reading/creating an arbitrary/unbounded number of 
instants/data files in parallel), one solution I can think of would be to 
refactor HoodieTable scheduleRollback call such that we can first create the 
rollback plan before acquiring the table lock (in my proposed approach), and 
then later if we actually call HoodieTable scheduleRollback we just pass in 
this existing rollback plan. This way the actual "work" needed to generate 
rollback plan is done beforehand before locking, and we only check and update 
instant file(s) while under lock (and since in my proposed approach we anyway 
only schedule rollback if none has ever been scheduled before, I think that we 
shouldn't have to worry about the rollback plan we created becoming 
invalid/stale). Though I'm not sure how feasible this is, since it may 
affect public APIs.

> Can we ensure rollbacks are idempotent in case of repeated failures or 
> retries?

Yes, making sure rollbacks are idempotent (in the sense that a pending rollback 
can keep on being retried until success) is a must. Both the current/original 
implementation and the proposed approach should address this. Both 
approaches/implementations handle the case where rollback is pending but the 
instant to rollback is gone (where we need to re-materialize the instant info 
and tell the rollback execution to not actually delete instants). Also, in the 
proposed approach here in step 3 we are directly using the pending rollback 
instant that we observe in the timeline, if one exists. Unfortunately the logic 
and my phrasing for step 3 are a bit awkward: because the caller can pass in a 
pendingRollbackInfo that it expects to be executed, I decided that we couldn't 
just ignore it, but rather we needed to validate that this pendingRollbackInfo 
is the same as the pending rollback instant we just saw in the timeline, and abort the 
rollback if not. 

> Worth considering edge cases where heartbeats could become stale or be missed 
> (e.g., if a job crashes without properly closing its heartbeat). Handling 
> these scenarios gracefully will help ensure that rollbacks can still proceed 
> when needed. 

Yeah, if a failed rollback job doesn't clean up the heartbeat after failure, 
any rollback attempt right after (within the interval of  `[heartbeat timeout * 
(allowed heartbeat misses + 1)]` I think) will fail. The alternative (that I 
can think of) would be to simply allow for the chance of 2+ rollback jobs to 
work on the same rollback instant. The issue though is that even if the 
HoodieTable executeRollback implementation prevents the dataset from being 
corrupted and will just produce a retry-able failure, I thought that it would 
be noisy/tricky for a user to understand/debug. So I decided to add Step 4, 
since from my perspective/understanding it was a tradeoff between "always 
failing with an easy to understand retry-able exception" versus "rarely failing 
with a hard to understand retry-able exception". 
Though I think the chance of "stale heartbeat" can be lowered by updating the 
commit code path to clean the heartbeat if an exception is raised (this isn't 
done in 0.10 I think, but not sure if it's still like this in 0.14). In fact my 
organization internally does this with our forked/modified version of 
clustering replacecommit (we have modified replacecommit to not have an 
immutable plan, similar to commit). Though of course this isn't a guarantee, 
since a HUDI Spark job writer might still fail with a lower-level runtime error.
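
To make the proposed ordering concrete, a rough self-contained sketch (all helper names here, such as buildRollbackPlan, schedule and tryStartHeartbeat, are hypothetical stand-ins, not actual Hudi APIs) of "build the plan outside the lock, check-and-schedule under the lock, execute under a heartbeat":

```scala
// Hypothetical sketch of the proposed rollback flow; none of these helpers are real Hudi APIs.
import java.util.concurrent.locks.ReentrantLock

final case class RollbackPlan(instantToRollback: String)

class RollbackCoordinator(tableLock: ReentrantLock) {

  // Expensive plan generation happens *before* the table lock is taken.
  def prepareAndSchedule(instantToRollback: String,
                         pendingRollbackOnTimeline: Option[String],
                         buildRollbackPlan: String => RollbackPlan,
                         schedule: RollbackPlan => String): String = {
    val plan = buildRollbackPlan(instantToRollback) // may launch a Spark stage, so keep it outside the lock
    tableLock.lock()
    try {
      // Under the lock we only inspect/update instant files: reuse a pending rollback
      // if one already exists, otherwise write the newly requested rollback instant.
      pendingRollbackOnTimeline.getOrElse(schedule(plan))
    } finally {
      tableLock.unlock()
    }
  }

  // Heartbeat guard so two jobs don't execute the same rollback concurrently.
  def executeWithHeartbeat(rollbackInstant: String,
                           tryStartHeartbeat: String => Boolean,
                           execute: String => Unit,
                           stopHeartbeat: String => Unit): Unit = {
    if (!tryStartHeartbeat(rollbackInstant)) {
      throw new IllegalStateException(
        s"Active heartbeat found for $rollbackInstant; another rollback job may be running, retry later")
    }
    try execute(rollbackInstant)
    finally stopHeartbeat(rollbackInstant) // always clean up so later retries aren't blocked by a stale heartbeat
  }
}
```

This mirrors the tradeoff described above: the lock only covers cheap instant-file checks, while the heartbeat turns the rare concurrent-rollback case into an easy-to-understand, retry-able failure.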

>  Propose rollback implementation changes to guard against concurrent jobs
> -
>
> Key: HUDI-6596
> URL: 

[GitHub] [hudi] yihua commented on a diff in pull request #9276: [HUDI-6635] Hudi Spark Integration Redesign MOR and Bootstrap reading

2023-08-03 Thread via GitHub


yihua commented on code in PR #9276:
URL: https://github.com/apache/hudi/pull/9276#discussion_r1283625541


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##
@@ -247,11 +247,23 @@ object DefaultSource {
   Option(schema)
 }
 
+
+
+

Review Comment:
   nit: remove empty lines



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##
@@ -262,16 +274,28 @@ object DefaultSource {
   new IncrementalRelation(sqlContext, parameters, userSchema, 
metaClient)
 
 case (MERGE_ON_READ, QUERY_TYPE_SNAPSHOT_OPT_VAL, false) =>
-  new MergeOnReadSnapshotRelation(sqlContext, parameters, metaClient, 
globPaths, userSchema)
+  if (newHudiFileFormatUtils.isEmpty || 
newHudiFileFormatUtils.get.hasSchemaOnRead) {

Review Comment:
   Instead of checking `newHudiFileFormatUtils.get.hasSchemaOnRead`, should 
`hoodie.schema.on.read.enable` config be checked when lazy creating 
`newHudiFileFormatUtils` and only check `newHudiFileFormatUtils.isEmpty` here 
for simplicity?
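   
   A rough illustration of that shape (generic placeholder types and config-key strings, not the actual Hudi classes), folding every precondition, including schema-on-read, into the lazy `Option` so call sites only test `isEmpty`:
   
   ```scala
   // Generic sketch only: placeholder types and keys, not Hudi's real classes.
   final case class FormatUtils(options: Map[String, String])
   
   def buildFormatUtils(options: Map[String, String], globPaths: Seq[String]): Option[FormatUtils] = {
     // All gating checks live here, so callers need no extra hasSchemaOnRead check.
     val useLegacyFormat = options.getOrElse("hoodie.datasource.read.use.legacy.file.format", "true").toBoolean
     val schemaOnRead    = options.getOrElse("hoodie.schema.on.read.enable", "false").toBoolean
     val payloadCombine  = options.getOrElse("hoodie.datasource.merge.type", "payload_combine")
       .equalsIgnoreCase("payload_combine")
     if (!useLegacyFormat && !schemaOnRead && globPaths.isEmpty && payloadCombine) Some(FormatUtils(options))
     else None
   }
   
   // Call sites then collapse to a single check, e.g.:
   //   buildFormatUtils(parameters, globPaths) match {
   //     case Some(utils) => /* new file format relation */
   //     case None        => /* legacy relation */
   //   }
   ```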



##
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/HoodieCatalystPlansUtils.scala:
##
@@ -77,6 +78,9 @@ trait HoodieCatalystPlansUtils {
*/
   def unapplyMergeIntoTable(plan: LogicalPlan): Option[(LogicalPlan, 
LogicalPlan, Expression)]
 
+
+  def applyNewHoodieParquetFileFormatProjection(plan: LogicalPlan): LogicalPlan

Review Comment:
   nit: add docs to explain what this projection does, e.g. resolving the 
schema based on the resolver, and why this is needed for the new file format.



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##
@@ -262,16 +274,28 @@ object DefaultSource {
   new IncrementalRelation(sqlContext, parameters, userSchema, 
metaClient)
 
 case (MERGE_ON_READ, QUERY_TYPE_SNAPSHOT_OPT_VAL, false) =>
-  new MergeOnReadSnapshotRelation(sqlContext, parameters, metaClient, 
globPaths, userSchema)
+  if (newHudiFileFormatUtils.isEmpty || 
newHudiFileFormatUtils.get.hasSchemaOnRead) {
+new MergeOnReadSnapshotRelation(sqlContext, parameters, 
metaClient, globPaths, userSchema)
+  } else {
+newHudiFileFormatUtils.get.getHadoopFsRelation(isMOR = true, 
isBootstrap = false)
+  }
 
 case (MERGE_ON_READ, QUERY_TYPE_INCREMENTAL_OPT_VAL, _) =>
   new MergeOnReadIncrementalRelation(sqlContext, parameters, 
metaClient, userSchema)
 
 case (MERGE_ON_READ, QUERY_TYPE_SNAPSHOT_OPT_VAL, true) =>
-  new HoodieBootstrapMORRelation(sqlContext, userSchema, globPaths, 
metaClient, parameters)
+  if (newHudiFileFormatUtils.isEmpty || 
newHudiFileFormatUtils.get.hasSchemaOnRead) {

Review Comment:
   Similar here and below.



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##
@@ -87,6 +87,14 @@ object DataSourceReadOptions {
   s"payload implementation to merge (${REALTIME_PAYLOAD_COMBINE_OPT_VAL}) 
or skip merging altogether" +
   s"${REALTIME_SKIP_MERGE_OPT_VAL}")
 
+  val LEGACY_HUDI_PARQUET_FILE_FORMAT: ConfigProperty[String] = ConfigProperty

Review Comment:
   ```suggestion
 val USE_LEGACY_HUDI_PARQUET_FILE_FORMAT: ConfigProperty[String] = 
ConfigProperty
   ```



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##
@@ -247,11 +247,23 @@ object DefaultSource {
   Option(schema)
 }
 
+
+
+
 if (metaClient.getCommitsTimeline.filterCompletedInstants.countInstants() 
== 0) {
   new EmptyRelation(sqlContext, resolveSchema(metaClient, parameters, 
Some(schema)))
 } else if (isCdcQuery) {
   CDCRelation.getCDCRelation(sqlContext, metaClient, parameters)
 } else {
+  lazy val newHudiFileFormatUtils = if 
(!parameters.getOrElse(LEGACY_HUDI_PARQUET_FILE_FORMAT.key,
+LEGACY_HUDI_PARQUET_FILE_FORMAT.defaultValue).toBoolean && (globPaths 
== null || globPaths.isEmpty)
+&& parameters.getOrElse(REALTIME_MERGE.key(), 
REALTIME_MERGE.defaultValue())
+.equalsIgnoreCase(REALTIME_PAYLOAD_COMBINE_OPT_VAL)) {

Review Comment:
   Is there any issue with `REALTIME_SKIP_MERGE_OPT_VAL` merge type?



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -144,22 +146,41 @@ case class HoodieFileIndex(spark: SparkSession,
 // Prune the partition path by the partition filters
 // NOTE: Non-partitioned tables are assumed to consist from a single 
partition
 //   encompassing the whole table
-val prunedPartitions = listMatchingPartitionPaths(partitionFilters)
+val prunedPartitions = if (shouldBroadcast) {
+  
listMatchingPartitionPaths(convertFilterForTimestampKeyGenerator(metaClient, 
partitionFilters))
+} else {
+  

[GitHub] [hudi] hudi-bot commented on pull request #8697: [HUDI-5514] Improving usability/performance with out of box default for append only use-cases

2023-08-03 Thread via GitHub


hudi-bot commented on PR #8697:
URL: https://github.com/apache/hudi/pull/8697#issuecomment-1664552015

   
   ## CI report:
   
   * 7a0e3786173a124177688946c37011577fe97478 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18751)
 
   * 214bc9b6f9c0d522f4196cca12daf4756dc96439 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19044)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8697: [HUDI-5514] Improving usability/performance with out of box default for append only use-cases

2023-08-03 Thread via GitHub


hudi-bot commented on PR #8697:
URL: https://github.com/apache/hudi/pull/8697#issuecomment-1664533286

   
   ## CI report:
   
   * 7a0e3786173a124177688946c37011577fe97478 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18751)
 
   * 214bc9b6f9c0d522f4196cca12daf4756dc96439 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9337: [HUDI-6628] Rely on methods in HoodieBaseFile and HoodieLogFile instead of FSUtils when possible

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9337:
URL: https://github.com/apache/hudi/pull/9337#issuecomment-1664517346

   
   ## CI report:
   
   * 306b6c94e2f4793f91ae9b6ffa3f102c8bc2a18e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19037)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9276: [HUDI-6635] Hudi Spark Integration Redesign MOR and Bootstrap reading

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9276:
URL: https://github.com/apache/hudi/pull/9276#issuecomment-1664516647

   
   ## CI report:
   
   * 662f3b320ab6ea06462bad9a4448add1ec2f380a UNKNOWN
   * af768285ff85e68a361482806a5de1e0b9c272a9 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19035)
 
   * 1875a19fd05f413373eb1f2400f390706d62725e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19042)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9308: [HUDI-6606] Use record level index with SQL equality queries

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9308:
URL: https://github.com/apache/hudi/pull/9308#issuecomment-1664516962

   
   ## CI report:
   
   * 5195761aea8a8d9cfc046978601d10a246507da8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19028)
 
   * ae0002b81c71f77e0c19aeb3e5872b0eb399ddb6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19043)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9261: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table

2023-08-03 Thread via GitHub


hudi-bot commented on PR #9261:
URL: https://github.com/apache/hudi/pull/9261#issuecomment-1664516511

   
   ## CI report:
   
   * 7d85ec69d154b5b02c81737212f415fc1aeded91 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19034)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch asf-site updated: [DOCS] Update bootstrap page (#9338)

2023-08-03 Thread bhavanisudha
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 8cd33fa3ad8 [DOCS] Update bootstrap page (#9338)
8cd33fa3ad8 is described below

commit 8cd33fa3ad842afd900b1aef00e93fdf1bbe6e7f
Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com>
AuthorDate: Thu Aug 3 12:05:54 2023 -0700

[DOCS] Update bootstrap page (#9338)

* [HUDI-6112] Fix bugs in Doc Generation tool

- Add Config Param in Description
- Styling changes to fix table size and toc on side for better navigation
- Bug fix in basic configs page to merge spark datasource related read and 
write configs

* [DOCS] Update bootstrap page with configs
---
 website/docs/migration_guide.md| 49 +++---
 website/src/theme/DocPage/index.js |  2 +-
 2 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/website/docs/migration_guide.md b/website/docs/migration_guide.md
index 31cbdbcb956..ce36f3a38f0 100644
--- a/website/docs/migration_guide.md
+++ b/website/docs/migration_guide.md
@@ -3,6 +3,9 @@ title: Bootstrapping
 keywords: [ hudi, migration, use case]
 summary: In this page, we will discuss some available tools for migrating your 
existing table into a Hudi table
 last_modified_at: 2019-12-30T15:59:57-04:00
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
 ---
 
 Hudi maintains metadata such as commit timeline and indexes to manage a table. 
The commit timelines helps to understand the actions happening on a table as 
well as the current state of a table. Indexes are used by Hudi to maintain a 
record key to file id mapping to efficiently locate a record. At the moment, 
Hudi supports writing only parquet columnar formats.
@@ -35,12 +38,20 @@ Import your existing table into a Hudi managed table. Since 
all the data is Hudi
 
 There are a few options when choosing this approach.
 
-**Option 1**
-Use the HoodieStreamer tool. HoodieStreamer supports bootstrap with 
--run-bootstrap command line option. There are two types of bootstrap,
-METADATA_ONLY and FULL_RECORD. METADATA_ONLY will generate just skeleton base 
files with keys/footers, avoiding full cost of rewriting the dataset.
-FULL_RECORD will perform a full copy/rewrite of the data as a Hudi table.
+ Using Hudi Streamer
+
+Use the [Hudi Streamer](/docs/hoodie_deltastreamer#hudi-streamer) tool. 
HoodieStreamer supports bootstrap with 
+--run-bootstrap command line option. There are two types of bootstrap, 
METADATA_ONLY and FULL_RECORD. METADATA_ONLY will
+generate just skeleton base files with keys/footers, avoiding full cost of 
rewriting the dataset. FULL_RECORD will 
+perform a full copy/rewrite of the data as a Hudi table. Additionally, one can choose selective partitions using regex
+patterns to apply one of the above bootstrap modes. 
+
+Here is an example for running FULL_RECORD bootstrap on all partitions that 
match the regex pattern `.*` and keeping 
+hive style partition with HoodieStreamer. This example configures 
+[hoodie.bootstrap.mode.selector](https://hudi.apache.org/docs/configurations#hoodiebootstrapmodeselector)
 to 
+`org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector`  which 
allows applying `FULL_RECORD` bootstrap 
+mode to selective partitions based on the regex pattern 
[hoodie.bootstrap.mode.selector.regex](https://hudi.apache.org/docs/configurations#hoodiebootstrapmodeselectorregex)
 
-Here is an example for running FULL_RECORD bootstrap and keeping hive style 
partition with HoodieStreamer.
 ```
 spark-submit --master local \
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
@@ -54,13 +65,14 @@ spark-submit --master local \
 --hoodie-conf hoodie.datasource.write.partitionpath.field=${PARTITION_FIELD} \
 --hoodie-conf hoodie.datasource.write.precombine.field=${PRECOMBINE_FILED} \
 --hoodie-conf 
hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.SimpleKeyGenerator \
---hoodie-conf 
hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider
 \
 --hoodie-conf 
hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector
 \
+--hoodie-conf hoodie.bootstrap.mode.selector.regex='.*' \
 --hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=FULL_RECORD \
 --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true
 ``` 
 
-**Option 2**
+ Using Spark Datasource Writer
+
 For huge tables, this could be as simple as : 
 ```java
 for partition in [list of partitions in source table] {
@@ -69,7 +81,12 @@ for partition in [list of partitions in source table] {
 }
 ```  
 
-**Option 3**
+ Using Spark SQL CALL Procedure
+
+Refer to [Bootstrap 
procedure](https://hudi.apache.org/docs/next/procedures#bootstrap) for more 
details. 
+

[GitHub] [hudi] bhasudha merged pull request #9338: [DOCS] Update bootstrap page

2023-08-03 Thread via GitHub


bhasudha merged PR #9338:
URL: https://github.com/apache/hudi/pull/9338


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] amrishlal commented on a diff in pull request #9324: [HUDI-6619] [WIP] Fix hudi-integ-test-bundle dependency on jackson jsk310 package.

2023-08-03 Thread via GitHub


amrishlal commented on code in PR #9324:
URL: https://github.com/apache/hudi/pull/9324#discussion_r1283598675


##
pom.xml:
##
@@ -98,8 +98,6 @@
 
${fasterxml.spark3.version}
 
${fasterxml.spark3.version}
 
${fasterxml.spark3.version}
-
-

Review Comment:
   @danny0405 For now, we seem to be shading all the jackson packages and hence 
it seems like a good idea to shade the Jackson JSR310 package as well to keep the 
versions across jackson packages consistent (otherwise testing may become 
problematic). But yes, as a future item we could look into not shading any of 
the jackson packages at all for spark versions (>= 3.1 I think) where Jackson 
is already being included. We would also need to confirm on the Flink side (as 
I was seeing some errors in Flink validate bundles tests). For now though we 
would like to get our integration test setup up and running as that is blocked 
by this PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] bhasudha commented on pull request #9338: [DOCS] Update bootstrap page

2023-08-03 Thread via GitHub


bhasudha commented on PR #9338:
URL: https://github.com/apache/hudi/pull/9338#issuecomment-1664460516

   ![Screenshot 2023-08-03 at 11 40 19 
AM](https://github.com/apache/hudi/assets/2179254/2ef881a4-6537-4465-9a03-9e4fe7cf99f8)
   ![Screenshot 2023-08-03 at 11 40 28 
AM](https://github.com/apache/hudi/assets/2179254/06ce18e7-66fd-4534-9a18-ab188ab84de8)
   ![Screenshot 2023-08-03 at 11 40 36 
AM](https://github.com/apache/hudi/assets/2179254/99d417e7-630c-4bdb-b36a-6615874df51b)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


