[GitHub] [hudi] SteNicholas opened a new pull request, #7940: [HUDI-5787] HoodieHiveCatalog should not delete data for dropping external table

2023-02-14 Thread via GitHub


SteNicholas opened a new pull request, #7940:
URL: https://github.com/apache/hudi/pull/7940

   ### Change Logs
   
   `HoodieHiveCatalog` should not delete data when dropping a Hive external 
table, for example when the value of the 
`hoodie.datasource.hive_sync.create_managed_table` config is false.
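
   For illustration only, here is a minimal sketch of the intended behaviour 
against the plain Hive metastore client; it is not the actual 
`HoodieHiveCatalog` code, and the `DropTableSketch` class is made up for this 
example: data is purged only when the table is MANAGED.

   ```java
   import org.apache.hadoop.hive.metastore.IMetaStoreClient;
   import org.apache.hadoop.hive.metastore.TableType;
   import org.apache.hadoop.hive.metastore.api.Table;

   // Hypothetical sketch -- not the actual HoodieHiveCatalog implementation.
   public class DropTableSketch {
     public static void dropTable(IMetaStoreClient client, String db, String tableName) throws Exception {
       Table hiveTable = client.getTable(db, tableName);
       // Only managed tables should have their underlying files deleted.
       boolean isManaged = TableType.MANAGED_TABLE.name().equalsIgnoreCase(hiveTable.getTableType());
       // deleteData = isManaged keeps the data files when the table is EXTERNAL.
       client.dropTable(db, tableName, /* deleteData */ isManaged, /* ignoreUnknownTab */ false);
     }
   }
   ```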
   
   ### Impact
   
   `HoodieHiveCatalog` drops the external table without deleting the data.
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [x] CI passed





[jira] [Updated] (HUDI-5787) HoodieHiveCatalog should not delete data for dropping external table

2023-02-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5787:
-
Labels: pull-request-available  (was: )

> HoodieHiveCatalog should not delete data for dropping external table
> 
>
> Key: HUDI-5787
> URL: https://issues.apache.org/jira/browse/HUDI-5787
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> HoodieHiveCatalog should not delete data when dropping a Hive external 
> table, for example when the value of the 
> 'hoodie.datasource.hive_sync.create_managed_table' config is false.





[GitHub] [hudi] eric9204 closed pull request #7205: [HUDI-5094] modify avro schema for drop partition fields case

2023-02-14 Thread via GitHub


eric9204 closed pull request #7205: [HUDI-5094] modify avro schema for drop 
partition fields case
URL: https://github.com/apache/hudi/pull/7205





[GitHub] [hudi] hudi-bot commented on pull request #7940: [HUDI-5787] HoodieHiveCatalog should not delete data for dropping external table

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7940:
URL: https://github.com/apache/hudi/pull/7940#issuecomment-1429334086

   
   ## CI report:
   
   * 5e9308f176e728950ddfe41a931b54eae4e6f40a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7868: [HUDI-1593] Add support for "show restores" and "show restore" commands in hudi-cli

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7868:
URL: https://github.com/apache/hudi/pull/7868#issuecomment-1429341996

   
   ## CI report:
   
   * 943c91a266397e07f0aa10289ed78ef277ce42f9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15149)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7940: [HUDI-5787] HoodieHiveCatalog should not delete data for dropping external table

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7940:
URL: https://github.com/apache/hudi/pull/7940#issuecomment-1429342471

   
   ## CI report:
   
   * 5e9308f176e728950ddfe41a931b54eae4e6f40a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15154)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] stream2000 closed pull request #7918: [MINOR] Fix spark sql run clean do not exit

2023-02-14 Thread via GitHub


stream2000 closed pull request #7918: [MINOR] Fix spark sql run clean do not 
exit
URL: https://github.com/apache/hudi/pull/7918





[GitHub] [hudi] gaoshihang opened a new pull request, #7941: [HUDI-5786] Add a new config to specific spark write rdd storage level

2023-02-14 Thread via GitHub


gaoshihang opened a new pull request, #7941:
URL: https://github.com/apache/hudi/pull/7941

   ### Change Logs
   
   In `BaseSparkCommitActionExecutor.java`, this RDD is hard-coded to be 
persisted with `MEMORY_AND_DISK_SER`:
   `// TODO: Consistent contract in HoodieWriteClient regarding preppedRecord storage level handling
   JavaRDD<HoodieRecord<T>> inputRDD = HoodieJavaRDD.getJavaRDD(inputRecords);
   if (inputRDD.getStorageLevel() == StorageLevel.NONE()) {
     inputRDD.persist(StorageLevel.MEMORY_AND_DISK_SER());
   } else {
     LOG.info("RDD PreppedRecords was persisted at: " + inputRDD.getStorageLevel());
   }`
   But if we want to change its storage level, there is currently no parameter 
for doing so.
   
   ### Impact
   Adds a new config, `hoodie.spark.write.storage.level`, to specify the 
storage level of this RDD.
   1. The config's default value is `MEMORY_AND_DISK_SER`.
   
![image](https://user-images.githubusercontent.com/20013931/218693697-6cd84e79-31ed-4bd2-bf4a-42add13fd9dd.png)
   
   2. The config can be set to another storage level, e.g. 
`"hoodie.spark.write.storage.level": "DISK_ONLY"`.
   
![image](https://user-images.githubusercontent.com/20013931/218694143-b38042df-e089-47c1-97fa-ea07aa911ab7.png)
   
   3. If the storage level is invalid, an exception is thrown; for example, 
setting `"hoodie.spark.write.storage.level": "DISKE_ONLY"` results in
   `Caused by: java.lang.IllegalArgumentException: Invalid StorageLevel: DISKE_ONLY`
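
   For illustration, a minimal sketch (not the code in this PR) of how the 
hard-coded persist could honour the proposed `hoodie.spark.write.storage.level` 
value; the `StorageLevelSketch` wrapper is invented for the example:

   ```java
   import org.apache.spark.api.java.JavaRDD;
   import org.apache.spark.storage.StorageLevel;

   // Illustrative sketch only -- not the code from this PR.
   public class StorageLevelSketch {
     // Persist the RDD with the level named by hoodie.spark.write.storage.level.
     public static <T> JavaRDD<T> persistIfNeeded(JavaRDD<T> inputRDD, String configuredLevel) {
       if (inputRDD.getStorageLevel() == StorageLevel.NONE()) {
         // StorageLevel.fromString throws IllegalArgumentException for typos such as "DISKE_ONLY".
         inputRDD.persist(StorageLevel.fromString(configuredLevel));
       }
       return inputRDD;
     }
   }
   ```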
   
   ### Risk level (write none, low medium or high below)
   Low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Updated] (HUDI-5786) Add a new config to specifies the cache level for the rdd spark write to hudi

2023-02-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5786:
-
Labels: pull-request-available  (was: )

> Add a new config to specifies the cache level for the rdd spark write to hudi
> -
>
> Key: HUDI-5786
> URL: https://issues.apache.org/jira/browse/HUDI-5786
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: ShiHang Gao
>Priority: Major
>  Labels: pull-request-available
>
> Before building the workload profile, partitioning, and writing data, the 
> cache level of the Spark RDD is hard-coded to MEMORY_AND_DISK_SER; a new 
> configuration is added so that this storage level can be set.





[GitHub] [hudi] Zouxxyy commented on a diff in pull request #7872: [HUDI-5716] Cleaning up `Partitioner`s hierarchy

2023-02-14 Thread via GitHub


Zouxxyy commented on code in PR #7872:
URL: https://github.com/apache/hudi/pull/7872#discussion_r1105531548


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RowCustomColumnsSortPartitioner.java:
##
@@ -19,43 +19,70 @@
 package org.apache.hudi.execution.bulkinsert;
 
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.table.BulkInsertPartitioner;
-
+import org.apache.spark.sql.Column;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
+import scala.collection.JavaConverters;
 
 import java.util.Arrays;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+import static 
org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner.getOrderByColumnNames;
 
 /**
- * A partitioner that does sorting based on specified column values for each 
spark partitions.
+ * A partitioner that does local sorting for each RDD partition based on the 
tuple of
+ * values of the columns configured for ordering.
  */
-public class RowCustomColumnsSortPartitioner implements 
BulkInsertPartitioner> {
+public class RowCustomColumnsSortPartitioner extends 
RepartitioningBulkInsertPartitionerBase> {
 
-  private final String[] sortColumnNames;
+  private final String[] orderByColumnNames;
 
-  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config) {
-this.sortColumnNames = getSortColumnName(config);
+  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config, 
HoodieTableConfig tableConfig) {
+super(tableConfig);
+this.orderByColumnNames = getOrderByColumnNames(config);
+
+checkState(orderByColumnNames.length > 0);
   }
 
-  public RowCustomColumnsSortPartitioner(String[] columnNames) {
-this.sortColumnNames = columnNames;
+  public RowCustomColumnsSortPartitioner(String[] columnNames, 
HoodieTableConfig tableConfig) {
+super(tableConfig);
+this.orderByColumnNames = columnNames;
+
+checkState(orderByColumnNames.length > 0);
   }
 
   @Override
-  public Dataset repartitionRecords(Dataset records, int 
outputSparkPartitions) {
-final String[] sortColumns = this.sortColumnNames;
-return records.sort(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
sortColumns)
-.coalesce(outputSparkPartitions);
+  public Dataset repartitionRecords(Dataset dataset, int 
targetPartitionNumHint) {
+Dataset repartitionedDataset;
+
+// NOTE: In case of partitioned table even "global" ordering (across all 
RDD partitions) could
+//   not change table's partitioning and therefore there's no point in 
doing global sorting
+//   across "physical" partitions, and instead we can reduce total 
amount of data being
+//   shuffled by doing do "local" sorting:

Review Comment:
   As far as I know, RowCustomColumnsSortPartitioner will only be used in 
`cluster`. At this time, the files in the same FG should already be in one 
physical partition.



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/NonSortPartitioner.java:
##
@@ -31,38 +29,15 @@
  * 

Review Comment:
   `enforceNumOutputPartitions` has been deleted, maybe the expression here can 
be modified



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/SparkBulkInsertPartitionerBase.java:
##
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.HoodieUnsafeUtils$;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession$;
+
+public abstract class SparkBulkInsertPartitionerBase implements 
BulkInsertPartitioner {
+
+  protected static Dataset tryCoalesce(Dataset dataset, int 
targetPartitionNumHint) {
+// NOTE: In case incoming [[Dataset]]'s partition count matches the target 
one,
+//   we short

[GitHub] [hudi] kazdy commented on pull request #7935: [DRAFT] Add maven-build-cache-extension

2023-02-14 Thread via GitHub


kazdy commented on PR #7935:
URL: https://github.com/apache/hudi/pull/7935#issuecomment-1429415229

   @xushiyan would you be interested in introducing this extension?
   
   Although just adding the extension works, it comes with some configurations 
worth reviewing (excluding files etc.).
   It will require changes in CI, like bumping mvn to 3.9, and a decision on 
whether to use a remote cache to speed up builds or to build clean with no 
cache in CI. With a remote cache, it could reduce CI time a bit and leave more 
time for running tests.
   I also see that mvn install failed in CI with an OOM, so that can be an issue.
   
   Anyway, it seems like a pretty big change, and it rather feels like something 
for Hudi committers/maintainers?
   What do you think? I'm willing to help, of course :)





[GitHub] [hudi] Zouxxyy commented on a diff in pull request #7872: [HUDI-5716] Cleaning up `Partitioner`s hierarchy

2023-02-14 Thread via GitHub


Zouxxyy commented on code in PR #7872:
URL: https://github.com/apache/hudi/pull/7872#discussion_r1105505809


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/SparkBulkInsertPartitionerBase.java:
##
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.HoodieUnsafeUtils$;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession$;
+
+public abstract class SparkBulkInsertPartitionerBase implements 
BulkInsertPartitioner {
+
+  protected static Dataset tryCoalesce(Dataset dataset, int 
targetPartitionNumHint) {
+// NOTE: In case incoming [[Dataset]]'s partition count matches the target 
one,
+//   we short-circuit coalescing altogether (since this isn't done by 
Spark itself)
+if (targetPartitionNumHint > 0 && targetPartitionNumHint != 
HoodieUnsafeUtils$.MODULE$.getNumPartitions(dataset)) {
+  return dataset.coalesce(targetPartitionNumHint);
+}
+
+return dataset;
+  }
+
+  protected static  JavaRDD> 
tryCoalesce(JavaRDD> records,
+int 
targetPartitionNumHint) {
+// NOTE: In case incoming [[RDD]]'s partition count matches the target one,
+//   we short-circuit coalescing altogether (since this isn't done by 
Spark itself)
+if (targetPartitionNumHint > 0 && targetPartitionNumHint != 
records.getNumPartitions()) {

Review Comment:
   When `targetPartitionNumHint` > `records.getNumPartitions()`, `coalesce` is 
meaningless; maybe we can use `targetPartitionNumHint < 
records.getNumPartitions()` as the condition.
   
   In addition, I actually have a question: if `targetPartitionNumHint > 
records.getNumPartitions()`, should we use `repartition` instead of `coalesce`?
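
   A standalone Spark example (not Hudi code) illustrating the point above: 
`coalesce` never increases the partition count, so going from N to M > N 
partitions needs `repartition`, which shuffles:

   ```java
   import java.util.Arrays;

   import org.apache.spark.api.java.JavaRDD;
   import org.apache.spark.api.java.JavaSparkContext;

   public class CoalesceVsRepartition {
     public static void main(String[] args) {
       JavaSparkContext sc = new JavaSparkContext("local[2]", "coalesce-demo");
       JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4), 2);

       System.out.println(rdd.coalesce(4).getNumPartitions());    // still 2: coalesce cannot add partitions
       System.out.println(rdd.repartition(4).getNumPartitions()); // 4: repartition performs a full shuffle
       sc.stop();
     }
   }
   ```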






[GitHub] [hudi] Zouxxyy commented on a diff in pull request #7872: [HUDI-5716] Cleaning up `Partitioner`s hierarchy

2023-02-14 Thread via GitHub


Zouxxyy commented on code in PR #7872:
URL: https://github.com/apache/hudi/pull/7872#discussion_r1105505809


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/SparkBulkInsertPartitionerBase.java:
##
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.table.BulkInsertPartitioner;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.HoodieUnsafeUtils$;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession$;
+
+public abstract class SparkBulkInsertPartitionerBase implements 
BulkInsertPartitioner {
+
+  protected static Dataset tryCoalesce(Dataset dataset, int 
targetPartitionNumHint) {
+// NOTE: In case incoming [[Dataset]]'s partition count matches the target 
one,
+//   we short-circuit coalescing altogether (since this isn't done by 
Spark itself)
+if (targetPartitionNumHint > 0 && targetPartitionNumHint != 
HoodieUnsafeUtils$.MODULE$.getNumPartitions(dataset)) {
+  return dataset.coalesce(targetPartitionNumHint);
+}
+
+return dataset;
+  }
+
+  protected static  JavaRDD> 
tryCoalesce(JavaRDD> records,
+int 
targetPartitionNumHint) {
+// NOTE: In case incoming [[RDD]]'s partition count matches the target one,
+//   we short-circuit coalescing altogether (since this isn't done by 
Spark itself)
+if (targetPartitionNumHint > 0 && targetPartitionNumHint != 
records.getNumPartitions()) {

Review Comment:
   When `targetPartitionNumHint` > `records.getNumPartitions()`, `coalesce` is 
meaningless; maybe we can use `targetPartitionNumHint < 
records.getNumPartitions()` as the condition.
   
   In addition, I actually have a question: if `targetPartitionNumHint > 
records.getNumPartitions()`, should we use `repartition` instead of `coalesce`?






[GitHub] [hudi] hudi-bot commented on pull request #7941: [HUDI-5786] Add a new config to specific spark write rdd storage level

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7941:
URL: https://github.com/apache/hudi/pull/7941#issuecomment-1429418042

   
   ## CI report:
   
   * 42e12ad1b6bebdcc3dc9d985e5be661b198f3f5c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] Zouxxyy commented on a diff in pull request #7872: [HUDI-5716] Cleaning up `Partitioner`s hierarchy

2023-02-14 Thread via GitHub


Zouxxyy commented on code in PR #7872:
URL: https://github.com/apache/hudi/pull/7872#discussion_r1105540207


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RowCustomColumnsSortPartitioner.java:
##
@@ -19,43 +19,70 @@
 package org.apache.hudi.execution.bulkinsert;
 
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.table.BulkInsertPartitioner;
-
+import org.apache.spark.sql.Column;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
+import scala.collection.JavaConverters;
 
 import java.util.Arrays;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+import static 
org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner.getOrderByColumnNames;
 
 /**
- * A partitioner that does sorting based on specified column values for each 
spark partitions.
+ * A partitioner that does local sorting for each RDD partition based on the 
tuple of
+ * values of the columns configured for ordering.
  */
-public class RowCustomColumnsSortPartitioner implements 
BulkInsertPartitioner> {
+public class RowCustomColumnsSortPartitioner extends 
RepartitioningBulkInsertPartitionerBase> {
 
-  private final String[] sortColumnNames;
+  private final String[] orderByColumnNames;
 
-  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config) {
-this.sortColumnNames = getSortColumnName(config);
+  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config, 
HoodieTableConfig tableConfig) {
+super(tableConfig);
+this.orderByColumnNames = getOrderByColumnNames(config);
+
+checkState(orderByColumnNames.length > 0);
   }
 
-  public RowCustomColumnsSortPartitioner(String[] columnNames) {
-this.sortColumnNames = columnNames;
+  public RowCustomColumnsSortPartitioner(String[] columnNames, 
HoodieTableConfig tableConfig) {
+super(tableConfig);
+this.orderByColumnNames = columnNames;
+
+checkState(orderByColumnNames.length > 0);
   }
 
   @Override
-  public Dataset repartitionRecords(Dataset records, int 
outputSparkPartitions) {
-final String[] sortColumns = this.sortColumnNames;
-return records.sort(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
sortColumns)
-.coalesce(outputSparkPartitions);
+  public Dataset repartitionRecords(Dataset dataset, int 
targetPartitionNumHint) {
+Dataset repartitionedDataset;
+
+// NOTE: In case of partitioned table even "global" ordering (across all 
RDD partitions) could
+//   not change table's partitioning and therefore there's no point in 
doing global sorting
+//   across "physical" partitions, and instead we can reduce total 
amount of data being
+//   shuffled by doing do "local" sorting:
+//  - First, re-partitioning dataset such that "logical" 
partitions are aligned w/
+//  "physical" ones
+//  - Sorting locally w/in RDD ("logical") partitions
+//
+//   Non-partitioned tables will be globally sorted.
+if (isPartitionedTable) {
+  repartitionedDataset = 
dataset.repartition(handleTargetPartitionNumHint(targetPartitionNumHint),
+  new Column(HoodieRecord.PARTITION_PATH_METADATA_FIELD));
+} else {
+  repartitionedDataset = tryCoalesce(dataset, targetPartitionNumHint);

Review Comment:
   In addition, I wonder if `coalesce` can meet our needs. For example, if we 
want to modify the FG containing N files into M (M>N) files (using cluster), 
shuffle needs to happen anyway.



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RowCustomColumnsSortPartitioner.java:
##
@@ -19,43 +19,70 @@
 package org.apache.hudi.execution.bulkinsert;
 
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.table.BulkInsertPartitioner;
-
+import org.apache.spark.sql.Column;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
+import scala.collection.JavaConverters;
 
 import java.util.Arrays;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+import static 
org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner.getOrderByColumnNames;
 
 /**
- * A partitioner that does sorting based on specified column values for each 
spark partitions.
+ * A partitioner that does local sorting for each RDD partition based on the 
tuple of
+ * values of the columns configured for ordering.
  */
-public class RowCustomColumnsSortPartitioner implements 
BulkInsertPartitioner> {
+public class RowCustomColumnsSortPartitioner extends 
RepartitioningBulkInsertPartitionerBase> {
 
-  private final String[] sortColumnNames;
+  private final String[] ord

[GitHub] [hudi] Zouxxyy commented on a diff in pull request #7872: [HUDI-5716] Cleaning up `Partitioner`s hierarchy

2023-02-14 Thread via GitHub


Zouxxyy commented on code in PR #7872:
URL: https://github.com/apache/hudi/pull/7872#discussion_r1105540267


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RowCustomColumnsSortPartitioner.java:
##
@@ -19,43 +19,70 @@
 package org.apache.hudi.execution.bulkinsert;
 
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.table.BulkInsertPartitioner;
-
+import org.apache.spark.sql.Column;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
+import scala.collection.JavaConverters;
 
 import java.util.Arrays;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+import static 
org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner.getOrderByColumnNames;
 
 /**
- * A partitioner that does sorting based on specified column values for each 
spark partitions.
+ * A partitioner that does local sorting for each RDD partition based on the 
tuple of
+ * values of the columns configured for ordering.
  */
-public class RowCustomColumnsSortPartitioner implements 
BulkInsertPartitioner> {
+public class RowCustomColumnsSortPartitioner extends 
RepartitioningBulkInsertPartitionerBase> {
 
-  private final String[] sortColumnNames;
+  private final String[] orderByColumnNames;
 
-  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config) {
-this.sortColumnNames = getSortColumnName(config);
+  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config, 
HoodieTableConfig tableConfig) {
+super(tableConfig);
+this.orderByColumnNames = getOrderByColumnNames(config);
+
+checkState(orderByColumnNames.length > 0);
   }
 
-  public RowCustomColumnsSortPartitioner(String[] columnNames) {
-this.sortColumnNames = columnNames;
+  public RowCustomColumnsSortPartitioner(String[] columnNames, 
HoodieTableConfig tableConfig) {
+super(tableConfig);
+this.orderByColumnNames = columnNames;
+
+checkState(orderByColumnNames.length > 0);
   }
 
   @Override
-  public Dataset repartitionRecords(Dataset records, int 
outputSparkPartitions) {
-final String[] sortColumns = this.sortColumnNames;
-return records.sort(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
sortColumns)
-.coalesce(outputSparkPartitions);
+  public Dataset repartitionRecords(Dataset dataset, int 
targetPartitionNumHint) {
+Dataset repartitionedDataset;
+
+// NOTE: In case of partitioned table even "global" ordering (across all 
RDD partitions) could
+//   not change table's partitioning and therefore there's no point in 
doing global sorting
+//   across "physical" partitions, and instead we can reduce total 
amount of data being
+//   shuffled by doing do "local" sorting:
+//  - First, re-partitioning dataset such that "logical" 
partitions are aligned w/
+//  "physical" ones
+//  - Sorting locally w/in RDD ("logical") partitions
+//
+//   Non-partitioned tables will be globally sorted.
+if (isPartitionedTable) {
+  repartitionedDataset = 
dataset.repartition(handleTargetPartitionNumHint(targetPartitionNumHint),
+  new Column(HoodieRecord.PARTITION_PATH_METADATA_FIELD));
+} else {
+  repartitionedDataset = tryCoalesce(dataset, targetPartitionNumHint);

Review Comment:
   In addition, I wonder if `coalesce` can meet our needs. For example, if we 
want to modify the FG containing N files into M (M>N) files (using `cluster`), 
shuffle needs to happen anyway.






[GitHub] [hudi] hudi-bot commented on pull request #7941: [HUDI-5786] Add a new config to specific spark write rdd storage level

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7941:
URL: https://github.com/apache/hudi/pull/7941#issuecomment-1429430076

   
   ## CI report:
   
   * 42e12ad1b6bebdcc3dc9d985e5be661b198f3f5c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15155)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] Zouxxyy commented on a diff in pull request #7872: [HUDI-5716] Cleaning up `Partitioner`s hierarchy

2023-02-14 Thread via GitHub


Zouxxyy commented on code in PR #7872:
URL: https://github.com/apache/hudi/pull/7872#discussion_r1105531548


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RowCustomColumnsSortPartitioner.java:
##
@@ -19,43 +19,70 @@
 package org.apache.hudi.execution.bulkinsert;
 
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.table.BulkInsertPartitioner;
-
+import org.apache.spark.sql.Column;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
+import scala.collection.JavaConverters;
 
 import java.util.Arrays;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+import static 
org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner.getOrderByColumnNames;
 
 /**
- * A partitioner that does sorting based on specified column values for each 
spark partitions.
+ * A partitioner that does local sorting for each RDD partition based on the 
tuple of
+ * values of the columns configured for ordering.
  */
-public class RowCustomColumnsSortPartitioner implements 
BulkInsertPartitioner> {
+public class RowCustomColumnsSortPartitioner extends 
RepartitioningBulkInsertPartitionerBase> {
 
-  private final String[] sortColumnNames;
+  private final String[] orderByColumnNames;
 
-  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config) {
-this.sortColumnNames = getSortColumnName(config);
+  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config, 
HoodieTableConfig tableConfig) {
+super(tableConfig);
+this.orderByColumnNames = getOrderByColumnNames(config);
+
+checkState(orderByColumnNames.length > 0);
   }
 
-  public RowCustomColumnsSortPartitioner(String[] columnNames) {
-this.sortColumnNames = columnNames;
+  public RowCustomColumnsSortPartitioner(String[] columnNames, 
HoodieTableConfig tableConfig) {
+super(tableConfig);
+this.orderByColumnNames = columnNames;
+
+checkState(orderByColumnNames.length > 0);
   }
 
   @Override
-  public Dataset repartitionRecords(Dataset records, int 
outputSparkPartitions) {
-final String[] sortColumns = this.sortColumnNames;
-return records.sort(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
sortColumns)
-.coalesce(outputSparkPartitions);
+  public Dataset repartitionRecords(Dataset dataset, int 
targetPartitionNumHint) {
+Dataset repartitionedDataset;
+
+// NOTE: In case of partitioned table even "global" ordering (across all 
RDD partitions) could
+//   not change table's partitioning and therefore there's no point in 
doing global sorting
+//   across "physical" partitions, and instead we can reduce total 
amount of data being
+//   shuffled by doing do "local" sorting:

Review Comment:
   As far as I know, RowCustomColumnsSortPartitioner will only be used in 
`cluster`. At this time, the files in the same `clusteringGroup` should already 
be in one physical partition.






[GitHub] [hudi] Zouxxyy commented on a diff in pull request #7872: [HUDI-5716] Cleaning up `Partitioner`s hierarchy

2023-02-14 Thread via GitHub


Zouxxyy commented on code in PR #7872:
URL: https://github.com/apache/hudi/pull/7872#discussion_r1105540267


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RowCustomColumnsSortPartitioner.java:
##
@@ -19,43 +19,70 @@
 package org.apache.hudi.execution.bulkinsert;
 
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.table.BulkInsertPartitioner;
-
+import org.apache.spark.sql.Column;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
+import scala.collection.JavaConverters;
 
 import java.util.Arrays;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+import static 
org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner.getOrderByColumnNames;
 
 /**
- * A partitioner that does sorting based on specified column values for each 
spark partitions.
+ * A partitioner that does local sorting for each RDD partition based on the 
tuple of
+ * values of the columns configured for ordering.
  */
-public class RowCustomColumnsSortPartitioner implements 
BulkInsertPartitioner> {
+public class RowCustomColumnsSortPartitioner extends 
RepartitioningBulkInsertPartitionerBase> {
 
-  private final String[] sortColumnNames;
+  private final String[] orderByColumnNames;
 
-  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config) {
-this.sortColumnNames = getSortColumnName(config);
+  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config, 
HoodieTableConfig tableConfig) {
+super(tableConfig);
+this.orderByColumnNames = getOrderByColumnNames(config);
+
+checkState(orderByColumnNames.length > 0);
   }
 
-  public RowCustomColumnsSortPartitioner(String[] columnNames) {
-this.sortColumnNames = columnNames;
+  public RowCustomColumnsSortPartitioner(String[] columnNames, 
HoodieTableConfig tableConfig) {
+super(tableConfig);
+this.orderByColumnNames = columnNames;
+
+checkState(orderByColumnNames.length > 0);
   }
 
   @Override
-  public Dataset repartitionRecords(Dataset records, int 
outputSparkPartitions) {
-final String[] sortColumns = this.sortColumnNames;
-return records.sort(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
sortColumns)
-.coalesce(outputSparkPartitions);
+  public Dataset repartitionRecords(Dataset dataset, int 
targetPartitionNumHint) {
+Dataset repartitionedDataset;
+
+// NOTE: In case of partitioned table even "global" ordering (across all 
RDD partitions) could
+//   not change table's partitioning and therefore there's no point in 
doing global sorting
+//   across "physical" partitions, and instead we can reduce total 
amount of data being
+//   shuffled by doing do "local" sorting:
+//  - First, re-partitioning dataset such that "logical" 
partitions are aligned w/
+//  "physical" ones
+//  - Sorting locally w/in RDD ("logical") partitions
+//
+//   Non-partitioned tables will be globally sorted.
+if (isPartitionedTable) {
+  repartitionedDataset = 
dataset.repartition(handleTargetPartitionNumHint(targetPartitionNumHint),
+  new Column(HoodieRecord.PARTITION_PATH_METADATA_FIELD));
+} else {
+  repartitionedDataset = tryCoalesce(dataset, targetPartitionNumHint);

Review Comment:
   In addition, I wonder if `coalesce` can meet our needs. For example, if we 
want to modify the `clusteringGroup` containing N partitions into M (M>N) files 
(using `cluster`), shuffle needs to happen anyway.






[GitHub] [hudi] Zouxxyy commented on a diff in pull request #7872: [HUDI-5716] Cleaning up `Partitioner`s hierarchy

2023-02-14 Thread via GitHub


Zouxxyy commented on code in PR #7872:
URL: https://github.com/apache/hudi/pull/7872#discussion_r1105540267


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RowCustomColumnsSortPartitioner.java:
##
@@ -19,43 +19,70 @@
 package org.apache.hudi.execution.bulkinsert;
 
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.table.BulkInsertPartitioner;
-
+import org.apache.spark.sql.Column;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
+import scala.collection.JavaConverters;
 
 import java.util.Arrays;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+import static 
org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner.getOrderByColumnNames;
 
 /**
- * A partitioner that does sorting based on specified column values for each 
spark partitions.
+ * A partitioner that does local sorting for each RDD partition based on the 
tuple of
+ * values of the columns configured for ordering.
  */
-public class RowCustomColumnsSortPartitioner implements 
BulkInsertPartitioner> {
+public class RowCustomColumnsSortPartitioner extends 
RepartitioningBulkInsertPartitionerBase> {
 
-  private final String[] sortColumnNames;
+  private final String[] orderByColumnNames;
 
-  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config) {
-this.sortColumnNames = getSortColumnName(config);
+  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config, 
HoodieTableConfig tableConfig) {
+super(tableConfig);
+this.orderByColumnNames = getOrderByColumnNames(config);
+
+checkState(orderByColumnNames.length > 0);
   }
 
-  public RowCustomColumnsSortPartitioner(String[] columnNames) {
-this.sortColumnNames = columnNames;
+  public RowCustomColumnsSortPartitioner(String[] columnNames, 
HoodieTableConfig tableConfig) {
+super(tableConfig);
+this.orderByColumnNames = columnNames;
+
+checkState(orderByColumnNames.length > 0);
   }
 
   @Override
-  public Dataset repartitionRecords(Dataset records, int 
outputSparkPartitions) {
-final String[] sortColumns = this.sortColumnNames;
-return records.sort(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
sortColumns)
-.coalesce(outputSparkPartitions);
+  public Dataset repartitionRecords(Dataset dataset, int 
targetPartitionNumHint) {
+Dataset repartitionedDataset;
+
+// NOTE: In case of partitioned table even "global" ordering (across all 
RDD partitions) could
+//   not change table's partitioning and therefore there's no point in 
doing global sorting
+//   across "physical" partitions, and instead we can reduce total 
amount of data being
+//   shuffled by doing do "local" sorting:
+//  - First, re-partitioning dataset such that "logical" 
partitions are aligned w/
+//  "physical" ones
+//  - Sorting locally w/in RDD ("logical") partitions
+//
+//   Non-partitioned tables will be globally sorted.
+if (isPartitionedTable) {
+  repartitionedDataset = 
dataset.repartition(handleTargetPartitionNumHint(targetPartitionNumHint),
+  new Column(HoodieRecord.PARTITION_PATH_METADATA_FIELD));
+} else {
+  repartitionedDataset = tryCoalesce(dataset, targetPartitionNumHint);

Review Comment:
   In addition, I wonder if `coalesce` can meet our needs. For example, if we 
want to modify the `clusteringGroup` containing N logical partitions into M 
(M>N) files (using `cluster`), shuffle needs to happen anyway.






[GitHub] [hudi] Zouxxyy commented on a diff in pull request #7872: [HUDI-5716] Cleaning up `Partitioner`s hierarchy

2023-02-14 Thread via GitHub


Zouxxyy commented on code in PR #7872:
URL: https://github.com/apache/hudi/pull/7872#discussion_r1105540267


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RowCustomColumnsSortPartitioner.java:
##
@@ -19,43 +19,70 @@
 package org.apache.hudi.execution.bulkinsert;
 
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.table.BulkInsertPartitioner;
-
+import org.apache.spark.sql.Column;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
+import scala.collection.JavaConverters;
 
 import java.util.Arrays;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+import static 
org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner.getOrderByColumnNames;
 
 /**
- * A partitioner that does sorting based on specified column values for each 
spark partitions.
+ * A partitioner that does local sorting for each RDD partition based on the 
tuple of
+ * values of the columns configured for ordering.
  */
-public class RowCustomColumnsSortPartitioner implements 
BulkInsertPartitioner> {
+public class RowCustomColumnsSortPartitioner extends 
RepartitioningBulkInsertPartitionerBase> {
 
-  private final String[] sortColumnNames;
+  private final String[] orderByColumnNames;
 
-  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config) {
-this.sortColumnNames = getSortColumnName(config);
+  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config, 
HoodieTableConfig tableConfig) {
+super(tableConfig);
+this.orderByColumnNames = getOrderByColumnNames(config);
+
+checkState(orderByColumnNames.length > 0);
   }
 
-  public RowCustomColumnsSortPartitioner(String[] columnNames) {
-this.sortColumnNames = columnNames;
+  public RowCustomColumnsSortPartitioner(String[] columnNames, 
HoodieTableConfig tableConfig) {
+super(tableConfig);
+this.orderByColumnNames = columnNames;
+
+checkState(orderByColumnNames.length > 0);
   }
 
   @Override
-  public Dataset repartitionRecords(Dataset records, int 
outputSparkPartitions) {
-final String[] sortColumns = this.sortColumnNames;
-return records.sort(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
sortColumns)
-.coalesce(outputSparkPartitions);
+  public Dataset repartitionRecords(Dataset dataset, int 
targetPartitionNumHint) {
+Dataset repartitionedDataset;
+
+// NOTE: In case of partitioned table even "global" ordering (across all 
RDD partitions) could
+//   not change table's partitioning and therefore there's no point in 
doing global sorting
+//   across "physical" partitions, and instead we can reduce total 
amount of data being
+//   shuffled by doing do "local" sorting:
+//  - First, re-partitioning dataset such that "logical" 
partitions are aligned w/
+//  "physical" ones
+//  - Sorting locally w/in RDD ("logical") partitions
+//
+//   Non-partitioned tables will be globally sorted.
+if (isPartitionedTable) {
+  repartitionedDataset = 
dataset.repartition(handleTargetPartitionNumHint(targetPartitionNumHint),
+  new Column(HoodieRecord.PARTITION_PATH_METADATA_FIELD));
+} else {
+  repartitionedDataset = tryCoalesce(dataset, targetPartitionNumHint);

Review Comment:
   In addition, I wonder if `coalesce` can meet our needs. For example, if we 
want to modify the `clusteringGroup` containing N logical partitions into M 
(M>N) (using `cluster`), shuffle needs to happen anyway.






[GitHub] [hudi] danny0405 commented on pull request #6121: [HUDI-4406] Support Flink compaction/clustering write error resolvement to avoid data loss

2023-02-14 Thread via GitHub


danny0405 commented on PR #6121:
URL: https://github.com/apache/hudi/pull/6121#issuecomment-1429471911

   The failed test case `TestHoodieTableFactory#testTableTypeCheck` is 
unrelated to this patch and passes when I run it locally; will merge the PR 
soon ~





[GitHub] [hudi] danny0405 merged pull request #6121: [HUDI-4406] Support Flink compaction/clustering write error resolvement to avoid data loss

2023-02-14 Thread via GitHub


danny0405 merged PR #6121:
URL: https://github.com/apache/hudi/pull/6121





[hudi] branch master updated: [HUDI-4406] Support Flink compaction/clustering write error resolvement to avoid data loss (#6121)

2023-02-14 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new ed6b7f6aedc [HUDI-4406] Support Flink compaction/clustering write 
error resolvement to avoid data loss (#6121)
ed6b7f6aedc is described below

commit ed6b7f6aedc2cba0f753a4ee130cef860ecb0801
Author: Chenshizhi <107476116+chens...@users.noreply.github.com>
AuthorDate: Tue Feb 14 18:15:18 2023 +0800

[HUDI-4406] Support Flink compaction/clustering write error resolvement to 
avoid data loss (#6121)
---
 .../main/java/org/apache/hudi/configuration/FlinkOptions.java |  6 +++---
 .../org/apache/hudi/sink/clustering/ClusteringCommitSink.java | 11 +++
 .../org/apache/hudi/sink/compact/CompactionCommitSink.java| 11 +++
 3 files changed, 25 insertions(+), 3 deletions(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
index e447692fc98..9cdeb963d53 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
@@ -382,9 +382,9 @@ public class FlinkOptions extends HoodieConfig {
   .key("write.ignore.failed")
   .booleanType()
   .defaultValue(false)
-  .withDescription("Flag to indicate whether to ignore any non exception 
error (e.g. writestatus error). within a checkpoint batch.\n"
-  + "By default false.  Turning this on, could hide the write status 
errors while the spark checkpoint moves ahead. \n"
-  + "  So, would recommend users to use this with caution.");
+  .withDescription("Flag to indicate whether to ignore any non exception 
error (e.g. writestatus error). within a checkpoint batch. \n"
+  + "By default false. Turning this on, could hide the write status 
errors while the flink checkpoint moves ahead. \n"
+  + "So, would recommend users to use this with caution.");
 
   public static final ConfigOption RECORD_KEY_FIELD = ConfigOptions
   .key(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key())
diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/ClusteringCommitSink.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/ClusteringCommitSink.java
index eb567d89f18..3f392de1527 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/ClusteringCommitSink.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/ClusteringCommitSink.java
@@ -147,6 +147,17 @@ public class ClusteringCommitSink extends 
CleanFunction {
 .flatMap(Collection::stream)
 .collect(Collectors.toList());
 
+long numErrorRecords = 
statuses.stream().map(WriteStatus::getTotalErrorRecords).reduce(Long::sum).orElse(0L);
+
+if (numErrorRecords > 0 && 
!this.conf.getBoolean(FlinkOptions.IGNORE_FAILED)) {
+  // handle failure case
+  LOG.error("Got {} error records during clustering of instant {},\n"
+  + "option '{}' is configured as false,"
+  + "rolls back the clustering", numErrorRecords, instant, 
FlinkOptions.IGNORE_FAILED.key());
+  ClusteringUtil.rollbackClustering(table, writeClient, instant);
+  return;
+}
+
 HoodieWriteMetadata> writeMetadata = new 
HoodieWriteMetadata<>();
 writeMetadata.setWriteStatuses(statuses);
 
writeMetadata.setWriteStats(statuses.stream().map(WriteStatus::getStat).collect(Collectors.toList()));
diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactionCommitSink.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactionCommitSink.java
index 1e05dce6076..0e9bc54f8fb 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactionCommitSink.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactionCommitSink.java
@@ -157,6 +157,17 @@ public class CompactionCommitSink extends 
CleanFunction {
 .flatMap(Collection::stream)
 .collect(Collectors.toList());
 
+long numErrorRecords = 
statuses.stream().map(WriteStatus::getTotalErrorRecords).reduce(Long::sum).orElse(0L);
+
+if (numErrorRecords > 0 && 
!this.conf.getBoolean(FlinkOptions.IGNORE_FAILED)) {
+  // handle failure case
+  LOG.error("Got {} error records during compaction of instant {},\n"
+  + "option '{}' is configured as false,"
+  + "rolls back the compaction", numErrorRecords, instant, 
FlinkOptions.IGNORE_FAILED.key());
+  CompactionUtil.rollbackCompaction(table, instant);
+  return;
+}
+
 HoodieCo

[jira] [Closed] (HUDI-4406) Support compaction commit write error resolvement to avoid data loss

2023-02-14 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-4406.

Resolution: Fixed

Fixed via master branch: ed6b7f6aedc2cba0f753a4ee130cef860ecb0801

> Support compaction commit write error resolvement to avoid data loss
> 
>
> Key: HUDI-4406
> URL: https://issues.apache.org/jira/browse/HUDI-4406
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.12.0
>Reporter: Shizhi Chen
>Assignee: Shizhi Chen
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.14.0
>
>
> Currently, the CompactionCommitSink commit/rollback logic doesn't take 
> write-status errors into consideration (it only considers a null write 
> status), which will actually cause data loss when compacting the delta commit 
> log files into the new versioned data files.
> e.g. org.apache.hudi.io.HoodieMergeHandle#writeRecord will lead to data loss 
> from log files due to exceptions.
> {code:java}
> ```java
>   protected boolean writeRecord(HoodieRecord hoodieRecord, 
> Option indexedRecord, boolean isDelete) {
>     Option recordMetadata = hoodieRecord.getData().getMetadata();
>     if (!partitionPath.equals(hoodieRecord.getPartitionPath())) {
>       HoodieUpsertException failureEx = new HoodieUpsertException("mismatched 
> partition path, record partition: "
>           + hoodieRecord.getPartitionPath() + " but trying to insert into 
> partition: " + partitionPath);
>       writeStatus.markFailure(hoodieRecord, failureEx, recordMetadata);
>       return false;
>     }
>     try {
>       if (indexedRecord.isPresent() && !isDelete) {
>         writeToFile(hoodieRecord.getKey(), (GenericRecord) 
> indexedRecord.get(), preserveMetadata && useWriterSchemaForCompaction);
>         recordsWritten++;
>       } else {
>         recordsDeleted++;
>       }
>       writeStatus.markSuccess(hoodieRecord, recordMetadata);
>       // deflate record payload after recording success. This will help users 
> access payload as a
>       // part of marking
>       // record successful.
>       hoodieRecord.deflate();
>       return true;
>     } catch (Exception e) {
>       LOG.error("Error writing record  " + hoodieRecord, e);
>       writeStatus.markFailure(hoodieRecord, e, recordMetadata);
>     }
>     return false;
>   }{code}
> It is known that StreamWriteOperatorCoordinator has a related commit/rollback
> handling process.
> So this PR will:
> a) Add write status errors as a rollback reason for CompactionCommitSink
> compaction rollback, to avoid data loss.
> b) Unify the handling procedure for the write commit policy with its
> implementations, as described in org.apache.hudi.commit.policy.WriteCommitPolicy,
> consolidated with that of StreamWriteOperatorCoordinator.
> c) Control whether data quality or ingestion stability takes higher priority
> through FlinkOptions#IGNORE_FAILED. We suggest that FlinkOptions#IGNORE_FAILED
> be true by default to avoid data loss.
> d) Optimize and fix some small bugs in log traces when committing on error or
> rolling back.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4406) Support compaction commit write error resolvement to avoid data loss

2023-02-14 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-4406:
-
Fix Version/s: 0.13.1
   0.14.0

> Support compaction commit write error resolvement to avoid data loss
> 
>
> Key: HUDI-4406
> URL: https://issues.apache.org/jira/browse/HUDI-4406
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.12.0
>Reporter: Shizhi Chen
>Assignee: Shizhi Chen
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.14.0
>
>
> Currently, the CompactionCommitSink commit/rollback logic does not take write
> status errors into consideration (it only checks for null write statuses), which
> can cause data loss when compacting the delta commit log files into the new
> versioned data files.
> For example, org.apache.hudi.io.HoodieMergeHandle#writeRecord can lead to data
> loss from log files due to exceptions:
> {code:java}
>   protected boolean writeRecord(HoodieRecord hoodieRecord, 
> Option indexedRecord, boolean isDelete) {
>     Option recordMetadata = hoodieRecord.getData().getMetadata();
>     if (!partitionPath.equals(hoodieRecord.getPartitionPath())) {
>       HoodieUpsertException failureEx = new HoodieUpsertException("mismatched 
> partition path, record partition: "
>           + hoodieRecord.getPartitionPath() + " but trying to insert into 
> partition: " + partitionPath);
>       writeStatus.markFailure(hoodieRecord, failureEx, recordMetadata);
>       return false;
>     }
>     try {
>       if (indexedRecord.isPresent() && !isDelete) {
>         writeToFile(hoodieRecord.getKey(), (GenericRecord) 
> indexedRecord.get(), preserveMetadata && useWriterSchemaForCompaction);
>         recordsWritten++;
>       } else {
>         recordsDeleted++;
>       }
>       writeStatus.markSuccess(hoodieRecord, recordMetadata);
>       // deflate record payload after recording success. This will help users 
> access payload as a
>       // part of marking
>       // record successful.
>       hoodieRecord.deflate();
>       return true;
>     } catch (Exception e) {
>       LOG.error("Error writing record  " + hoodieRecord, e);
>       writeStatus.markFailure(hoodieRecord, e, recordMetadata);
>     }
>     return false;
>   }{code}
> It is known that StreamWriteOperatorCoordinator has a related commit/rollback
> handling process.
> So this PR will:
> a) Add write status errors as a rollback reason for CompactionCommitSink
> compaction rollback, to avoid data loss.
> b) Unify the handling procedure for the write commit policy with its
> implementations, as described in org.apache.hudi.commit.policy.WriteCommitPolicy,
> consolidated with that of StreamWriteOperatorCoordinator.
> c) Control whether data quality or ingestion stability takes higher priority
> through FlinkOptions#IGNORE_FAILED. We suggest that FlinkOptions#IGNORE_FAILED
> be true by default to avoid data loss.
> d) Optimize and fix some small bugs in log traces when committing on error or
> rolling back.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] danny0405 commented on a diff in pull request #7940: [HUDI-5787] HoodieHiveCatalog should not delete data for dropping external table

2023-02-14 Thread via GitHub


danny0405 commented on code in PR #7940:
URL: https://github.com/apache/hudi/pull/7940#discussion_r1105583908


##
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java:
##
@@ -137,12 +137,12 @@ public void createTable(String tableName, MessageType 
storageSchema, String inpu
 
   if (!syncConfig.getBoolean(HIVE_CREATE_MANAGED_TABLE)) {
 newTb.putToParameters("EXTERNAL", "TRUE");
+newTb.setTableType(TableType.EXTERNAL_TABLE.toString());
   }
 
   for (Map.Entry entry : tableProperties.entrySet()) {
 newTb.putToParameters(entry.getKey(), entry.getValue());
   }
-  newTb.setTableType(TableType.EXTERNAL_TABLE.toString());
   client.createTable(newTb);

Review Comment:
   Nice catch! Can we add a test case for it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
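
The change being reviewed only marks the Hive table as external when managed-table creation is disabled, instead of always setting the external type. A minimal sketch of that conditional, using a stand-in for the Hive metastore `Table` API (the `TableLike` class and `createManagedTable` flag below are illustrative, not the real Hudi/Hive types):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for the Hive metastore Table API used in HMSDDLExecutor#createTable.
class TableLike {
  final Map<String, String> parameters = new HashMap<>();
  String tableType = "MANAGED_TABLE";

  void putToParameters(String key, String value) {
    parameters.put(key, value);
  }

  void setTableType(String tableType) {
    this.tableType = tableType;
  }
}

public class ExternalTableSketch {
  // Mirrors the reviewed change: only non-managed tables are marked external.
  static void applyTableType(TableLike newTb, boolean createManagedTable) {
    if (!createManagedTable) {
      newTb.putToParameters("EXTERNAL", "TRUE");
      newTb.setTableType("EXTERNAL_TABLE");
    }
  }

  public static void main(String[] args) {
    TableLike external = new TableLike();
    applyTableType(external, false);
    System.out.println(external.tableType); // EXTERNAL_TABLE

    TableLike managed = new TableLike();
    applyTableType(managed, true);
    System.out.println(managed.tableType);  // MANAGED_TABLE (unchanged)
  }
}
```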



[GitHub] [hudi] danny0405 commented on a diff in pull request #7940: [HUDI-5787] HoodieHiveCatalog should not delete data for dropping external table

2023-02-14 Thread via GitHub


danny0405 commented on code in PR #7940:
URL: https://github.com/apache/hudi/pull/7940#discussion_r1105584401


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieHiveCatalog.java:
##
@@ -656,12 +657,10 @@ public void dropTable(ObjectPath tablePath, boolean 
ignoreIfNotExists)
   client.dropTable(
   tablePath.getDatabaseName(),
   tablePath.getObjectName(),
-  // Indicate whether associated data should be deleted.
-  // Set to 'true' for now because Flink tables shouldn't have data in 
Hive. Can
-  // be changed later if necessary
-  true,
+  // External table drops only the metadata, should not delete the 
underlying data.
+  
!TableType.EXTERNAL_TABLE.name().equals(getHiveTable(tablePath).getTableType().toUpperCase(Locale.ROOT)),

Review Comment:
   Can we also add a test case in `TestHoodieHiveCatalog`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
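
The reviewed line derives the `deleteData` flag for the metastore drop call from the table type. A short sketch of that decision, assuming the table type is available as a plain string as returned by the Hive metastore client:

```java
import java.util.Locale;

public class DropTableSketch {
  // External tables keep their data on storage; only managed tables have data deleted on drop.
  static boolean shouldDeleteData(String tableType) {
    return !"EXTERNAL_TABLE".equals(tableType.toUpperCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    System.out.println(shouldDeleteData("EXTERNAL_TABLE")); // false -> keep data
    System.out.println(shouldDeleteData("MANAGED_TABLE"));  // true  -> delete data
  }
}
```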



[GitHub] [hudi] danny0405 commented on pull request #7894: [HUDI-5729] Fix RowDataKeyGen method getRecordKey

2023-02-14 Thread via GitHub


danny0405 commented on PR #7894:
URL: https://github.com/apache/hudi/pull/7894#issuecomment-1429484584

   There is a test failure: 
`TestRowDataKeyGen.testRecoredKeyContainsTimestamp:201`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on issue #7936: [SUPPORT]Flink HiveCatalog should respect 'managed_table' options to avoid deleting data unexpectable.

2023-02-14 Thread via GitHub


danny0405 commented on issue #7936:
URL: https://github.com/apache/hudi/issues/7936#issuecomment-1429487379

   I see a fix patch: https://github.com/apache/hudi/pull/7940


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 merged pull request #7633: [HUDI-5737] Fix Deletes issued without any prior commits

2023-02-14 Thread via GitHub


danny0405 merged PR #7633:
URL: https://github.com/apache/hudi/pull/7633


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated (ed6b7f6aedc -> 4f8f2d8dc5c)

2023-02-14 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from ed6b7f6aedc [HUDI-4406] Support Flink compaction/clustering write 
error resolvement to avoid data loss (#6121)
 add 4f8f2d8dc5c [HUDI-5737] Fix Deletes issued without any prior commits 
(#7633)

No new revisions were added by this update.

Summary of changes:
 .../src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java   | 2 +-
 .../hudi/client/functional/TestHoodieClientOnCopyOnWriteStorage.java  | 4 +---
 2 files changed, 2 insertions(+), 4 deletions(-)



[jira] [Closed] (HUDI-5737) Fix Deletes issued without any prior commits

2023-02-14 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-5737.

Resolution: Fixed

Fixed via master branch: 4f8f2d8dc5c449f70562f0a629a1765dbdac7235

> Fix Deletes issued without any prior commits
> 
>
> Key: HUDI-5737
> URL: https://issues.apache.org/jira/browse/HUDI-5737
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-5737) Fix Deletes issued without any prior commits

2023-02-14 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen reassigned HUDI-5737:


Assignee: Danny Chen

> Fix Deletes issued without any prior commits
> 
>
> Key: HUDI-5737
> URL: https://issues.apache.org/jira/browse/HUDI-5737
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] LinMingQiang commented on a diff in pull request #7907: [HUDI-5672][RFC-61] Lockless multi writer support

2023-02-14 Thread via GitHub


LinMingQiang commented on code in PR #7907:
URL: https://github.com/apache/hudi/pull/7907#discussion_r1101638612


##
rfc/rfc-61/rfc-61.md:
##
@@ -0,0 +1,98 @@
+# RFC-61: Lockless Multi Writer
+
+## Proposers
+- @danny0405
+- @ForwardXu
+- @SteNicholas
+
+## Approvers
+-
+
+## Status
+
+JIRA: [Lockless multi writer 
support](https://issues.apache.org/jira/browse/HUDI-5672)
+
+## Abstract
+As you know, Hudi already supports basic OCC with abundant lock providers.
+But for multiple streaming ingestion writers, OCC does not work well because the conflicts happen at a very high frequency.
+To expand on that a little: with the hashing index, all the writers use a deterministic hashing algorithm for distributing the records by primary keys,
+so all the keys are evenly distributed across the data buckets. For a single data flush in one writer, almost all the data buckets are appended with new inputs,
+so conflicts are very likely to happen with multiple writers because almost all the data buckets are being written by multiple writers at the same time;
+For the bloom filter index, things are different, but remember that we have a small-file load rebalance strategy that writes into the **small** buckets with higher priority,
+which means multiple writers are prone to write into the same **small** buckets at the same time; that is how conflicts happen.
+
+In general, for multiple streaming writers' ingestion, an explicit lock is not really suitable for production. In this RFC, we propose a lockless solution for streaming ingestion.
+
+## Background
+
+Streaming jobs are naturally suitable for data ingestion: they have no pipeline orchestration complexity and a smoother write workload.
+Most of the raw data sets we are handling today are generated continuously in a streaming way.
+
+Based on that, many requests for multiple writers' ingestion have been raised. With multi-writer ingestion, several streaming events with the same schema can be drained into one Hudi table,
+so the Hudi table effectively becomes a UNION table view over all the input data sets. This is a very common use case because, in reality, the data sets are usually scattered all over the data sources.
+
+Another very useful use case we want to unlock is the real-time data set join. One of the biggest pain points in streaming computation is the data set join:
+an engine like Flink has basic support for all kinds of SQL JOINs, but it stores the input records within its internal state backend, which is a huge cost for a pure data join with no additional computation.
+In [HUDI-3304](https://issues.apache.org/jira/browse/HUDI-3304), we introduced a `PartialUpdateAvroPayload`; in combination with the lockless multi-writer,
+we can implement N-way data source joins in real time! Hudi would take care of the payload join during the compaction service procedure.
+
+## Design
+
+### The Precondition
+
+ MOR Table Type Is Required
+
+The table type must be `MERGE_ON_READ`, so that we can defer the conflict 
resolution to the compaction phase. The compaction service would resolve the 
conflicts of the same keys by respecting the event time sequence of the events.
+
+ Deterministic Bucketing Strategy
+
+A deterministic bucketing strategy is required, because the same record keys from different writers need to be distributed into the same bucket, not only for UPSERTs but also for all new INSERTs.

Review Comment:
   If we are using a MOR table but not a bucket index layout, can we support 
lockless multi writer for INSERT? Will there be any problem?
   It will cause multiple writers to write to the same file and cause conflicts.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #7907: [HUDI-5672][RFC-61] Lockless multi writer support

2023-02-14 Thread via GitHub


danny0405 commented on code in PR #7907:
URL: https://github.com/apache/hudi/pull/7907#discussion_r1105598730


##
rfc/rfc-61/rfc-61.md:
##
@@ -0,0 +1,98 @@
+# RFC-61: Lockless Multi Writer
+
+## Proposers
+- @danny0405
+- @ForwardXu
+- @SteNicholas
+
+## Approvers
+-
+
+## Status
+
+JIRA: [Lockless multi writer 
support](https://issues.apache.org/jira/browse/HUDI-5672)
+
+## Abstract
+As you know, Hudi already supports basic OCC with abundant lock providers.
+But for multiple streaming ingestion writers, OCC does not work well because the conflicts happen at a very high frequency.
+To expand on that a little: with the hashing index, all the writers use a deterministic hashing algorithm for distributing the records by primary keys,
+so all the keys are evenly distributed across the data buckets. For a single data flush in one writer, almost all the data buckets are appended with new inputs,
+so conflicts are very likely to happen with multiple writers because almost all the data buckets are being written by multiple writers at the same time;
+For the bloom filter index, things are different, but remember that we have a small-file load rebalance strategy that writes into the **small** buckets with higher priority,
+which means multiple writers are prone to write into the same **small** buckets at the same time; that is how conflicts happen.
+
+In general, for multiple streaming writers' ingestion, an explicit lock is not really suitable for production. In this RFC, we propose a lockless solution for streaming ingestion.
+
+## Background
+
+Streaming jobs are naturally suitable for data ingestion: they have no pipeline orchestration complexity and a smoother write workload.
+Most of the raw data sets we are handling today are generated continuously in a streaming way.
+
+Based on that, many requests for multiple writers' ingestion have been raised. With multi-writer ingestion, several streaming events with the same schema can be drained into one Hudi table,
+so the Hudi table effectively becomes a UNION table view over all the input data sets. This is a very common use case because, in reality, the data sets are usually scattered all over the data sources.
+
+Another very useful use case we want to unlock is the real-time data set join. One of the biggest pain points in streaming computation is the data set join:
+an engine like Flink has basic support for all kinds of SQL JOINs, but it stores the input records within its internal state backend, which is a huge cost for a pure data join with no additional computation.
+In [HUDI-3304](https://issues.apache.org/jira/browse/HUDI-3304), we introduced a `PartialUpdateAvroPayload`; in combination with the lockless multi-writer,
+we can implement N-way data source joins in real time! Hudi would take care of the payload join during the compaction service procedure.
+
+## Design
+
+### The Precondition
+
+ MOR Table Type Is Required
+
+The table type must be `MERGE_ON_READ`, so that we can defer the conflict 
resolution to the compaction phase. The compaction service would resolve the 
conflicts of the same keys by respecting the event time sequence of the events.
+
+ Deterministic Bucketing Strategy
+
+A deterministic bucketing strategy is required, because the same record keys from different writers need to be distributed into the same bucket, not only for UPSERTs but also for all new INSERTs.

Review Comment:
   Yeah, you guys got it; this is just to ensure that all the changes for one record key happen in one FileGroup.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
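
A deterministic bucketing scheme is what makes the conflict resolution deferrable: every writer maps a given record key to the same bucket/file group. A minimal sketch of a hash-based bucket assignment (illustrative only; Hudi's bucket index uses its own hashing implementation):

```java
public class BucketAssignSketch {
  // Maps a record key to one of numBuckets buckets; every writer that uses the same
  // function and bucket count routes the same key to the same bucket/file group.
  static int bucketId(String recordKey, int numBuckets) {
    return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  public static void main(String[] args) {
    int numBuckets = 8;
    // Two independent writers computing the bucket for the same key agree on the result.
    System.out.println(bucketId("uuid-123", numBuckets));
    System.out.println(bucketId("uuid-123", numBuckets)); // same bucket id
  }
}
```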



[GitHub] [hudi] danny0405 commented on a diff in pull request #7886: [HUDI-5726]Fix timestamp field is 8 hours longer than the time

2023-02-14 Thread via GitHub


danny0405 commented on code in PR #7886:
URL: https://github.com/apache/hudi/pull/7886#discussion_r1105601355


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -614,6 +614,14 @@ public class HoodieWriteConfig extends HoodieConfig {
   .withDocumentation("Whether to enable commit conflict checking or not 
during early "
   + "conflict detection.");
 
+  public static final ConfigProperty WRITE_BULK_INSERT_UTC_TIMEZONE = 
ConfigProperty
+  .key("hoodie.bulk_insert.write.utc-timezone")
+  .defaultValue(false)
+  .sinceVersion("0.14.0")
+  .withDocumentation("Whether to use UTC timezone to the conversion 
between epoch for bulk insert tasks.\n"
+  + "By default false use local timezone  ");
+

Review Comment:
   We should not add this in `HoodieWriteConfig`; this is a Flink-specific config option.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
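
The config under review toggles whether the epoch-to-timestamp conversion for bulk insert uses UTC or the local timezone. A small sketch of the difference, using only `java.time` and independent of the Hudi code paths:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class EpochTimezoneSketch {
  // Converts epoch millis to a wall-clock timestamp either in UTC or in the JVM's local timezone.
  static LocalDateTime toTimestamp(long epochMillis, boolean utcTimezone) {
    ZoneId zone = utcTimezone ? ZoneOffset.UTC : ZoneId.systemDefault();
    return LocalDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), zone);
  }

  public static void main(String[] args) {
    long epochMillis = 1_676_350_000_000L;
    // On a machine in UTC+8 the two results differ by eight hours, which is the
    // symptom described in the PR title.
    System.out.println(toTimestamp(epochMillis, true));
    System.out.println(toTimestamp(epochMillis, false));
  }
}
```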



[GitHub] [hudi] GallonREX commented on issue #7925: [SUPPORT]hudi 0.8 upgrade to hudi 0.12 report java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

2023-02-14 Thread via GitHub


GallonREX commented on issue #7925:
URL: https://github.com/apache/hudi/issues/7925#issuecomment-1429499069

   With the same data:
   - Using only Hudi 0.12, multi-writer succeeds.
   - Using only Hudi 0.8, multi-writer succeeds.
   - Using Hudi 0.12 to single-write to the existing Hudi 0.8 table succeeds.
   - But using Hudi 0.12 to multi-write to the existing Hudi 0.8 table fails.
   Part of the same partition data is written by multiple writers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7931: [HUDI-5773] Support archive command for spark sql

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7931:
URL: https://github.com/apache/hudi/pull/7931#issuecomment-1429505947

   
   ## CI report:
   
   * dc75fbc2066961664e2348703fcfb16b3281cba9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15151)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7933: [HUDI-5774] Support for adding labels to prometheus metrics

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7933:
URL: https://github.com/apache/hudi/pull/7933#issuecomment-1429506010

   
   ## CI report:
   
   * 62dd45b5f189526ba0595abb566c38ed3bcadad6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15140)
 
   * a02b393674ed4ae07d1eed67560f126ac06e178c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #7752: [MINOR] De-duplicating Iterator implementations

2023-02-14 Thread via GitHub


danny0405 commented on code in PR #7752:
URL: https://github.com/apache/hudi/pull/7752#discussion_r1105614300


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java:
##
@@ -807,16 +809,17 @@ private ClosableIteratorWithSchema 
getRecordsIterator(
 
 Option, Schema>> 
schemaEvolutionTransformerOpt =
 composeEvolvedSchemaTransformer(dataBlock);
+
 // In case when schema has been evolved original persisted records will 
have to be
 // transformed to adhere to the new schema
-if (schemaEvolutionTransformerOpt.isPresent()) {
-  return ClosableIteratorWithSchema.newInstance(
-  new CloseableMappingIterator<>(blockRecordsIterator,
-  schemaEvolutionTransformerOpt.get().getLeft()),
-  schemaEvolutionTransformerOpt.get().getRight());
-} else {
-  return ClosableIteratorWithSchema.newInstance(blockRecordsIterator, 
dataBlock.getSchema());
-}
+Function transformer =
+schemaEvolutionTransformerOpt.map(Pair::getLeft)
+.orElse(Function.identity());
+
+Schema schema = schemaEvolutionTransformerOpt.map(Pair::getRight)
+.orElse(dataBlock.getSchema());
+

Review Comment:
   The code looks clean, but for the case when `schemaEvolutionTransformerOpt.isPresent()` is false there is an unnecessary invocation of `Function.identity()`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
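
If deferring the fallback computation matters, `orElseGet` evaluates the default lazily while `orElse` always builds it. A tiny sketch with `java.util.Optional` (assuming Hudi's `Option` offers a similar pair of methods):

```java
import java.util.Optional;
import java.util.function.Function;

public class LazyFallbackSketch {
  public static void main(String[] args) {
    Optional<Function<String, String>> transformerOpt = Optional.empty();

    // orElse always evaluates the fallback expression; orElseGet only does so when the Optional is empty.
    Function<String, String> eager = transformerOpt.orElse(Function.identity());
    Function<String, String> lazy = transformerOpt.orElseGet(Function::identity);

    System.out.println(eager.apply("record"));
    System.out.println(lazy.apply("record"));
  }
}
```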



[GitHub] [hudi] hudi-bot commented on pull request #7933: [HUDI-5774] Support for adding labels to prometheus metrics

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7933:
URL: https://github.com/apache/hudi/pull/7933#issuecomment-1429515776

   
   ## CI report:
   
   * 62dd45b5f189526ba0595abb566c38ed3bcadad6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15140)
 
   * a02b393674ed4ae07d1eed67560f126ac06e178c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15158)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #7752: [MINOR] De-duplicating Iterator implementations

2023-02-14 Thread via GitHub


danny0405 commented on code in PR #7752:
URL: https://github.com/apache/hudi/pull/7752#discussion_r1105616902


##
hudi-common/src/main/java/org/apache/hudi/common/util/collection/CloseableMappingIterator.java:
##
@@ -22,8 +22,8 @@
 
 import java.util.function.Function;
 
-// TODO java-doc
-public class CloseableMappingIterator extends MappingIterator 
implements ClosableIterator {
+public class CloseableMappingIterator extends MappingIterator

Review Comment:
   Can we add some documentation for this `CloseableMappingIterator` class?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
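
For reference, a self-contained sketch of what a documented closeable mapping iterator can look like: it wraps a source iterator, applies a mapping function per element, and releases the source's resources on close. This is illustrative only, not the Hudi class itself:

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/**
 * Iterator decorator that lazily maps each element of a source iterator with the
 * supplied function and releases the source's resources when closed.
 */
class CloseableMappingIteratorSketch<I, O> implements Iterator<O>, AutoCloseable {
  private final Iterator<I> source;
  private final Function<I, O> mapper;

  CloseableMappingIteratorSketch(Iterator<I> source, Function<I, O> mapper) {
    this.source = source;
    this.mapper = mapper;
  }

  @Override
  public boolean hasNext() {
    return source.hasNext();
  }

  @Override
  public O next() {
    return mapper.apply(source.next());
  }

  @Override
  public void close() {
    // Forward close() when the source iterator itself holds a closable resource.
    if (source instanceof AutoCloseable) {
      try {
        ((AutoCloseable) source).close();
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }
  }

  public static void main(String[] args) {
    CloseableMappingIteratorSketch<Integer, String> it =
        new CloseableMappingIteratorSketch<>(List.of(1, 2, 3).iterator(), i -> "record-" + i);
    it.forEachRemaining(System.out::println);
    it.close();
  }
}
```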



[GitHub] [hudi] danny0405 commented on a diff in pull request #7752: [MINOR] De-duplicating Iterator implementations

2023-02-14 Thread via GitHub


danny0405 commented on code in PR #7752:
URL: https://github.com/apache/hudi/pull/7752#discussion_r1105619086


##
hudi-common/src/main/java/org/apache/hudi/common/util/ClosableIterator.java:
##
@@ -24,8 +24,29 @@
  * An iterator that give a chance to release resources.
  *
  * @param  The return type
+ *
+ * TODO move under common.util.collection
  */

Review Comment:
   Not a big fan of `TODO`s; people rarely pay attention and fix them. Just my personal opinion: either fix it in this PR or create a JIRA issue to track it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] XuQianJin-Stars merged pull request #7931: [HUDI-5773] Support archive command for spark sql

2023-02-14 Thread via GitHub


XuQianJin-Stars merged PR #7931:
URL: https://github.com/apache/hudi/pull/7931


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [HUDI-5773] Support archive command for spark sql (#7931)

2023-02-14 Thread forwardxu
This is an automated email from the ASF dual-hosted git repository.

forwardxu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 87b97369b68 [HUDI-5773] Support archive command for spark sql (#7931)
87b97369b68 is described below

commit 87b97369b68913ee312db12d373be72f40c93135
Author: ForwardXu 
AuthorDate: Tue Feb 14 19:14:48 2023 +0800

[HUDI-5773] Support archive command for spark sql (#7931)
---
 .../org/apache/hudi/cli/ArchiveExecutorUtils.java  | 69 +++
 .../procedures/ArchiveCommitsProcedure.scala   | 79 ++
 .../hudi/command/procedures/HoodieProcedures.scala |  1 +
 .../procedure/TestArchiveCommitsProcedure.scala| 71 +++
 4 files changed, 220 insertions(+)

diff --git 
a/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/cli/ArchiveExecutorUtils.java
 
b/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/cli/ArchiveExecutorUtils.java
new file mode 100644
index 000..0a23c811a01
--- /dev/null
+++ 
b/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/cli/ArchiveExecutorUtils.java
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli;
+
+import org.apache.hudi.client.HoodieTimelineArchiver;
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.model.HoodieAvroPayload;
+import org.apache.hudi.config.HoodieArchivalConfig;
+import org.apache.hudi.config.HoodieCleanConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieSparkTable;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.IOException;
+
+/**
+ * Archive Utils.
+ */
+public final class ArchiveExecutorUtils {
+  private static final Logger LOG = 
LogManager.getLogger(ArchiveExecutorUtils.class);
+
+  private ArchiveExecutorUtils() {
+  }
+
+  public static int archive(JavaSparkContext jsc,
+   int minCommits,
+   int maxCommits,
+   int commitsRetained,
+   boolean enableMetadata,
+   String basePath) {
+HoodieWriteConfig config = 
HoodieWriteConfig.newBuilder().withPath(basePath)
+
.withArchivalConfig(HoodieArchivalConfig.newBuilder().archiveCommitsWith(minCommits,
 maxCommits).build())
+
.withCleanConfig(HoodieCleanConfig.newBuilder().retainCommits(commitsRetained).build())
+.withEmbeddedTimelineServerEnabled(false)
+
.withMetadataConfig(HoodieMetadataConfig.newBuilder().enable(enableMetadata).build())
+.build();
+HoodieEngineContext context = new HoodieSparkEngineContext(jsc);
+HoodieSparkTable table = 
HoodieSparkTable.create(config, context);
+try {
+  HoodieTimelineArchiver archiver = new HoodieTimelineArchiver(config, 
table);
+  archiver.archiveIfRequired(context, true);
+} catch (IOException ioe) {
+  LOG.error("Failed to archive with IOException: " + ioe);
+  return -1;
+}
+return 0;
+  }
+}
diff --git 
a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ArchiveCommitsProcedure.scala
 
b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ArchiveCommitsProcedure.scala
new file mode 100644
index 000..b097c942ad2
--- /dev/null
+++ 
b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ArchiveCommitsProcedure.scala
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICE
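
The `ArchiveExecutorUtils.archive` signature added by this commit is shown in full above, so a hypothetical caller might look like the following; the base path and retention values are placeholders for illustration:

```java
import org.apache.hudi.cli.ArchiveExecutorUtils;
import org.apache.spark.api.java.JavaSparkContext;

public class ArchiveCallerSketch {
  public static int runArchive(JavaSparkContext jsc) {
    int minCommits = 20;
    int maxCommits = 30;
    int commitsRetained = 10;
    boolean enableMetadata = false;
    String basePath = "file:///tmp/hudi/trips_table"; // placeholder table path
    // Returns 0 on success, -1 when archiving fails with an IOException.
    return ArchiveExecutorUtils.archive(jsc, minCommits, maxCommits, commitsRetained, enableMetadata, basePath);
  }
}
```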

[GitHub] [hudi] hudi-bot commented on pull request #7918: [MINOR] Fix spark sql run clean do not exit

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7918:
URL: https://github.com/apache/hudi/pull/7918#issuecomment-1429581418

   
   ## CI report:
   
   * c98ed3e5c65e10a42964f0c917b575cae25cc2b2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15152)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7935: [DRAFT] Add maven-build-cache-extension

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7935:
URL: https://github.com/apache/hudi/pull/7935#issuecomment-1429594341

   
   ## CI report:
   
   * 28a41cd538a50171538b8898adcc799628f4fd60 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15144)
 
   * 2cdbfce41efe237cd5a99e4ca69a3b5272dba1d8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] lokeshj1703 commented on issue #7877: [SUPPORT] Hudi examples: An exception or error caused a run to abort: org.apache.hadoop.security.HadoopKerberosName.setRuleMechanism

2023-02-14 Thread via GitHub


lokeshj1703 commented on issue #7877:
URL: https://github.com/apache/hudi/issues/7877#issuecomment-1429604354

   @damiano1996 you need to add the dependency below to your build.
   
   ```
 "org.apache.spark" %% "spark-hive" % sparkVersion % Provided,
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7918: [MINOR] Fix spark sql run clean do not exit

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7918:
URL: https://github.com/apache/hudi/pull/7918#issuecomment-1429605231

   
   ## CI report:
   
   * c98ed3e5c65e10a42964f0c917b575cae25cc2b2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15152)
 
   * f694a549ea265813f05767d69269fda2bb1ef279 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7935: [DRAFT] Add maven-build-cache-extension

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7935:
URL: https://github.com/apache/hudi/pull/7935#issuecomment-1429605393

   
   ## CI report:
   
   * 28a41cd538a50171538b8898adcc799628f4fd60 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15144)
 
   * 2cdbfce41efe237cd5a99e4ca69a3b5272dba1d8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15160)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7918: [MINOR] Fix spark sql run clean do not exit

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7918:
URL: https://github.com/apache/hudi/pull/7918#issuecomment-1429617131

   
   ## CI report:
   
   * c98ed3e5c65e10a42964f0c917b575cae25cc2b2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15152)
 
   * f694a549ea265813f05767d69269fda2bb1ef279 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15161)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] codope opened a new pull request, #7942: [HUDI-5753] Add docs for record payload

2023-02-14 Thread via GitHub


codope opened a new pull request, #7942:
URL: https://github.com/apache/hudi/pull/7942

   ### Change Logs
   
   Documentation about record payload under `Concepts` section.
   
   ### Impact
   
   Only docs.
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   Adds a new page under `Concepts` section of Hudi website.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5753) Add feature docs for Record Payload

2023-02-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5753:
-
Labels: pull-request-available  (was: )

> Add feature docs for Record Payload
> ---
>
> Key: HUDI-5753
> URL: https://issues.apache.org/jira/browse/HUDI-5753
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5753) Add feature docs for Record Payload

2023-02-14 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-5753:
--
Status: In Progress  (was: Open)

> Add feature docs for Record Payload
> ---
>
> Key: HUDI-5753
> URL: https://issues.apache.org/jira/browse/HUDI-5753
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #7938: [HUDI-5785] Enhance Spark Datasource tests

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7938:
URL: https://github.com/apache/hudi/pull/7938#issuecomment-1429676630

   
   ## CI report:
   
   * 4e7da703304c7783e9771e931e39854adf6458d6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15153)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7940: [HUDI-5787] HoodieHiveCatalog should not delete data for dropping external table

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7940:
URL: https://github.com/apache/hudi/pull/7940#issuecomment-1429751974

   
   ## CI report:
   
   * 5e9308f176e728950ddfe41a931b54eae4e6f40a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15154)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7941: [HUDI-5786] Add a new config to specific spark write rdd storage level

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7941:
URL: https://github.com/apache/hudi/pull/7941#issuecomment-1429771976

   
   ## CI report:
   
   * 42e12ad1b6bebdcc3dc9d985e5be661b198f3f5c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15155)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] stathismar commented on pull request #7687: [HUDI-5606] Update to handle deletes in postgres debezium

2023-02-14 Thread via GitHub


stathismar commented on PR #7687:
URL: https://github.com/apache/hudi/pull/7687#issuecomment-1429799215

   Hello. I'm interested in this fix (to my understanding, the Debezium Postgres integration is broken right now, since `DELETES` are not being handled/tracked). I'm wondering why this fix does not target `apache:master` and instead targets [apache:release-0.12.1](https://github.com/apache/hudi/tree/release-0.12.1).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7940: [HUDI-5787] HoodieHiveCatalog should not delete data for dropping external table

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7940:
URL: https://github.com/apache/hudi/pull/7940#issuecomment-1429854410

   
   ## CI report:
   
   * 5e9308f176e728950ddfe41a931b54eae4e6f40a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15154)
 
   * 178389c4d898860a1794547ac30b28636df957dd UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7940: [HUDI-5787] HoodieHiveCatalog should not delete data for dropping external table

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7940:
URL: https://github.com/apache/hudi/pull/7940#issuecomment-1429867437

   
   ## CI report:
   
   * 5e9308f176e728950ddfe41a931b54eae4e6f40a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15154)
 
   * 178389c4d898860a1794547ac30b28636df957dd Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15162)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5747) 0.13.0 release note part 5

2023-02-14 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-5747:
--
Status: In Progress  (was: Open)

> 0.13.0 release note part 5
> --
>
> Key: HUDI-5747
> URL: https://issues.apache.org/jira/browse/HUDI-5747
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Release Highlights -> New Source Support in Deltastreamer
> Release Highlights -> Support for Partial Payload Update



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] jonvex commented on a diff in pull request #7881: [HUDI-5723] Automate and standardize enum configs

2023-02-14 Thread via GitHub


jonvex commented on code in PR #7881:
URL: https://github.com/apache/hudi/pull/7881#discussion_r1106001981


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/bootstrap/BootstrapMode.java:
##
@@ -18,18 +18,27 @@
 
 package org.apache.hudi.client.bootstrap;
 
+import org.apache.hudi.common.config.EnumDefault;
+import org.apache.hudi.common.config.EnumDescription;
+import org.apache.hudi.common.config.EnumFieldDescription;
+
 /**
  * Identifies different types of bootstrap.
  */
+@EnumDescription("Bootstrap types")

Review Comment:
   Need more explanation on what bootstrap is



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
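
The PR introduces annotations such as `@EnumDescription` and `@EnumFieldDescription` (visible in the imports of the quoted hunks) so documentation for config enums can be generated automatically. A sketch of how an annotated enum might look; the annotation targets are an assumption based on their names and the quoted usage, and the constants below are invented for illustration:

```java
import org.apache.hudi.common.config.EnumDescription;
import org.apache.hudi.common.config.EnumFieldDescription;

// Illustrative enum only; BootstrapMode in the PR carries real descriptions of its modes.
@EnumDescription("Controls how an example feature behaves.")
public enum ExampleModeSketch {

  @EnumFieldDescription("Default behavior: do nothing special.")
  NONE,

  @EnumFieldDescription("Alternative behavior enabled explicitly by the user.")
  AGGRESSIVE
}
```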



[GitHub] [hudi] jonvex commented on a diff in pull request #7881: [HUDI-5723] Automate and standardize enum configs

2023-02-14 Thread via GitHub


jonvex commented on code in PR #7881:
URL: https://github.com/apache/hudi/pull/7881#discussion_r1106005114


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##
@@ -692,9 +687,17 @@ public static SpatialCurveCompositionStrategyType 
fromValue(String value) {
   /**
* Layout optimization strategies such as Z-order/Hilbert space-curves, etc
*/
+  @EnumDescription("Determines ordering strategy for records layout 
optimization")
   public enum LayoutOptimizationStrategy {

Review Comment:
   I don't update the actual property because it requires "linear, z-order or 
hilbert" instead of the enum names LINEAR, ZORDER, HILBERT



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] BalaMahesh commented on pull request #7687: [HUDI-5606] Update to handle deletes in postgres debezium

2023-02-14 Thread via GitHub


BalaMahesh commented on PR #7687:
URL: https://github.com/apache/hudi/pull/7687#issuecomment-1429961376

   > Hello. I'm interested in this fix (to my understanding, the Debezium Postgres integration is broken right now, since `DELETES` are not being handled/tracked). I'm wondering why this fix does not target `apache:master` and instead targets [apache:release-0.12.1](https://github.com/apache/hudi/tree/release-0.12.1).
   
   We are using this fix in production from the 0.12.1 branch; porting the same PR from 0.12.1 to master involves a lot of changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #7881: [HUDI-5723] Automate and standardize enum configs

2023-02-14 Thread via GitHub


jonvex commented on code in PR #7881:
URL: https://github.com/apache/hudi/pull/7881#discussion_r1106006641


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##
@@ -74,11 +74,10 @@ public class HoodieIndexConfig extends HoodieConfig {
   .noDefaultValue()

Review Comment:
   Don't know what to do for valid values because INMEMORY isn't allowed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #7881: [HUDI-5723] Automate and standardize enum configs

2023-02-14 Thread via GitHub


jonvex commented on code in PR #7881:
URL: https://github.com/apache/hudi/pull/7881#discussion_r1106007724


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -187,16 +178,18 @@ public class HoodieWriteConfig extends HoodieConfig {
 
   public static final ConfigProperty TIMELINE_LAYOUT_VERSION_NUM = 
ConfigProperty
   .key("hoodie.timeline.layout.version")
-  .defaultValue(Integer.toString(TimelineLayoutVersion.VERSION_1))
+  .defaultValue(Integer.toString(TimelineLayoutVersion.CURR_VERSION))
+  
.withValidValues(Integer.toString(TimelineLayoutVersion.VERSION_0),Integer.toString(TimelineLayoutVersion.VERSION_1))
   .sinceVersion("0.5.1")
   .withDocumentation("Controls the layout of the timeline. Version 0 
relied on renames, Version 1 (default) models "
   + "the timeline as an immutable log relying only on atomic writes 
for object storage.");
 
   public static final ConfigProperty BASE_FILE_FORMAT = 
ConfigProperty
   .key("hoodie.table.base.file.format")
-  .defaultValue(HoodieFileFormat.PARQUET)
-  .withAlternatives("hoodie.table.ro.file.format")
-  .withDocumentation("Base file format to store all the base file data.");
+  .defaultValue(getEnumDefault(HoodieFileFormat.class))
+  .withValidValues("PARQUET", "ORC", "HFILE")

Review Comment:
   The enum includes a non-base-file format, so the valid values needed to be listed manually.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
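
The quoted hunk shows the `ConfigProperty` builder pattern with `withValidValues`. A hedged sketch of a config definition following that pattern; the key, default, and values here are invented for illustration and are not real Hudi configs:

```java
import org.apache.hudi.common.config.ConfigProperty;

public class ExampleConfigSketch {
  // Illustrative config only; mirrors the builder calls visible in the quoted diff.
  public static final ConfigProperty<String> EXAMPLE_FILE_FORMAT = ConfigProperty
      .key("hoodie.example.file.format")
      .defaultValue("PARQUET")
      .withValidValues("PARQUET", "ORC", "HFILE")
      .withDocumentation("File format used by this illustrative example.");
}
```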



[GitHub] [hudi] jonvex commented on a diff in pull request #7881: [HUDI-5723] Automate and standardize enum configs

2023-02-14 Thread via GitHub


jonvex commented on code in PR #7881:
URL: https://github.com/apache/hudi/pull/7881#discussion_r1106008290


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -1912,6 +1871,10 @@ public Option 
getLogDataBlockFormat() {
 .map(HoodieLogBlock.HoodieLogBlockType::fromId);
   }
 
+  public void testEnums() {

Review Comment:
   need to remove



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #7881: [HUDI-5723] Automate and standardize enum configs

2023-02-14 Thread via GitHub


jonvex commented on code in PR #7881:
URL: https://github.com/apache/hudi/pull/7881#discussion_r1106009875


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java:
##
@@ -152,11 +155,51 @@ public boolean requiresTagging(WriteOperationType 
operationType) {
   public void close() {
   }
 
+  @EnumDescription("Determines how records are indexed. Default is engine 
specific.")
   public enum IndexType {

Review Comment:
   Need to look closely at these annotations; I don't know some of them well at all.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #7881: [HUDI-5723] Automate and standardize enum configs

2023-02-14 Thread via GitHub


jonvex commented on code in PR #7881:
URL: https://github.com/apache/hudi/pull/7881#discussion_r1106011282


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleaningTriggerStrategy.java:
##
@@ -18,7 +18,14 @@
 
 package org.apache.hudi.table.action.clean;
 
+import org.apache.hudi.common.config.EnumDefault;
+import org.apache.hudi.common.config.EnumDescription;
+import org.apache.hudi.common.config.EnumFieldDescription;
+
+@EnumDescription("Controls when cleaning is scheduled")
 public enum CleaningTriggerStrategy {

Review Comment:
   Only 1 option, but seems like we could add others pretty easily in the 
future. We could hide the config for now?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #7881: [HUDI-5723] Automate and standardize enum configs

2023-02-14 Thread via GitHub


jonvex commented on code in PR #7881:
URL: https://github.com/apache/hudi/pull/7881#discussion_r1106012580


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ClusteringPlanPartitionFilterMode.java:
##
@@ -18,12 +18,31 @@
 
 package org.apache.hudi.table.action.cluster;
 
+import org.apache.hudi.common.config.EnumDefault;
+import org.apache.hudi.common.config.EnumDescription;
+import org.apache.hudi.common.config.EnumFieldDescription;
+
 /**
  * Clustering partition filter mode
  */
+@EnumDescription("Partition filter mode used in the creation of clustering 
plan.")
 public enum ClusteringPlanPartitionFilterMode {

Review Comment:
   I really changed these; the documentation was pretty rough and I had to read the 
code to figure out what was happening.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #7881: [HUDI-5723] Automate and standardize enum configs

2023-02-14 Thread via GitHub


jonvex commented on code in PR #7881:
URL: https://github.com/apache/hudi/pull/7881#discussion_r1106014556


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/CompactionTriggerStrategy.java:
##
@@ -18,15 +18,31 @@
 
 package org.apache.hudi.table.action.compact;
 
+import org.apache.hudi.common.config.EnumDefault;
+import org.apache.hudi.common.config.EnumDescription;
+import org.apache.hudi.common.config.EnumFieldDescription;
+
+@EnumDescription("Controls when compaction is scheduled")
 public enum CompactionTriggerStrategy {

Review Comment:
   I just put the comments into these annotations. I'm pretty confused by the 
differences between NUM_COMMITS and NUM_COMMITS_AFTER_LAST_REQUEST especially. 
I can fix the wording after we figure out what is going on here.
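
   In the meantime, here is roughly how a user selects the strategy today through
   write configs (the key names are my recollection of the documented compaction
   configs, so treat them as assumptions and verify against HoodieCompactionConfig;
   the values are illustrative):

   ```java
   import java.util.Properties;

   public class CompactionTriggerConfigExample {
     public static void main(String[] args) {
       Properties props = new Properties();
       // Assumed keys: schedule compaction once this many delta commits accumulate.
       props.setProperty("hoodie.compact.inline.trigger.strategy", "NUM_COMMITS");
       props.setProperty("hoodie.compact.inline.max.delta.commits", "5");
       props.forEach((k, v) -> System.out.println(k + "=" + v));
     }
   }
   ```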



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #7881: [HUDI-5723] Automate and standardize enum configs

2023-02-14 Thread via GitHub


jonvex commented on code in PR #7881:
URL: https://github.com/apache/hudi/pull/7881#discussion_r1106015149


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/storage/HoodieStorageLayout.java:
##
@@ -50,7 +53,14 @@ public HoodieStorageLayout(HoodieWriteConfig config) {
*/
   public abstract boolean writeOperationSupported(WriteOperationType 
operationType);
 
+  @EnumDescription("Determines how the files are organized within a table")
   public enum LayoutType {

Review Comment:
   Need to fix the wording, don't understand this one



##
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/config/TestHoodieWriteConfig.java:
##
@@ -115,6 +115,11 @@ public void 
testDefaultClusteringExecutionStrategyClassAccordingToEngineType() {
 EngineType.JAVA, 
HoodieClusteringConfig.JAVA_SORT_AND_SIZE_EXECUTION_STRATEGY));
   }
 
+  @Test void testEnumConfigs() {

Review Comment:
   need to remove



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #7881: [HUDI-5723] Automate and standardize enum configs

2023-02-14 Thread via GitHub


jonvex commented on code in PR #7881:
URL: https://github.com/apache/hudi/pull/7881#discussion_r1106016160


##
hudi-common/src/main/java/org/apache/hudi/common/bloom/BloomFilterTypeCode.java:
##
@@ -18,11 +18,21 @@
 
 package org.apache.hudi.common.bloom;
 
+import org.apache.hudi.common.config.EnumDefault;
+import org.apache.hudi.common.config.EnumDescription;
+import org.apache.hudi.common.config.EnumFieldDescription;
+
 /**
  * Bloom filter type codes.
  * Please do not change the order of the entries.
  */
+@EnumDescription("Filter type used by Bloom filter")
 public enum BloomFilterTypeCode {

Review Comment:
   Need more detail in these descriptions



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5774) Fix prometheus configs for metadata table and support metric labels

2023-02-14 Thread Lokesh Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Jain updated HUDI-5774:
--
Summary: Fix prometheus configs for metadata table and support metric 
labels  (was: Support for adding labels to prometheus metrics)

> Fix prometheus configs for metadata table and support metric labels
> ---
>
> Key: HUDI-5774
> URL: https://issues.apache.org/jira/browse/HUDI-5774
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] jonvex commented on a diff in pull request #7881: [HUDI-5723] Automate and standardize enum configs

2023-02-14 Thread via GitHub


jonvex commented on code in PR #7881:
URL: https://github.com/apache/hudi/pull/7881#discussion_r1106021736


##
hudi-common/src/main/java/org/apache/hudi/common/table/marker/MarkerType.java:
##
@@ -19,10 +19,23 @@
 
 package org.apache.hudi.common.table.marker;
 
+import org.apache.hudi.common.config.EnumDefault;
+import org.apache.hudi.common.config.EnumDescription;
+import org.apache.hudi.common.config.EnumFieldDescription;
+
 /**
  * Marker type indicating how markers are stored in the file system.
  */
+@EnumDescription("Marker type indicating how markers are stored in the file 
system.")
 public enum MarkerType {

Review Comment:
   Probably need a bit of help with wording on this one



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #7881: [HUDI-5723] Automate and standardize enum configs

2023-02-14 Thread via GitHub


jonvex commented on code in PR #7881:
URL: https://github.com/apache/hudi/pull/7881#discussion_r1106022582


##
hudi-common/src/main/java/org/apache/hudi/common/util/queue/DisruptorWaitStrategyType.java:
##
@@ -27,35 +30,50 @@
 /**
  * Enum for the type of waiting strategy in Disruptor Queue.
  */
+@EnumDescription("Type of waiting strategy in the Disruptor Queue")

Review Comment:
   Need to explain what Disruptor Queue is



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Zouxxyy commented on a diff in pull request #7872: [HUDI-5716] Cleaning up `Partitioner`s hierarchy

2023-02-14 Thread via GitHub


Zouxxyy commented on code in PR #7872:
URL: https://github.com/apache/hudi/pull/7872#discussion_r1105540267


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RowCustomColumnsSortPartitioner.java:
##
@@ -19,43 +19,70 @@
 package org.apache.hudi.execution.bulkinsert;
 
 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.table.BulkInsertPartitioner;
-
+import org.apache.spark.sql.Column;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
+import scala.collection.JavaConverters;
 
 import java.util.Arrays;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+import static 
org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner.getOrderByColumnNames;
 
 /**
- * A partitioner that does sorting based on specified column values for each 
spark partitions.
+ * A partitioner that does local sorting for each RDD partition based on the 
tuple of
+ * values of the columns configured for ordering.
  */
-public class RowCustomColumnsSortPartitioner implements 
BulkInsertPartitioner> {
+public class RowCustomColumnsSortPartitioner extends 
RepartitioningBulkInsertPartitionerBase> {
 
-  private final String[] sortColumnNames;
+  private final String[] orderByColumnNames;
 
-  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config) {
-this.sortColumnNames = getSortColumnName(config);
+  public RowCustomColumnsSortPartitioner(HoodieWriteConfig config, 
HoodieTableConfig tableConfig) {
+super(tableConfig);
+this.orderByColumnNames = getOrderByColumnNames(config);
+
+checkState(orderByColumnNames.length > 0);
   }
 
-  public RowCustomColumnsSortPartitioner(String[] columnNames) {
-this.sortColumnNames = columnNames;
+  public RowCustomColumnsSortPartitioner(String[] columnNames, 
HoodieTableConfig tableConfig) {
+super(tableConfig);
+this.orderByColumnNames = columnNames;
+
+checkState(orderByColumnNames.length > 0);
   }
 
   @Override
-  public Dataset repartitionRecords(Dataset records, int 
outputSparkPartitions) {
-final String[] sortColumns = this.sortColumnNames;
-return records.sort(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
sortColumns)
-.coalesce(outputSparkPartitions);
+  public Dataset repartitionRecords(Dataset dataset, int 
targetPartitionNumHint) {
+Dataset repartitionedDataset;
+
+// NOTE: In case of partitioned table even "global" ordering (across all 
RDD partitions) could
+//   not change table's partitioning and therefore there's no point in 
doing global sorting
+//   across "physical" partitions, and instead we can reduce total 
amount of data being
+//   shuffled by doing do "local" sorting:
+//  - First, re-partitioning dataset such that "logical" 
partitions are aligned w/
+//  "physical" ones
+//  - Sorting locally w/in RDD ("logical") partitions
+//
+//   Non-partitioned tables will be globally sorted.
+if (isPartitionedTable) {
+  repartitionedDataset = 
dataset.repartition(handleTargetPartitionNumHint(targetPartitionNumHint),
+  new Column(HoodieRecord.PARTITION_PATH_METADATA_FIELD));
+} else {
+  repartitionedDataset = tryCoalesce(dataset, targetPartitionNumHint);

Review Comment:
   In addition, I wonder if `coalesce` can meet our needs. For example, if we 
want to modify the `clusteringGroup` containing N logical partitions into M 
(M>N) (using `cluster`), shuffle needs to happen anyway.
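
   For reference, `coalesce` only ever reduces the partition count without a shuffle,
   so growing N logical partitions into M > N output groups does need a `repartition`
   (and therefore a shuffle). A small standalone sketch of the difference (plain
   Spark, unrelated to the partitioner under review):

   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;

   public class CoalesceVsRepartition {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder()
           .appName("coalesce-vs-repartition")
           .master("local[4]")
           .getOrCreate();

       Dataset<Row> df = spark.range(0, 1000).toDF("id").repartition(4);

       // coalesce never increases the partition count: asking for 8 still yields 4.
       System.out.println("coalesce(8):    " + df.coalesce(8).rdd().getNumPartitions());

       // repartition(8) does produce 8 partitions, at the cost of a full shuffle.
       System.out.println("repartition(8): " + df.repartition(8).rdd().getNumPartitions());

       spark.stop();
     }
   }
   ```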



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #7881: [HUDI-5723] Automate and standardize enum configs

2023-02-14 Thread via GitHub


jonvex commented on code in PR #7881:
URL: https://github.com/apache/hudi/pull/7881#discussion_r1106023367


##
hudi-common/src/main/java/org/apache/hudi/common/util/queue/ExecutorType.java:
##
@@ -18,27 +18,27 @@
 
 package org.apache.hudi.common.util.queue;
 
+import org.apache.hudi.common.config.EnumDefault;
+import org.apache.hudi.common.config.EnumDescription;
+import org.apache.hudi.common.config.EnumFieldDescription;
+
 /**
  * Types of {@link org.apache.hudi.common.util.queue.HoodieExecutor}.
  */
+@EnumDescription("Types of executor that implements 
org.apache.hudi.common.util.queue.HoodieExecutor")

Review Comment:
   Maybe explain what this is for



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #7881: [HUDI-5723] Automate and standardize enum configs

2023-02-14 Thread via GitHub


jonvex commented on code in PR #7881:
URL: https://github.com/apache/hudi/pull/7881#discussion_r1106023838


##
hudi-common/src/main/java/org/apache/hudi/keygen/constant/KeyGeneratorType.java:
##
@@ -18,47 +18,41 @@
 
 package org.apache.hudi.keygen.constant;
 
+import org.apache.hudi.common.config.EnumDefault;
+import org.apache.hudi.common.config.EnumDescription;
+import org.apache.hudi.common.config.EnumFieldDescription;
+
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.List;
 
 /**
  * Types of {@link org.apache.hudi.keygen.KeyGenerator}.
  */
+@EnumDescription("Key generator class, that implements 
`org.apache.hudi.keygen.KeyGenerator`")
 public enum KeyGeneratorType {

Review Comment:
   probably should explain what key generator is for
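
   The gist to convey: a key generator derives the record key and the partition path
   from each incoming row, which is what Hudi then uses to locate and lay out records.
   A toy sketch of that idea (plain Java, not the actual `KeyGenerator` API):

   ```java
   import java.util.HashMap;
   import java.util.Map;

   public class ToyKeyGeneratorExample {

     // Hypothetical stand-in for org.apache.hudi.keygen.KeyGenerator.
     static String recordKey(Map<String, String> row) {
       return row.get("id");
     }

     static String partitionPath(Map<String, String> row) {
       return row.get("country") + "/" + row.get("date");
     }

     public static void main(String[] args) {
       Map<String, String> row = new HashMap<>();
       row.put("id", "42");
       row.put("country", "US");
       row.put("date", "2023-02-14");
       System.out.println(recordKey(row) + " -> " + partitionPath(row));
     }
   }
   ```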



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #7929: [HUDI-5754] Add new sources to deltastreamer docs

2023-02-14 Thread via GitHub


nsivabalan commented on code in PR #7929:
URL: https://github.com/apache/hudi/pull/7929#discussion_r1106032816


##
website/docs/hoodie_deltastreamer.md:
##
@@ -340,6 +388,26 @@ to trigger/processing of new or changed data as soon as it 
is available on S3.
 
 Insert code sample from this blog: 
https://hudi.apache.org/blog/2021/08/23/s3-events-source/#configuration-and-setup
 
+### GCS Events
+Google Cloud Storage (GCS) service provides an event notification mechanism 
which will post notifications when certain
+events happen in your GCS bucket. You can read more at [Pub/Sub 
Notifications](https://cloud.google.com/storage/docs/pubsub-notifications/).
+GCS will put these events in a Cloud Pub/Sub topic. Apache Hudi provides a 
GcsEventsSource that can read from Cloud Pub/Sub
+to trigger/processing of new or changed data as soon as it is available on GCS.
+
+ Setup
+A detailed guide on [How to use the 
system](https://docs.google.com/document/d/1VfvtdvhXw6oEHPgZ_4Be2rkPxIzE0kBCNUiVDsXnSAA/edit#heading=h.tpmqk5oj0crt)
 is available.
+A high level overview of the same is provided below.
+1. Configure Cloud Storage Pub/Sub Notifications for the bucket. Follow 
Google’s documentation here: 
[https://cloud.google.com/storage/docs/reporting-changes](reporting changes)
+2. Create a Pub/Sub subscription corresponding to the topic
+3. Note the GCP Project Id, the Pub/Sub Subscription Id and use them for the 
following Hoodie configurations:
+   1. hoodie.deltastreamer.source.gcs.project.id=GCP_PROJECT_ID
+   2. hoodie.deltastreamer.source.gcs.subscription.id=SUSBCRIPTION_ID
+   3. Start the `GcsEventsSource` using the `HoodieDeltaStreamer` utility with 
--source-class parameter as
+  `org.apache.hudi.utilities.sources.GcsEventsSource` and 
hoodie.deltastreamer.source.cloud.meta.ack=true, and path related

Review Comment:
   nit: within single quotes



##
website/docs/hoodie_deltastreamer.md:
##
@@ -340,6 +388,26 @@ to trigger/processing of new or changed data as soon as it 
is available on S3.
 
 Insert code sample from this blog: 
https://hudi.apache.org/blog/2021/08/23/s3-events-source/#configuration-and-setup
 
+### GCS Events
+Google Cloud Storage (GCS) service provides an event notification mechanism 
which will post notifications when certain
+events happen in your GCS bucket. You can read more at [Pub/Sub 
Notifications](https://cloud.google.com/storage/docs/pubsub-notifications/).
+GCS will put these events in a Cloud Pub/Sub topic. Apache Hudi provides a 
GcsEventsSource that can read from Cloud Pub/Sub
+to trigger/processing of new or changed data as soon as it is available on GCS.
+
+ Setup
+A detailed guide on [How to use the 
system](https://docs.google.com/document/d/1VfvtdvhXw6oEHPgZ_4Be2rkPxIzE0kBCNUiVDsXnSAA/edit#heading=h.tpmqk5oj0crt)
 is available.

Review Comment:
   We usually don't link to ad hoc Google docs; it should be part of the RFC. 
Curious to know why not just update the RFC?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #7942: [HUDI-5753] Add docs for record payload

2023-02-14 Thread via GitHub


nsivabalan commented on code in PR #7942:
URL: https://github.com/apache/hudi/pull/7942#discussion_r1106035940


##
website/docs/record_payload.md:
##
@@ -0,0 +1,97 @@
+---
+title: Record Payload 
+keywords: [hudi, merge, upsert, precombine]
+---
+
+## Record Payload
+
+One of the core features of Hudi is the ability to incrementally upsert data, 
deduplicate and merge records on the fly.
+Additionally, users can implement their custom logic to merge the input 
records with the record on storage. Record
+payload is an abstract representation of a Hudi record that allows the 
aforementioned capability. As we shall see below,
+Hudi provides out-of-box support for different payloads for different use 
cases, and a new record merger API for
+optimized payload handling. But, first let us understand how record payload is 
used in the Hudi upsert path.
+
+
+
+
+
+Figure above shows the main stages that records go through while being written 
to the Hudi table. In the precombining
+stage, Hudi performs any deduplication based on the payload implementation and 
precombine key configured by the user.
+Further, on index lookup, Hudi identifies which records are being updated and 
the record payload implementation tells
+Hudi how to merge the incoming record with the existing record on storage.
+
+### Existing Payloads
+
+ OverwriteWithLatestAvroPayload
+
+This is the default record payload implementation. It picks the record with 
the greatest value (determined by calling
+.compareTo() on the value of precombine key) to break ties and simply picks 
the latest record while merging. This gives

Review Comment:
   Actually there is a little more to this. Let's land this doc for 0.13.0, but as 
an immediate follow-up, address these comments. 
   We have preCombine and combineAndGetUpdate methods used on different occasions, 
so calling out just preCombine may not be right, because when merging with what is 
in storage we ignore the precombine value specifically in this 
payload (OverwriteWithLatestAvroPayload).
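
   For the follow-up, a rough sketch of how the two hooks differ on a custom payload
   (the shape approximates `HoodieRecordPayload`; treat names and signatures as
   assumptions rather than the exact interface):

   ```java
   // Sketch only: preCombine deduplicates among incoming records, while
   // combineAndGetUpdateValue merges the incoming record against storage.
   public class MyPayloadSketch {

     final long orderingValue; // e.g. the precombine field
     final String data;        // stand-in for the actual record

     MyPayloadSketch(long orderingValue, String data) {
       this.orderingValue = orderingValue;
       this.data = data;
     }

     // Called while deduplicating incoming records sharing the same key.
     MyPayloadSketch preCombine(MyPayloadSketch another) {
       return this.orderingValue >= another.orderingValue ? this : another;
     }

     // Called when merging with the record already on storage. An
     // overwrite-with-latest style payload just returns the incoming record,
     // ignoring the ordering value, which is the nuance called out above.
     String combineAndGetUpdateValue(String currentValueOnStorage) {
       return this.data;
     }

     public static void main(String[] args) {
       MyPayloadSketch older = new MyPayloadSketch(1L, "v1");
       MyPayloadSketch newer = new MyPayloadSketch(2L, "v2");
       System.out.println(older.preCombine(newer).combineAndGetUpdateValue("onStorage")); // v2
     }
   }
   ```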



##
website/docs/record_payload.md:
##
@@ -0,0 +1,97 @@
+---
+title: Record Payload 
+keywords: [hudi, merge, upsert, precombine]
+---
+
+## Record Payload
+
+One of the core features of Hudi is the ability to incrementally upsert data, 
deduplicate and merge records on the fly.
+Additionally, users can implement their custom logic to merge the input 
records with the record on storage. Record
+payload is an abstract representation of a Hudi record that allows the 
aforementioned capability. As we shall see below,
+Hudi provides out-of-box support for different payloads for different use 
cases, and a new record merger API for
+optimized payload handling. But, first let us understand how record payload is 
used in the Hudi upsert path.
+
+
+
+
+
+Figure above shows the main stages that records go through while being written 
to the Hudi table. In the precombining
+stage, Hudi performs any deduplication based on the payload implementation and 
precombine key configured by the user.
+Further, on index lookup, Hudi identifies which records are being updated and 
the record payload implementation tells
+Hudi how to merge the incoming record with the existing record on storage.
+
+### Existing Payloads
+
+ OverwriteWithLatestAvroPayload
+
+This is the default record payload implementation. It picks the record with 
the greatest value (determined by calling
+.compareTo() on the value of precombine key) to break ties and simply picks 
the latest record while merging. This gives
+latest-write-wins style semantics.
+
+ EventTimeAvroPayload
+
+Some use cases require merging records by event time and thus event time plays 
the role of an ordering field. This
+payload is particularly useful in the case of late-arriving data. For such use 
cases, users need to set
+the [payload event time field](/docs/configurations#RECORD_PAYLOAD) 
configuration.
+
+ ExpressionPayload
+
+This payload is very useful when you want to merge or delete records based on 
some conditional expression, especially

Review Comment:
   Should we remove this from the list? I thought it's meant to be used only 
internally. Can anyone directly set ExpressionPayload for their table? 



##
website/docs/record_payload.md:
##
@@ -0,0 +1,97 @@
+---
+title: Record Payload 
+keywords: [hudi, merge, upsert, precombine]
+---
+
+## Record Payload
+
+One of the core features of Hudi is the ability to incrementally upsert data, 
deduplicate and merge records on the fly.
+Additionally, users can implement their custom logic to merge the input 
records with the record on storage. Record
+payload is an abstract representation of a Hudi record that allows the 
aforementioned capability. As we shall see below,
+Hudi provides out-of-box support for different payloads for different use 
cases, and a new record merger API for
+optimized payload handling. But, first let us understand how record payload is 
used in 

[GitHub] [hudi] parisni commented on issue #7117: [SUPPORT] parquet bloom filters not supported by hudi

2023-02-14 Thread via GitHub


parisni commented on issue #7117:
URL: https://github.com/apache/hudi/issues/7117#issuecomment-1430008061

   Hi @nsivabalan, bloom filters at read time are a useful feature for read performance. 
Thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] pramodbiligiri commented on a diff in pull request #7929: [HUDI-5754] Add new sources to deltastreamer docs

2023-02-14 Thread via GitHub


pramodbiligiri commented on code in PR #7929:
URL: https://github.com/apache/hudi/pull/7929#discussion_r1106058509


##
website/docs/hoodie_deltastreamer.md:
##
@@ -340,6 +388,26 @@ to trigger/processing of new or changed data as soon as it 
is available on S3.
 
 Insert code sample from this blog: 
https://hudi.apache.org/blog/2021/08/23/s3-events-source/#configuration-and-setup
 
+### GCS Events
+Google Cloud Storage (GCS) service provides an event notification mechanism 
which will post notifications when certain
+events happen in your GCS bucket. You can read more at [Pub/Sub 
Notifications](https://cloud.google.com/storage/docs/pubsub-notifications/).
+GCS will put these events in a Cloud Pub/Sub topic. Apache Hudi provides a 
GcsEventsSource that can read from Cloud Pub/Sub
+to trigger/processing of new or changed data as soon as it is available on GCS.
+
+ Setup
+A detailed guide on [How to use the 
system](https://docs.google.com/document/d/1VfvtdvhXw6oEHPgZ_4Be2rkPxIzE0kBCNUiVDsXnSAA/edit#heading=h.tpmqk5oj0crt)
 is available.

Review Comment:
   This is the only publicly available reference doc for this feature. It was 
contributed to OSS (by me) a while after being developed. There's an older 
version of this doc but it is not public.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-5789) Azure-ci seems to be timing out in TestDisruptorExecutionInSpark

2023-02-14 Thread Jonathan Vexler (Jira)
Jonathan Vexler created HUDI-5789:
-

 Summary: Azure-ci seems to be timing out in 
TestDisruptorExecutionInSpark
 Key: HUDI-5789
 URL: https://issues.apache.org/jira/browse/HUDI-5789
 Project: Apache Hudi
  Issue Type: Bug
  Components: tests-ci
Reporter: Jonathan Vexler


If you take a look at the most recent timeouts, TestDisruptorExecutionInSpark 
is the last to be run

https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=15107&view=logs&j=600e7de6-e133-5e69-e615-50ee129b3c08&t=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7&l=11699
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=15096&view=logs&j=600e7de6-e133-5e69-e615-50ee129b3c08&t=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7&l=11765
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=15094&view=logs&j=600e7de6-e133-5e69-e615-50ee129b3c08&t=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7&l=11565
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=15078&view=logs&j=600e7de6-e133-5e69-e615-50ee129b3c08&t=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7&l=11628



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5790) testHoodieAsyncClusteringJobWithScheduleAndExecute is flakey

2023-02-14 Thread Jonathan Vexler (Jira)
Jonathan Vexler created HUDI-5790:
-

 Summary: testHoodieAsyncClusteringJobWithScheduleAndExecute is 
flakey
 Key: HUDI-5790
 URL: https://issues.apache.org/jira/browse/HUDI-5790
 Project: Apache Hudi
  Issue Type: Test
Reporter: Jonathan Vexler


for example: 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=15157&view=logs&j=dcedfe73-9485-5cc5-817a-73b61fc5dcb0&t=746585d8-b50a-55c3-26c5-517d93af9934&l=27668
{code:java}
java.util.concurrent.ExecutionException: java.lang.RuntimeException: 
org.apache.hudi.exception.HoodieException
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at 
org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer.deltaStreamerTestRunner(TestHoodieDeltaStreamer.java:901)
at 
org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer.deltaStreamerTestRunner(TestHoodieDeltaStreamer.java:884)
at 
org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer.deltaStreamerTestRunner(TestHoodieDeltaStreamer.java:929)
at 
org.apache.hudi.utilities.deltastreamer.TestHoodieDeltaStreamer.testHoodieAsyncClusteringJobWithScheduleAndExecute(TestHoodieDeltaStreamer.java:1369)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
at 
org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
at 
org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
at 
org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
at 
org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestTemplateMethod(TimeoutExtension.java:92)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
at 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:210)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:206)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:131)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:65)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$5(NodeTestTask.java:139)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$7(NodeTestTask.java:129)
at 
org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:127)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:126)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:84)
at 
org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.submit(SameThreadHierarchicalTestExecutorService.java:32)
  

[jira] [Updated] (HUDI-5789) Azure-ci seems to be timing out in TestDisruptorExecutionInSpark

2023-02-14 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-5789:
--
 Epic Link: HUDI-4302
Issue Type: Test  (was: Bug)
  Priority: Blocker  (was: Critical)

> Azure-ci seems to be timing out in TestDisruptorExecutionInSpark
> 
>
> Key: HUDI-5789
> URL: https://issues.apache.org/jira/browse/HUDI-5789
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: Jonathan Vexler
>Priority: Blocker
>
> If you take a look at the most recent timeouts, TestDisruptorExecutionInSpark 
> is the last to be run
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=15107&view=logs&j=600e7de6-e133-5e69-e615-50ee129b3c08&t=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7&l=11699
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=15096&view=logs&j=600e7de6-e133-5e69-e615-50ee129b3c08&t=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7&l=11765
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=15094&view=logs&j=600e7de6-e133-5e69-e615-50ee129b3c08&t=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7&l=11565
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=15078&view=logs&j=600e7de6-e133-5e69-e615-50ee129b3c08&t=bbbd7bcc-ae73-56b8-887a-cd2d6deaafc7&l=11628



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] jonvex opened a new pull request, #7943: [MINOR] enable publish test results

2023-02-14 Thread via GitHub


jonvex opened a new pull request, #7943:
URL: https://github.com/apache/hudi/pull/7943

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on pull request #7943: [MINOR] enable publish test results

2023-02-14 Thread via GitHub


jonvex commented on PR #7943:
URL: https://github.com/apache/hudi/pull/7943#issuecomment-1430106863

   Let's see what happens in the Azure pipeline.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Leoyzen commented on a diff in pull request #7940: [HUDI-5787] HoodieHiveCatalog should not delete data for dropping external table

2023-02-14 Thread via GitHub


Leoyzen commented on code in PR #7940:
URL: https://github.com/apache/hudi/pull/7940#discussion_r1106127884


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieHiveCatalog.java:
##
@@ -656,12 +657,10 @@ public void dropTable(ObjectPath tablePath, boolean 
ignoreIfNotExists)
   client.dropTable(
   tablePath.getDatabaseName(),
   tablePath.getObjectName(),
-  // Indicate whether associated data should be deleted.
-  // Set to 'true' for now because Flink tables shouldn't have data in Hive. Can
-  // be changed later if necessary
-  true,
+  // External table drops only the metadata, should not delete the underlying data.
+  !TableType.EXTERNAL_TABLE.name().equals(getHiveTable(tablePath).getTableType().toUpperCase(Locale.ROOT)),

Review Comment:
   This line should be reverted.
   
   This should be set to true because the HiveMetaStore implementation already 
decides whether to delete data based on "deleteData" AND "is not an external 
table".
   So setting "deleteData" to true is expected.
   ```JAVA
   boolean shouldDeleteData = deleteData && !isExternal;
   ```
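   
   For context, a condensed illustration of that interaction (plain Java, not the
   actual HiveMetaStore or catalog code):
   
   ```java
   public class DropTableDecisionSketch {
   
     // What the metastore side effectively does, per the snippet above.
     static boolean metastoreShouldDeleteData(boolean deleteData, boolean isExternalTable) {
       return deleteData && !isExternalTable;
     }
   
     public static void main(String[] args) {
       // Even if the catalog passes deleteData = true, an external table keeps its data.
       System.out.println(metastoreShouldDeleteData(true, true));   // false -> data preserved
       System.out.println(metastoreShouldDeleteData(true, false));  // true  -> data deleted
     }
   }
   ```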
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7752: [MINOR] De-duplicating Iterator implementations

2023-02-14 Thread via GitHub


alexeykudinkin commented on code in PR #7752:
URL: https://github.com/apache/hudi/pull/7752#discussion_r1106128276


##
hudi-common/src/main/java/org/apache/hudi/common/util/collection/CloseableMappingIterator.java:
##
@@ -22,8 +22,8 @@
 
 import java.util.function.Function;
 
-// TODO java-doc
-public class CloseableMappingIterator extends MappingIterator 
implements ClosableIterator {
+public class CloseableMappingIterator extends MappingIterator

Review Comment:
   Can you help me understand what you want to have clarified in there?



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java:
##
@@ -807,16 +809,17 @@ private ClosableIteratorWithSchema 
getRecordsIterator(
 
 Option, Schema>> 
schemaEvolutionTransformerOpt =
 composeEvolvedSchemaTransformer(dataBlock);
+
 // In case when schema has been evolved original persisted records will 
have to be
 // transformed to adhere to the new schema
-if (schemaEvolutionTransformerOpt.isPresent()) {
-  return ClosableIteratorWithSchema.newInstance(
-  new CloseableMappingIterator<>(blockRecordsIterator,
-  schemaEvolutionTransformerOpt.get().getLeft()),
-  schemaEvolutionTransformerOpt.get().getRight());
-} else {
-  return ClosableIteratorWithSchema.newInstance(blockRecordsIterator, 
dataBlock.getSchema());
-}
+Function transformer =
+schemaEvolutionTransformerOpt.map(Pair::getLeft)
+.orElse(Function.identity());
+
+Schema schema = schemaEvolutionTransformerOpt.map(Pair::getRight)
+.orElse(dataBlock.getSchema());
+

Review Comment:
   Not sure I understand what your concern is: we now duplicate much less code 
at the expense of having an empty transformer (which would be optimized out by the 
JVM).
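
   To make the point concrete, a small self-contained sketch of an identity-mapped
   iterator (generic Java, not the actual `CloseableMappingIterator`):

   ```java
   import java.util.Arrays;
   import java.util.Iterator;
   import java.util.function.Function;

   public class MappingIteratorSketch<I, O> implements Iterator<O> {

     private final Iterator<I> source;
     private final Function<I, O> mapper;

     MappingIteratorSketch(Iterator<I> source, Function<I, O> mapper) {
       this.source = source;
       this.mapper = mapper;
     }

     @Override
     public boolean hasNext() {
       return source.hasNext();
     }

     @Override
     public O next() {
       return mapper.apply(source.next());
     }

     public static void main(String[] args) {
       // Identity mapper: behaves exactly like the wrapped iterator; the extra
       // virtual call is trivial for the JIT to inline in practice.
       Iterator<String> it =
           new MappingIteratorSketch<>(Arrays.asList("a", "b").iterator(), Function.identity());
       it.forEachRemaining(System.out::println);
     }
   }
   ```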



##
hudi-common/src/main/java/org/apache/hudi/common/util/ClosableIterator.java:
##
@@ -24,8 +24,29 @@
  * An iterator that give a chance to release resources.
  *
  * @param  The return type
+ *
+ * TODO move under common.util.collection
  */

Review Comment:
   I get your point, but not everyone is the same -- I do follow up on the 
TODOs, so all in all I find this still helpful even when there's no Jira ticket 
associated (it's a minor change that doesn't seem worth creating a Jira for).



##
hudi-common/src/main/java/org/apache/hudi/common/util/ClosableIterator.java:
##
@@ -24,8 +24,29 @@
  * An iterator that give a chance to release resources.
  *
  * @param  The return type
+ *
+ * TODO move under common.util.collection
  */

Review Comment:
   I actually left this one in while doing another change; since this PR is 
dedicated to the cleanup, I will take it up here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on pull request #7943: [MINOR] enable publish test results

2023-02-14 Thread via GitHub


jonvex commented on PR #7943:
URL: https://github.com/apache/hudi/pull/7943#issuecomment-1430117180

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on pull request #7938: [HUDI-5785] Enhance Spark Datasource tests

2023-02-14 Thread via GitHub


yihua commented on PR #7938:
URL: https://github.com/apache/hudi/pull/7938#issuecomment-1430118933

   > Let's start the process of unifying all of the utilities to make sure 
we're not getting bitten by the same thing again 
https://github.com/apache/hudi/pull/7702/files#diff-93d5c78a2db3470cef4a643a3b41b8b97876f411310a5653d232525c87a6d749
   
   I created this to unify all APIs to construct Spark configs: 
[HUDI-5788](https://issues.apache.org/jira/browse/HUDI-5788)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7933: [HUDI-5774] Fix prometheus configs for metadata table and support metric labels

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7933:
URL: https://github.com/apache/hudi/pull/7933#issuecomment-1430119513

   
   ## CI report:
   
   * a02b393674ed4ae07d1eed67560f126ac06e178c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15158)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7938: [HUDI-5785] Enhance Spark Datasource tests

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7938:
URL: https://github.com/apache/hudi/pull/7938#issuecomment-1430119584

   
   ## CI report:
   
   * 4e7da703304c7783e9771e931e39854adf6458d6 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7938: [HUDI-5785] Enhance Spark Datasource tests

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7938:
URL: https://github.com/apache/hudi/pull/7938#issuecomment-1430128767

   
   ## CI report:
   
   * 4e7da703304c7783e9771e931e39854adf6458d6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15153)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7943: [MINOR] enable publish test results

2023-02-14 Thread via GitHub


hudi-bot commented on PR #7943:
URL: https://github.com/apache/hudi/pull/7943#issuecomment-1430128874

   
   ## CI report:
   
   * 42fc00a913f1fa8b6594f3220934104c786e246b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5787) HMSDDLExecutor should set table type to EXTERNAL_TABLE when setting hoodie.datasource.hive_sync.create_managed_table to false

2023-02-14 Thread Nicholas Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Jiang updated HUDI-5787:
-
Summary: HMSDDLExecutor should set table type to EXTERNAL_TABLE when 
setting hoodie.datasource.hive_sync.create_managed_table to false  (was: 
HoodieHiveCatalog should not delete data for dropping external table)

> HMSDDLExecutor should set table type to EXTERNAL_TABLE when setting 
> hoodie.datasource.hive_sync.create_managed_table to false
> -
>
> Key: HUDI-5787
> URL: https://issues.apache.org/jira/browse/HUDI-5787
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> HoodieHiveCatalog should not delete data when dropping the Hive external 
> table, for example, the value of the 
> 'hoodie.datasource.hive_sync.create_managed_table' config is false.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5787) HMSDDLExecutor should set table type to EXTERNAL_TABLE when hoodie.datasource.hive_sync.create_managed_table of sync config is false

2023-02-14 Thread Nicholas Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Jiang updated HUDI-5787:
-
Summary: HMSDDLExecutor should set table type to EXTERNAL_TABLE when 
hoodie.datasource.hive_sync.create_managed_table of sync config is false  (was: 
HMSDDLExecutor should set table type to EXTERNAL_TABLE when setting 
hoodie.datasource.hive_sync.create_managed_table to false)

> HMSDDLExecutor should set table type to EXTERNAL_TABLE when 
> hoodie.datasource.hive_sync.create_managed_table of sync config is false
> 
>
> Key: HUDI-5787
> URL: https://issues.apache.org/jira/browse/HUDI-5787
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> HoodieHiveCatalog should not delete data when dropping the Hive external 
> table, for example, the value of the 
> 'hoodie.datasource.hive_sync.create_managed_table' config is false.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5787) HMSDDLExecutor should set table type to EXTERNAL_TABLE when hoodie.datasource.hive_sync.create_managed_table of sync config is false

2023-02-14 Thread Nicholas Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Jiang updated HUDI-5787:
-
Description: HMSDDLExecutor should set the table type of Hive table to 
EXTERNAL_TABLE when hoodie.datasource.hive_sync.create_managed_table of sync 
config is set to false.  (was: HoodieHiveCatalog should not delete data when 
dropping the Hive external table, for example, the value of the 
'hoodie.datasource.hive_sync.create_managed_table' config is false.)

> HMSDDLExecutor should set table type to EXTERNAL_TABLE when 
> hoodie.datasource.hive_sync.create_managed_table of sync config is false
> 
>
> Key: HUDI-5787
> URL: https://issues.apache.org/jira/browse/HUDI-5787
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.1
>
>
> HMSDDLExecutor should set the table type of Hive table to EXTERNAL_TABLE when 
> hoodie.datasource.hive_sync.create_managed_table of sync config is set to 
> false.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

