Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-27 Thread via GitHub


yihua commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2024077244

   > > We have #7146 which also attempted to solve the same problem. Should we 
close #7146 and prefer this one?
   > 
   > That does not solve the problem, because the sorting (of the input batch) is 
thrown away by the hash-based mapping of each record to a specific bucket. 
This PR tries to solve the problem by implementing a new partitioner, 
`UpsertSortPartitioner`, derived from `UpsertPartitioner`, which preserves the 
sorted order of the input batch (by assigning a contiguous range of sorted 
input records to a single bucket/Spark partition).
   
   Then #7146 can be deprecated?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-25 Thread via GitHub


bhat-vinay commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2019436395

   > We have #7146 which also attempted to solve the same problem. Should we 
close #7146 and prefer this one?
   
   That does not solve the problem, because the sorting (of the input batch) is 
thrown away by the hash-based mapping of each record to a specific bucket. 
This PR tries to solve the problem by implementing a new partitioner, 
`UpsertSortPartitioner`, derived from `UpsertPartitioner`, which preserves the 
sorted order of the input batch (by assigning a contiguous range of sorted 
input records to a single bucket/Spark partition).
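
   To make the contrast concrete, here is a minimal, Spark-free sketch of the two mapping strategies. It is only an illustration under stated assumptions: `BucketAssignmentSketch`, `hashBucket`, `rangeBucket` and the equal-sized buckets are hypothetical and are not taken from the PR or from Hudi.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Hudi.
final class BucketAssignmentSketch {

  // Hash-based mapping: the bucket depends only on the key's hash,
  // so keys that are adjacent in sorted order can land in different buckets.
  static int hashBucket(String key, int numBuckets) {
    return Math.floorMod(key.hashCode(), numBuckets);
  }

  // Range-based mapping: the bucket depends on the record's position in the
  // sorted batch, so each bucket receives one contiguous run of sorted records.
  static int rangeBucket(long sortedIndex, long recordsPerBucket) {
    return (int) (sortedIndex / recordsPerBucket);
  }

  // Distributes already-sorted keys into buckets and returns the keys per bucket.
  static List<List<String>> assign(List<String> sortedKeys, int numBuckets, boolean useRange) {
    List<List<String>> buckets = new ArrayList<>();
    for (int i = 0; i < numBuckets; i++) {
      buckets.add(new ArrayList<>());
    }
    // Ceiling division so that every index maps to a bucket id below numBuckets.
    long perBucket = (sortedKeys.size() + numBuckets - 1) / numBuckets;
    for (int i = 0; i < sortedKeys.size(); i++) {
      String key = sortedKeys.get(i);
      int bucket = useRange ? rangeBucket(i, perBucket) : hashBucket(key, numBuckets);
      buckets.get(bucket).add(key);
    }
    return buckets;
  }

  public static void main(String[] args) {
    List<String> sorted = List.of("k01", "k02", "k03", "k04", "k05", "k06");
    System.out.println("hash : " + assign(sorted, 3, false));
    System.out.println("range: " + assign(sorted, 3, true));
  }
}
```

   With the range-based variant, every bucket holds one unbroken slice of the sorted batch, which is the property the comment above describes.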





Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-25 Thread via GitHub


yihua commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2019412865

   We have #7146 which also attempted to solve the same problem.  Should we 
close #7146 and prefer this one?





Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-24 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2016810146

   
   ## CI report:
   
   * 2c83cfaf2bdaef6b5075989992aeeff8052461ed UNKNOWN
   * 9329d8d43e9274478e64a0d40cbe7a5a0362ec90 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23010)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-24 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2016795397

   
   ## CI report:
   
   * 2c83cfaf2bdaef6b5075989992aeeff8052461ed UNKNOWN
   * e2296a2de6391dee42a83d390410eb71f193d55c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23004)
 
   * 9329d8d43e9274478e64a0d40cbe7a5a0362ec90 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23010)
 
   
   



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-24 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2016793663

   
   ## CI report:
   
   * 2c83cfaf2bdaef6b5075989992aeeff8052461ed UNKNOWN
   * e2296a2de6391dee42a83d390410eb71f193d55c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23004)
 
   * 9329d8d43e9274478e64a0d40cbe7a5a0362ec90 UNKNOWN
   
   



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-23 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2016391528

   
   ## CI report:
   
   * 2c83cfaf2bdaef6b5075989992aeeff8052461ed UNKNOWN
   * e2296a2de6391dee42a83d390410eb71f193d55c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23004)
 
   
   



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-22 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2016360644

   
   ## CI report:
   
   * 2c83cfaf2bdaef6b5075989992aeeff8052461ed UNKNOWN
   * a84507191a942c5d8c98610958ca48f47188bc48 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22994)
 
   * e2296a2de6391dee42a83d390410eb71f193d55c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23004)
 
   
   



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-22 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2016357819

   
   ## CI report:
   
   * 2c83cfaf2bdaef6b5075989992aeeff8052461ed UNKNOWN
   * a84507191a942c5d8c98610958ca48f47188bc48 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22994)
 
   * e2296a2de6391dee42a83d390410eb71f193d55c UNKNOWN
   
   



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-22 Thread via GitHub


bhat-vinay commented on code in PR #10876:
URL: https://github.com/apache/hudi/pull/10876#discussion_r1536561023


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -480,6 +480,20 @@ public class HoodieWriteConfig extends HoodieConfig {
   .markAdvanced()
   .withDocumentation(BulkInsertSortMode.class);
 
+  public static final ConfigProperty INSERT_SORT = ConfigProperty

Review Comment:
   Already handled by setting valid values for the config property.
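
   As a rough picture of what "setting valid values" buys, the sketch below rejects any value outside a declared set. It is a plain-Java stand-in written for illustration: `ValidatedPropertySketch`, the key `example.insert.sort.mode` and the values `none`/`global` are all hypothetical and do not reflect the actual `ConfigProperty` definition in this PR.

```java
import java.util.Set;

// Hypothetical stand-in for a config property with a fixed set of valid values;
// this is not Hudi's ConfigProperty class.
final class ValidatedPropertySketch {
  private final String key;
  private final Set<String> validValues;

  ValidatedPropertySketch(String key, Set<String> validValues) {
    this.key = key;
    this.validValues = validValues;
  }

  // Returns the value unchanged if it is allowed; otherwise fails fast.
  String check(String value) {
    if (!validValues.contains(value)) {
      throw new IllegalArgumentException(
          "Value '" + value + "' for " + key + " is not one of " + validValues);
    }
    return value;
  }

  public static void main(String[] args) {
    ValidatedPropertySketch sortMode =
        new ValidatedPropertySketch("example.insert.sort.mode", Set.of("none", "global"));
    System.out.println(sortMode.check("global"));
    try {
      sortMode.check("per-partition");
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

   Failing at the point where the value is set, rather than where it is first used, is what makes an explicit exception in the write path unnecessary in this kind of design.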






Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-22 Thread via GitHub


bhat-vinay commented on code in PR #10876:
URL: https://github.com/apache/hudi/pull/10876#discussion_r1536560898


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java:
##
@@ -411,4 +427,90 @@ public Partitioner getLayoutPartitioner(WorkloadProfile 
profile, String layoutPa
   protected void 
runPrecommitValidators(HoodieWriteMetadata> 
writeMetadata) {
 SparkValidatorUtils.runValidators(config, writeMetadata, context, table, 
instantTime);
   }
+
+  private HoodieData 
sortAndMapPartitionsAsRDD(HoodieData> dedupedRecords, 
Partitioner partitioner) {
+JavaPairRDD, HoodieRecord> mappedRDD = 
getSortedIndexedRecords(dedupedRecords);
+JavaPairRDD, HoodieRecord> partitionedRDD;
+if (table.requireSortedRecords()) {
+  // Partition and sort within each partition as a single step. This is 
faster than partitioning first and then
+  // applying a sort.
+  Comparator> comparator = 
(Comparator> & Serializable) (t1, t2) -> {
+HoodieKey key1 = t1._1();
+HoodieKey key2 = t2._1();
+return key1.getRecordKey().compareTo(key2.getRecordKey());
+  };
+  partitionedRDD = 
mappedRDD.repartitionAndSortWithinPartitions(partitioner, comparator);
+} else {
+  // Partition only
+  partitionedRDD = mappedRDD.partitionBy(partitioner);
+}
+
+return 
HoodieJavaRDD.of(partitionedRDD.map(Tuple2::_2).mapPartitionsWithIndex((partition,
 recordItr) -> {
+  if (WriteOperationType.isChangingRecords(operationType)) {
+return handleUpsertPartition(instantTime, partition, recordItr, 
partitioner);
+  } else {
+return handleInsertPartition(instantTime, partition, recordItr, 
partitioner);
+  }
+}, true).flatMap(List::iterator));
+  }
+
+  private boolean operationRequiresSorting() {
+return operationType == WriteOperationType.INSERT && 
config.getBoolean(INSERT_SORT);

Review Comment:
   The current implementation (in this PR) does not support sorting for the UPSERT 
operation.






Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-22 Thread via GitHub


vinothchandar commented on code in PR #10876:
URL: https://github.com/apache/hudi/pull/10876#discussion_r1536245569


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java:
##
@@ -411,4 +427,90 @@ public Partitioner getLayoutPartitioner(WorkloadProfile 
profile, String layoutPa
   protected void 
runPrecommitValidators(HoodieWriteMetadata> 
writeMetadata) {
 SparkValidatorUtils.runValidators(config, writeMetadata, context, table, 
instantTime);
   }
+
+  private HoodieData 
sortAndMapPartitionsAsRDD(HoodieData> dedupedRecords, 
Partitioner partitioner) {

Review Comment:
   Yes, let's rename.






Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-22 Thread via GitHub


vinothchandar commented on code in PR #10876:
URL: https://github.com/apache/hudi/pull/10876#discussion_r1536245244


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java:
##
@@ -230,6 +236,10 @@ protected Partitioner getPartitioner(WorkloadProfile 
profile) {
   }
 
   private HoodieData 
mapPartitionsAsRDD(HoodieData> dedupedRecords, Partitioner 
partitioner) {
+if (operationRequiresSorting()) {

Review Comment:
   upsert is updates and inserts. 






Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-22 Thread via GitHub


vinothchandar commented on code in PR #10876:
URL: https://github.com/apache/hudi/pull/10876#discussion_r1536245082


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -480,6 +480,20 @@ public class HoodieWriteConfig extends HoodieConfig {
   .markAdvanced()
   .withDocumentation(BulkInsertSortMode.class);
 
+  public static final ConfigProperty INSERT_SORT = ConfigProperty

Review Comment:
   Let's make sure we throw an exception for the unsupported mode.






Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-22 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2015437643

   
   ## CI report:
   
   * 2c83cfaf2bdaef6b5075989992aeeff8052461ed UNKNOWN
   * a84507191a942c5d8c98610958ca48f47188bc48 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22994)
 
   
   



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-22 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2015350367

   
   ## CI report:
   
   * b802619f011c1d9ef5b334ecf67ab7df74964e08 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22958)
 
   * 2c83cfaf2bdaef6b5075989992aeeff8052461ed UNKNOWN
   * a84507191a942c5d8c98610958ca48f47188bc48 UNKNOWN
   
   



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-22 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2015337228

   
   ## CI report:
   
   * b802619f011c1d9ef5b334ecf67ab7df74964e08 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22958)
 
   * 2c83cfaf2bdaef6b5075989992aeeff8052461ed UNKNOWN
   
   



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-22 Thread via GitHub


bhat-vinay commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2015327684

   > IIUC this adds additional shuffle and a new job? I'd like to understand 
how we think this impacts the current insert DAG. Yet to review the new 
partitioner, will do once I hear back on these.
   
   Yes, there is a sorting stage (a global sort of the input batch), which might 
add a shuffle. The new job assigns sequentially increasing indexes to the 
sorted records; `UpsertSortPartitioner` relies on these indexes to preserve the 
sorted order of the input batch while still handling small files as 
efficiently as possible. Not sure if this is what you meant.





Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-22 Thread via GitHub


bhat-vinay commented on code in PR #10876:
URL: https://github.com/apache/hudi/pull/10876#discussion_r1535756962


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java:
##
@@ -411,4 +427,90 @@ public Partitioner getLayoutPartitioner(WorkloadProfile 
profile, String layoutPa
   protected void 
runPrecommitValidators(HoodieWriteMetadata> 
writeMetadata) {
 SparkValidatorUtils.runValidators(config, writeMetadata, context, table, 
instantTime);
   }
+
+  private HoodieData 
sortAndMapPartitionsAsRDD(HoodieData> dedupedRecords, 
Partitioner partitioner) {

Review Comment:
   > lets UT this method?
   
   Done.
   
   > also rename? this is performing the actual write . sortIfNeededAndWrite ?
   
   Borrowed the name from `mapPartitionsAsRDD(...)`, which also performs a write 
without making it explicit. I can rename this if it makes it easier to read.






Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-22 Thread via GitHub


bhat-vinay commented on code in PR #10876:
URL: https://github.com/apache/hudi/pull/10876#discussion_r1535754607


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java:
##
@@ -411,4 +427,90 @@ public Partitioner getLayoutPartitioner(WorkloadProfile 
profile, String layoutPa
   protected void 
runPrecommitValidators(HoodieWriteMetadata> 
writeMetadata) {
 SparkValidatorUtils.runValidators(config, writeMetadata, context, table, 
instantTime);
   }
+
+  private HoodieData 
sortAndMapPartitionsAsRDD(HoodieData> dedupedRecords, 
Partitioner partitioner) {
+JavaPairRDD, HoodieRecord> mappedRDD = 
getSortedIndexedRecords(dedupedRecords);
+JavaPairRDD, HoodieRecord> partitionedRDD;
+if (table.requireSortedRecords()) {
+  // Partition and sort within each partition as a single step. This is 
faster than partitioning first and then
+  // applying a sort.
+  Comparator> comparator = 
(Comparator> & Serializable) (t1, t2) -> {
+HoodieKey key1 = t1._1();
+HoodieKey key2 = t2._1();
+return key1.getRecordKey().compareTo(key2.getRecordKey());
+  };
+  partitionedRDD = 
mappedRDD.repartitionAndSortWithinPartitions(partitioner, comparator);
+} else {
+  // Partition only
+  partitionedRDD = mappedRDD.partitionBy(partitioner);
+}
+
+return 
HoodieJavaRDD.of(partitionedRDD.map(Tuple2::_2).mapPartitionsWithIndex((partition,
 recordItr) -> {
+  if (WriteOperationType.isChangingRecords(operationType)) {
+return handleUpsertPartition(instantTime, partition, recordItr, 
partitioner);
+  } else {
+return handleInsertPartition(instantTime, partition, recordItr, 
partitioner);
+  }
+}, true).flatMap(List::iterator));
+  }
+
+  private boolean operationRequiresSorting() {
+return operationType == WriteOperationType.INSERT && 
config.getBoolean(INSERT_SORT);
+  }
+
+  private JavaPairRDD, HoodieRecord> 
getSortedIndexedRecords(HoodieData> dedupedRecords) {
+// Get any user specified sort columns
+String customSortColField = 
config.getString(INSERT_USER_DEFINED_SORT_COLUMNS);
+
+String[] sortColumns;
+if (!isNullOrEmpty(customSortColField)) {
+  // Extract user specified sort-column fields as an array
+  sortColumns = Arrays.stream(customSortColField.split(","))
+  .map(String::trim).toArray(String[]::new);
+} else {
+  // Use record-key as sort column
+  sortColumns = 
Arrays.stream(HoodieRecord.HoodieMetadataField.RECORD_KEY_METADATA_FIELD.getFieldName().split(","))

Review Comment:
   > left a comment already. idk how this works for partitioned tables?
   
   I am not sure I understand. Why would it not work for partitioned tables?
   
   > do we need the `.split(",")` here?
   
   That block was to placate the compiler. Reworked and removed it.






Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-22 Thread via GitHub


bhat-vinay commented on code in PR #10876:
URL: https://github.com/apache/hudi/pull/10876#discussion_r1535751978


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java:
##
@@ -411,4 +427,90 @@ public Partitioner getLayoutPartitioner(WorkloadProfile 
profile, String layoutPa
   protected void 
runPrecommitValidators(HoodieWriteMetadata> 
writeMetadata) {
 SparkValidatorUtils.runValidators(config, writeMetadata, context, table, 
instantTime);
   }
+
+  private HoodieData 
sortAndMapPartitionsAsRDD(HoodieData> dedupedRecords, 
Partitioner partitioner) {
+JavaPairRDD, HoodieRecord> mappedRDD = 
getSortedIndexedRecords(dedupedRecords);
+JavaPairRDD, HoodieRecord> partitionedRDD;
+if (table.requireSortedRecords()) {
+  // Partition and sort within each partition as a single step. This is 
faster than partitioning first and then
+  // applying a sort.
+  Comparator> comparator = 
(Comparator> & Serializable) (t1, t2) -> {
+HoodieKey key1 = t1._1();
+HoodieKey key2 = t2._1();
+return key1.getRecordKey().compareTo(key2.getRecordKey());
+  };
+  partitionedRDD = 
mappedRDD.repartitionAndSortWithinPartitions(partitioner, comparator);
+} else {
+  // Partition only
+  partitionedRDD = mappedRDD.partitionBy(partitioner);
+}
+
+return 
HoodieJavaRDD.of(partitionedRDD.map(Tuple2::_2).mapPartitionsWithIndex((partition,
 recordItr) -> {
+  if (WriteOperationType.isChangingRecords(operationType)) {
+return handleUpsertPartition(instantTime, partition, recordItr, 
partitioner);
+  } else {
+return handleInsertPartition(instantTime, partition, recordItr, 
partitioner);
+  }
+}, true).flatMap(List::iterator));
+  }
+
+  private boolean operationRequiresSorting() {
+return operationType == WriteOperationType.INSERT && 
config.getBoolean(INSERT_SORT);
+  }
+
+  private JavaPairRDD, HoodieRecord> 
getSortedIndexedRecords(HoodieData> dedupedRecords) {
+// Get any user specified sort columns
+String customSortColField = 
config.getString(INSERT_USER_DEFINED_SORT_COLUMNS);
+
+String[] sortColumns;
+if (!isNullOrEmpty(customSortColField)) {
+  // Extract user specified sort-column fields as an array
+  sortColumns = Arrays.stream(customSortColField.split(","))
+  .map(String::trim).toArray(String[]::new);
+} else {
+  // Use record-key as sort column
+  sortColumns = 
Arrays.stream(HoodieRecord.HoodieMetadataField.RECORD_KEY_METADATA_FIELD.getFieldName().split(","))
+  .map(String::trim).toArray(String[]::new);
+}
+
+// Get the record's schema from the write config
+SerializableSchema serializableSchema = new SerializableSchema(new 
Schema.Parser().parse(config.getSchema()));
+
+JavaRDD> javaRdd = 
HoodieJavaRDD.getJavaRDD(dedupedRecords);
+JavaRDD> sortedRecords = javaRdd.sortBy(record -> {

Review Comment:
   My understanding is that `repartitionAndSortWithinPartitions` sorts 
within a bucket (i.e., a Spark RDD partition) after `UpsertPartitioner` has 
already partitioned the input batch. It handles the case of writing sorted 
key-values to file formats that depend on sorted input (e.g., HFile). I am not 
sure how partitioning first and then sorting within each partition would be 
useful here.
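
   A small Spark-free sketch of "partition first, then sort within each partition" shows why this alone does not give contiguous key ranges per file. The names (`SortWithinPartitionsSketch`, `partitionThenSortWithin`) and the toy partitioner based on key length are hypothetical; the partitioner merely stands in for any mapping that ignores sort order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical illustration of partition-then-sort-within-partition.
final class SortWithinPartitionsSketch {

  // Routes each key with the given partitioner, then sorts every partition on its own.
  static List<List<String>> partitionThenSortWithin(
      List<String> keys, int numPartitions, ToIntFunction<String> partitioner) {
    List<List<String>> parts = new ArrayList<>();
    for (int i = 0; i < numPartitions; i++) {
      parts.add(new ArrayList<>());
    }
    for (String key : keys) {
      parts.get(partitioner.applyAsInt(key)).add(key);
    }
    for (List<String> part : parts) {
      Collections.sort(part);
    }
    return parts;
  }

  public static void main(String[] args) {
    List<String> keys = List.of("b", "aa", "a", "bb");
    // Toy partitioner that ignores sort order: even-length keys go to 0, odd-length to 1.
    List<List<String>> parts = partitionThenSortWithin(keys, 2, k -> k.length() % 2);
    System.out.println(parts); // prints [[aa, bb], [a, b]]
  }
}
```

   Each partition is sorted internally, which is enough for a format that needs sorted keys inside one file, but the ranges `aa..bb` and `a..b` interleave, so no partition holds a contiguous slice of the globally sorted batch.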






Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-22 Thread via GitHub


bhat-vinay commented on code in PR #10876:
URL: https://github.com/apache/hudi/pull/10876#discussion_r1535749601


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java:
##
@@ -411,4 +427,90 @@ public Partitioner getLayoutPartitioner(WorkloadProfile 
profile, String layoutPa
   protected void 
runPrecommitValidators(HoodieWriteMetadata> 
writeMetadata) {
 SparkValidatorUtils.runValidators(config, writeMetadata, context, table, 
instantTime);
   }
+
+  private HoodieData 
sortAndMapPartitionsAsRDD(HoodieData> dedupedRecords, 
Partitioner partitioner) {
+JavaPairRDD, HoodieRecord> mappedRDD = 
getSortedIndexedRecords(dedupedRecords);
+JavaPairRDD, HoodieRecord> partitionedRDD;
+if (table.requireSortedRecords()) {
+  // Partition and sort within each partition as a single step. This is 
faster than partitioning first and then
+  // applying a sort.
+  Comparator> comparator = 
(Comparator> & Serializable) (t1, t2) -> {
+HoodieKey key1 = t1._1();
+HoodieKey key2 = t2._1();
+return key1.getRecordKey().compareTo(key2.getRecordKey());
+  };
+  partitionedRDD = 
mappedRDD.repartitionAndSortWithinPartitions(partitioner, comparator);
+} else {
+  // Partition only
+  partitionedRDD = mappedRDD.partitionBy(partitioner);
+}
+
+return 
HoodieJavaRDD.of(partitionedRDD.map(Tuple2::_2).mapPartitionsWithIndex((partition,
 recordItr) -> {
+  if (WriteOperationType.isChangingRecords(operationType)) {
+return handleUpsertPartition(instantTime, partition, recordItr, 
partitioner);
+  } else {
+return handleInsertPartition(instantTime, partition, recordItr, 
partitioner);
+  }
+}, true).flatMap(List::iterator));
+  }
+
+  private boolean operationRequiresSorting() {
+return operationType == WriteOperationType.INSERT && 
config.getBoolean(INSERT_SORT);
+  }
+
+  private JavaPairRDD, HoodieRecord> 
getSortedIndexedRecords(HoodieData> dedupedRecords) {
+// Get any user specified sort columns
+String customSortColField = 
config.getString(INSERT_USER_DEFINED_SORT_COLUMNS);
+
+String[] sortColumns;
+if (!isNullOrEmpty(customSortColField)) {
+  // Extract user specified sort-column fields as an array
+  sortColumns = Arrays.stream(customSortColField.split(","))
+  .map(String::trim).toArray(String[]::new);
+} else {
+  // Use record-key as sort column
+  sortColumns = 
Arrays.stream(HoodieRecord.HoodieMetadataField.RECORD_KEY_METADATA_FIELD.getFieldName().split(","))
+  .map(String::trim).toArray(String[]::new);
+}
+
+// Get the record's schema from the write config
+SerializableSchema serializableSchema = new SerializableSchema(new 
Schema.Parser().parse(config.getSchema()));
+
+JavaRDD> javaRdd = 
HoodieJavaRDD.getJavaRDD(dedupedRecords);
+JavaRDD> sortedRecords = javaRdd.sortBy(record -> {
+  if (isNullOrEmpty(customSortColField)) {
+// If sorting based on record-key, extract it directly using 
record.getRecordKey()
+return new StringBuilder()
+.append(record.getPartitionPath())
+.append("+")
+.append(record.getRecordKey())
+.toString();
+  } else {
+// Extract the sort columns from the record and return it as  string 
(prepended with partition-path)
+Object[] columnValues = 
record.getColumnValues(serializableSchema.get(), sortColumns, false);
+String sortColString = 
Arrays.stream(columnValues).map(Object::toString).collect(Collectors.joining());
+return new StringBuilder()
+.append(record.getPartitionPath())
+.append("+")
+.append(sortColString)
+.toString();
+  }
+}, true, 0);
+
+// Assign index to each record in the RDD
+JavaRDD, Long>> indexedRecords = 
sortedRecords.zipWithIndex()

Review Comment:
   This is required by the partitioner to assign a contiguous chunk of sorted 
input records to a single bucket (a bucket in turn maps to a single file, hence 
the sorted records are written to a single file). I am not sure if there is any 
other way to assign sequentially increasing indexes to the sorted records, 
which can then be used in `UpsertSortPartitioner::getPartition(...)` to determine 
the bucket that this record maps to.
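
   One way to picture an index-to-bucket lookup is sketched below. This is an assumption-laden illustration, not the PR's `UpsertSortPartitioner`: the class name `IndexRangePartitionerSketch`, the per-bucket capacities, and the `TreeMap` lookup are all hypothetical. Each bucket owns a contiguous range of sorted indexes whose length is its capacity (for example, the room left in a small file, or the size of a new file).

```java
import java.util.TreeMap;

// Hypothetical sketch: maps a record's sorted index to the bucket owning that index range.
final class IndexRangePartitionerSketch {
  // Key: first sorted index owned by a bucket. Value: bucket id.
  private final TreeMap<Long, Integer> bucketByStartIndex = new TreeMap<>();
  private final long totalCapacity;

  IndexRangePartitionerSketch(long[] bucketCapacities) {
    long start = 0;
    for (int bucket = 0; bucket < bucketCapacities.length; bucket++) {
      if (bucketCapacities[bucket] > 0) { // buckets without room own no range
        bucketByStartIndex.put(start, bucket);
        start += bucketCapacities[bucket];
      }
    }
    totalCapacity = start;
  }

  // Finds the bucket whose range [start, start + capacity) contains the index.
  int getPartition(long sortedIndex) {
    if (sortedIndex < 0 || sortedIndex >= totalCapacity) {
      throw new IllegalArgumentException("Index " + sortedIndex + " is outside [0, " + totalCapacity + ")");
    }
    return bucketByStartIndex.floorEntry(sortedIndex).getValue();
  }

  public static void main(String[] args) {
    // Three buckets that can take 2, 3 and 1 records respectively.
    IndexRangePartitionerSketch partitioner =
        new IndexRangePartitionerSketch(new long[] {2, 3, 1});
    for (long i = 0; i < 6; i++) {
      System.out.println("index " + i + " -> bucket " + partitioner.getPartition(i));
    }
  }
}
```

   Because the lookup depends only on the index, records that are adjacent in the sorted batch always fall into the same bucket or into neighbouring ones.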






Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-22 Thread via GitHub


bhat-vinay commented on code in PR #10876:
URL: https://github.com/apache/hudi/pull/10876#discussion_r1535706564


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java:
##
@@ -394,6 +404,12 @@ public Partitioner getUpsertPartitioner(WorkloadProfile 
profile) {
 if (profile == null) {
   throw new HoodieUpsertException("Need workload profile to construct the 
upsert partitioner.");
 }
+
+if (operationRequiresSorting()) {
+  // Return UpsertSortPartitioner if the input records are going to be 
sorted
+  return new UpsertSortPartitioner<>(profile, context, table, config);
+}

Review Comment:
   Done.






Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-22 Thread via GitHub


bhat-vinay commented on code in PR #10876:
URL: https://github.com/apache/hudi/pull/10876#discussion_r1535704794


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java:
##
@@ -230,6 +236,10 @@ protected Partitioner getPartitioner(WorkloadProfile 
profile) {
   }
 
   private HoodieData 
mapPartitionsAsRDD(HoodieData> dedupedRecords, Partitioner 
partitioner) {
+if (operationRequiresSorting()) {

Review Comment:
   What does sorting mean for the 'upsert' operation? If the record is really 
being updated, won't there be an index lookup that routes the record to its 
specific filegroup? Or is there a benefit to supporting sorting when an upsert 
batch contains new records that are being written for the first time? This PR 
allows sorting only for the INSERT operation; 
`BaseSparkCommitActionExecutor::operationRequiresSorting(...)` takes care of 
that. If the configs need to be made ambiguity-proof for future use cases, 
should I rename them to `WRITE_SORT_MODE`, `WRITE_SORT_OPERATIONS` and 
`WRITE_USER_DEFINED_PARTITIONER_SORT_COLUMNS`?






Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-22 Thread via GitHub


bhat-vinay commented on code in PR #10876:
URL: https://github.com/apache/hudi/pull/10876#discussion_r1535694790


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -480,6 +480,20 @@ public class HoodieWriteConfig extends HoodieConfig {
   .markAdvanced()
   .withDocumentation(BulkInsertSortMode.class);
 
+  public static final ConfigProperty INSERT_SORT = ConfigProperty

Review Comment:
   Done. IIUC, you are asking to use `BulkInsertSortMode::NONE` and 
`BulkInsertSortMode::GLOBAL_SORT` (instead of a boolean). FYI, there are no 
other sort modes for `insert`. There is only global sort (i.e., sort the 
entire input batch). Hence the valid values are NONE or GLOBAL_SORT.
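
   A minimal sketch of what replacing the boolean with a two-valued mode could 
look like, assuming a plain `java.util.Properties` lookup rather than Hudi's 
`ConfigProperty` machinery. The enums `SortMode` and `WriteOp`, the class 
`InsertSortSettings`, and the key `example.insert.sort.mode` are hypothetical 
and exist only for this example.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical stand-ins for the sort mode and the write operation type.
enum SortMode { NONE, GLOBAL_SORT }

enum WriteOp { INSERT, UPSERT, BULK_INSERT }

final class InsertSortSettings {
  // Hypothetical key used only in this sketch; it is not a real Hudi config key.
  static final String SORT_MODE_KEY = "example.insert.sort.mode";

  // Reads the mode, defaulting to NONE. Any value other than NONE or
  // GLOBAL_SORT makes Enum.valueOf throw IllegalArgumentException.
  static SortMode sortMode(Properties props) {
    String raw = props.getProperty(SORT_MODE_KEY, SortMode.NONE.name());
    return SortMode.valueOf(raw.trim().toUpperCase(Locale.ROOT));
  }

  // Sorting applies only to INSERT, and only when GLOBAL_SORT is requested.
  static boolean operationRequiresSorting(WriteOp op, Properties props) {
    return op == WriteOp.INSERT && sortMode(props) == SortMode.GLOBAL_SORT;
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty(SORT_MODE_KEY, "global_sort");
    System.out.println("insert sorted? " + operationRequiresSorting(WriteOp.INSERT, props));
    System.out.println("upsert sorted? " + operationRequiresSorting(WriteOp.UPSERT, props));
  }
}
```

   Using an enum instead of a boolean leaves room to reject unsupported modes 
explicitly, at the cost of one more value to validate.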






Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-22 Thread via GitHub


bhat-vinay commented on code in PR #10876:
URL: https://github.com/apache/hudi/pull/10876#discussion_r1535695254


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -480,6 +480,20 @@ public class HoodieWriteConfig extends HoodieConfig {
   .markAdvanced()
   .withDocumentation(BulkInsertSortMode.class);
 
+  public static final ConfigProperty INSERT_SORT = ConfigProperty
+  .key("hoodie.insert.sort")
+  .defaultValue(false)
+  .markAdvanced()
+  .withDocumentation("Determines whether the insert operation should sort 
the input records. The sorting for insert is always"
+  + " global (among all input records in a batch)");
+
+  public static final ConfigProperty INSERT_USER_DEFINED_SORT_COLUMNS 
= ConfigProperty
+  .key("hoodie.insert.user.defined.sort.columns")
+  .noDefaultValue()
+  .markAdvanced()
+  .withDocumentation("Columns to sort the data by when hoodie.insert.sort 
is set to true. If not specified, record-key is used for sorting."

Review Comment:
   Yes, it is. It was just not explicitly mentioned here. Updated the 
documentation to be more explicit.






Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-22 Thread via GitHub


bhat-vinay commented on code in PR #10876:
URL: https://github.com/apache/hudi/pull/10876#discussion_r1535473655


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -480,6 +480,20 @@ public class HoodieWriteConfig extends HoodieConfig {
   .markAdvanced()
   .withDocumentation(BulkInsertSortMode.class);
 
+  public static final ConfigProperty INSERT_SORT = ConfigProperty
+  .key("hoodie.insert.sort")
+  .defaultValue(false)
+  .markAdvanced()
+  .withDocumentation("Determines whether the insert operation should sort 
the input records. The sorting for insert is always"
+  + " global (among all input records in a batch)");
+
+  public static final ConfigProperty INSERT_USER_DEFINED_SORT_COLUMNS 
= ConfigProperty
+  .key("hoodie.insert.user.defined.sort.columns")

Review Comment:
   Bulk insert's sort column config is named 
`hoodie.bulkinsert.user.defined.partitioner.sort.columns`. Hence using 
`hoodie.insert.user.defined.partitioner.sort.columns` here






Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-21 Thread via GitHub


vinothchandar commented on code in PR #10876:
URL: https://github.com/apache/hudi/pull/10876#discussion_r1534279513


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -480,6 +480,20 @@ public class HoodieWriteConfig extends HoodieConfig {
   .markAdvanced()
   .withDocumentation(BulkInsertSortMode.class);
 
+  public static final ConfigProperty INSERT_SORT = ConfigProperty
+  .key("hoodie.insert.sort")
+  .defaultValue(false)
+  .markAdvanced()
+  .withDocumentation("Determines whether the insert operation should sort 
the input records. The sorting for insert is always"
+  + " global (among all input records in a batch)");
+
+  public static final ConfigProperty INSERT_USER_DEFINED_SORT_COLUMNS 
= ConfigProperty
+  .key("hoodie.insert.user.defined.sort.columns")

Review Comment:
   Let's make sure it's consistent in naming with bulk_insert's config.



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java:
##
@@ -230,6 +236,10 @@ protected Partitioner getPartitioner(WorkloadProfile 
profile) {
   }
 
   private HoodieData 
mapPartitionsAsRDD(HoodieData> dedupedRecords, Partitioner 
partitioner) {
+if (operationRequiresSorting()) {

Review Comment:
   So technically, does this work for both insert and upsert operations, or 
just insert? If both, then we can't name the configs just around `insert`.



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java:
##
@@ -411,4 +427,90 @@ public Partitioner getLayoutPartitioner(WorkloadProfile 
profile, String layoutPa
   protected void 
runPrecommitValidators(HoodieWriteMetadata> 
writeMetadata) {
 SparkValidatorUtils.runValidators(config, writeMetadata, context, table, 
instantTime);
   }
+
+  private HoodieData 
sortAndMapPartitionsAsRDD(HoodieData> dedupedRecords, 
Partitioner partitioner) {
+JavaPairRDD, HoodieRecord> mappedRDD = 
getSortedIndexedRecords(dedupedRecords);
+JavaPairRDD, HoodieRecord> partitionedRDD;
+if (table.requireSortedRecords()) {
+  // Partition and sort within each partition as a single step. This is 
faster than partitioning first and then
+  // applying a sort.
+  Comparator> comparator = 
(Comparator> & Serializable) (t1, t2) -> {
+HoodieKey key1 = t1._1();
+HoodieKey key2 = t2._1();
+return key1.getRecordKey().compareTo(key2.getRecordKey());
+  };
+  partitionedRDD = 
mappedRDD.repartitionAndSortWithinPartitions(partitioner, comparator);
+} else {
+  // Partition only
+  partitionedRDD = mappedRDD.partitionBy(partitioner);
+}
+
+return 
HoodieJavaRDD.of(partitionedRDD.map(Tuple2::_2).mapPartitionsWithIndex((partition,
 recordItr) -> {
+  if (WriteOperationType.isChangingRecords(operationType)) {
+return handleUpsertPartition(instantTime, partition, recordItr, 
partitioner);
+  } else {
+return handleInsertPartition(instantTime, partition, recordItr, 
partitioner);
+  }
+}, true).flatMap(List::iterator));
+  }
+
+  private boolean operationRequiresSorting() {
+return operationType == WriteOperationType.INSERT && 
config.getBoolean(INSERT_SORT);

Review Comment:
   OK, here we are skipping upserts. But should this be done for upserts too?



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java:
##
@@ -411,4 +427,90 @@ public Partitioner getLayoutPartitioner(WorkloadProfile 
profile, String layoutPa
   protected void 
runPrecommitValidators(HoodieWriteMetadata> 
writeMetadata) {
 SparkValidatorUtils.runValidators(config, writeMetadata, context, table, 
instantTime);
   }
+
+  private HoodieData 
sortAndMapPartitionsAsRDD(HoodieData> dedupedRecords, 
Partitioner partitioner) {

Review Comment:
   Let's unit test this method?



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java:
##
@@ -411,4 +427,90 @@ public Partitioner getLayoutPartitioner(WorkloadProfile 
profile, String layoutPa
   protected void 
runPrecommitValidators(HoodieWriteMetadata> 
writeMetadata) {
 SparkValidatorUtils.runValidators(config, writeMetadata, context, table, 
instantTime);
   }
+
+  private HoodieData 
sortAndMapPartitionsAsRDD(HoodieData> dedupedRecords, 
Partitioner partitioner) {
+JavaPairRDD, HoodieRecord> mappedRDD = 
getSortedIndexedRecords(dedupedRecords);
+JavaPairRDD, HoodieRecord> partitionedRDD;
+if (table.requireSortedRecords()) {
+  // Partition and sort within each partition as a single step. This is 
faster than partitioning first and then
+  // applying a sort.
+  Comparator> comparator = 
(Comparator> & Serializable) (t1, t2) -> {
+HoodieKey key1 = t1._1();
+HoodieKey key2

Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-19 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2007159791

   
   ## CI report:
   
   * b802619f011c1d9ef5b334ecf67ab7df74964e08 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22958)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-19 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2007022918

   
   ## CI report:
   
   * bd71699ccef3e28be182c2cd5f8093b0cb507694 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22951)
 
   * b802619f011c1d9ef5b334ecf67ab7df74964e08 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22958)
 
   
   



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-19 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2007009003

   
   ## CI report:
   
   * bd71699ccef3e28be182c2cd5f8093b0cb507694 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22951)
 
   * b802619f011c1d9ef5b334ecf67ab7df74964e08 UNKNOWN
   
   



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-19 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2006227969

   
   ## CI report:
   
   * bd71699ccef3e28be182c2cd5f8093b0cb507694 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22951)
 
   
   



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-19 Thread via GitHub


bhat-vinay commented on code in PR #10876:
URL: https://github.com/apache/hudi/pull/10876#discussion_r1529876646


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java:
##
@@ -90,8 +94,11 @@ public class UpsertPartitioner extends 
SparkHoodiePartitioner {
   public UpsertPartitioner(WorkloadProfile profile, HoodieEngineContext 
context, HoodieTable table,

Review Comment:
   Done. The changes were minimal, hence did not add it earlier.






Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-19 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2006041219

   
   ## CI report:
   
   * 5016a9c8d9daeea9f6f28f63cc090514482571a4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22941)
 
   * bd71699ccef3e28be182c2cd5f8093b0cb507694 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22951)
 
   
   



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-19 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2006019061

   
   ## CI report:
   
   * 5016a9c8d9daeea9f6f28f63cc090514482571a4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22941)
 
   * bd71699ccef3e28be182c2cd5f8093b0cb507694 UNKNOWN
   
   



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-18 Thread via GitHub


rmahindra123 commented on code in PR #10876:
URL: https://github.com/apache/hudi/pull/10876#discussion_r1529460065


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java:
##
@@ -90,8 +94,11 @@ public class UpsertPartitioner extends 
SparkHoodiePartitioner {
   public UpsertPartitioner(WorkloadProfile profile, HoodieEngineContext 
context, HoodieTable table,

Review Comment:
   Should we add the implementation to a new class, maybe 
`sortedUpsertPartitioner` or something, so there is a clean separation? We can 
use the same config to control which one gets called.
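
   The separation suggested here could be sketched roughly as follows: the 
sorted variant lives in its own subclass and a single flag decides which one 
is constructed. All class names (`HashBucketPartitioner`, 
`SortedBucketPartitioner`, `PartitionerFactory`) and the equal-sized range 
rule are assumptions made for this example and are not Hudi code.

```java
// Hypothetical base class: assigns a record to a bucket by hashing its key.
class HashBucketPartitioner {
  protected final int numBuckets;

  HashBucketPartitioner(int numBuckets) {
    if (numBuckets <= 0) {
      throw new IllegalArgumentException("numBuckets must be positive");
    }
    this.numBuckets = numBuckets;
  }

  int getPartition(String recordKey, long sortedIndex, long totalRecords) {
    // floorMod keeps the result non-negative even for negative hash codes.
    return Math.floorMod(recordKey.hashCode(), numBuckets);
  }
}

// Hypothetical subclass: ignores the key and uses the record's sorted position,
// so each bucket receives one contiguous, equal-sized range of the sorted input.
final class SortedBucketPartitioner extends HashBucketPartitioner {
  SortedBucketPartitioner(int numBuckets) {
    super(numBuckets);
  }

  @Override
  int getPartition(String recordKey, long sortedIndex, long totalRecords) {
    // Ceiling division; at least 1 to avoid dividing by zero on an empty batch.
    long perBucket = Math.max(1, (totalRecords + numBuckets - 1) / numBuckets);
    return (int) Math.min(numBuckets - 1, sortedIndex / perBucket);
  }
}

// The same flag that enables sorting picks the partitioner implementation.
final class PartitionerFactory {
  static HashBucketPartitioner create(boolean sortEnabled, int numBuckets) {
    return sortEnabled
        ? new SortedBucketPartitioner(numBuckets)
        : new HashBucketPartitioner(numBuckets);
  }

  public static void main(String[] args) {
    HashBucketPartitioner sorted = create(true, 3);
    for (long i = 0; i < 10; i++) {
      System.out.println("index " + i + " -> bucket " + sorted.getPartition("key" + i, i, 10));
    }
  }
}
```

   Keeping the hash-based path untouched in the base class means the existing 
behaviour is unchanged whenever the flag is off.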






Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2003610845

   
   ## CI report:
   
   * 5016a9c8d9daeea9f6f28f63cc090514482571a4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22941)
 
   
   



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2003452778

   
   ## CI report:
   
   * f3c15a77a88d778d532dcc3fbed186441b3fa04c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22937)
 
   * 5016a9c8d9daeea9f6f28f63cc090514482571a4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22941)
 
   
   



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2003422824

   
   ## CI report:
   
   * f3c15a77a88d778d532dcc3fbed186441b3fa04c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22937)
 
   * 5016a9c8d9daeea9f6f28f63cc090514482571a4 UNKNOWN
   
   



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2003139899

   
   ## CI report:
   
   * f3c15a77a88d778d532dcc3fbed186441b3fa04c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22937)
 
   
   



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2003066457

   
   ## CI report:
   
   * f3c15a77a88d778d532dcc3fbed186441b3fa04c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22937)
 
   
   
   Bot commands



Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

2024-03-18 Thread via GitHub


hudi-bot commented on PR #10876:
URL: https://github.com/apache/hudi/pull/10876#issuecomment-2003058078

   
   ## CI report:
   
   * f3c15a77a88d778d532dcc3fbed186441b3fa04c UNKNOWN
   
   