[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7413: [HUDI-5321] Fix inconsistencies in arePartitionRecordsSorted and try to limit lots of small files during bulk insert

GitBox Fri, 16 Dec 2022 14:09:26 -0800


alexeykudinkin commented on code in PR #7413:
URL: https://github.com/apache/hudi/pull/7413#discussion_r1051199250



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/SpatialCurveSortPartitionerBase.java:
##########
@@ -78,6 +78,8 @@ protected Dataset<Row> reorder(Dataset<Row> dataset, int 
numOutputGroups) {
 
   @Override
   public boolean arePartitionRecordsSorted() {
-    return true;
+    //The data is sorted using a function that maps multiple columns into a 
single dimension.

Review Comment:
   nit: Please make sure comments are formatted in-line with the rest of the 
code-base



##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala:
##########
@@ -1046,4 +1048,62 @@ class TestInsertTable extends HoodieSparkSqlTestBase {
       )
     }
   }
+
+  /**
+   * This test is to make sure that bulk insert doesn't create a bunch of tiny 
files if
+   * hoodie.bulkinsert.user.defined.partitioner.sort.columns doesn't start 
with the partition columns
+   *
+   */
+  forAll(BulkInsertSortMode.values().toList) { (sortMode: BulkInsertSortMode) 
=>
+    val sortModeName = sortMode.name()
+    test(s"Test Bulk Insert with BulkInsertSortMode: '$sortModeName'") {

Review Comment:
   Unfortunately, i don't think this is going to work -- Scalatest requires 
`test` method have to be at the top level, so we'd have to move this iteration 
into it



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/BulkInsertPartitioner.java:
##########
@@ -72,4 +78,31 @@ default Option<WriteHandleFactory> getWriteHandleFactory(int 
partitionId) {
     return Option.empty();
   }
 
+  /*
+   * If possible, we want to sort the data by partition path. Doing so will 
reduce the number of files written.
+   **/
+  static String[] prependPartitionPathColumn(String[] columnNames, 
HoodieWriteConfig config) {
+    String partitionPath = 
config.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key());
+    if (partitionPath.isEmpty() || 
config.getMetadataConfig().populateMetaFields()) {

Review Comment:
   When meta-fields are enabled, we should use `PARTITION_PATH_METADATA_FIELD` 
instead of `KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7413: [HUDI-5321] Fix inconsistencies in arePartitionRecordsSorted and try to limit lots of small files during bulk insert

Reply via email to