alexeykudinkin commented on code in PR #7413: URL: https://github.com/apache/hudi/pull/7413#discussion_r1051199250
########## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/SpatialCurveSortPartitionerBase.java: ########## @@ -78,6 +78,8 @@ protected Dataset<Row> reorder(Dataset<Row> dataset, int numOutputGroups) { @Override public boolean arePartitionRecordsSorted() { - return true; + //The data is sorted using a function that maps multiple columns into a single dimension. Review Comment: nit: Please make sure comments are formatted in-line with the rest of the code-base ########## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala: ########## @@ -1046,4 +1048,62 @@ class TestInsertTable extends HoodieSparkSqlTestBase { ) } } + + /** + * This test is to make sure that bulk insert doesn't create a bunch of tiny files if + * hoodie.bulkinsert.user.defined.partitioner.sort.columns doesn't start with the partition columns + * + */ + forAll(BulkInsertSortMode.values().toList) { (sortMode: BulkInsertSortMode) => + val sortModeName = sortMode.name() + test(s"Test Bulk Insert with BulkInsertSortMode: '$sortModeName'") { Review Comment: Unfortunately, i don't think this is going to work -- Scalatest requires `test` method have to be at the top level, so we'd have to move this iteration into it ########## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/BulkInsertPartitioner.java: ########## @@ -72,4 +78,31 @@ default Option<WriteHandleFactory> getWriteHandleFactory(int partitionId) { return Option.empty(); } + /* + * If possible, we want to sort the data by partition path. Doing so will reduce the number of files written. + **/ + static String[] prependPartitionPathColumn(String[] columnNames, HoodieWriteConfig config) { + String partitionPath = config.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key()); + if (partitionPath.isEmpty() || config.getMetadataConfig().populateMetaFields()) { Review Comment: When meta-fields are enabled, we should use `PARTITION_PATH_METADATA_FIELD` instead of `KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org