[GitHub] [hudi] yihua commented on a diff in pull request #5337: [HUDI-3895] Fixing files partitioning sequence for `BaseFileOnlyRelation`

GitBox Mon, 18 Apr 2022 19:55:23 -0700


yihua commented on code in PR #5337:
URL: https://github.com/apache/hudi/pull/5337#discussion_r852551786



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:
##########
@@ -84,21 +84,24 @@ class BaseFileOnlyRelation(sqlContext: SQLContext,
 
   protected def collectFileSplits(partitionFilters: Seq[Expression], 
dataFilters: Seq[Expression]): Seq[HoodieBaseFileSplit] = {
     val partitions = listLatestBaseFiles(globPaths, partitionFilters, 
dataFilters)
-    val fileSplits = partitions.values.toSeq.flatMap { files =>
-      files.flatMap { file =>
-        // TODO move to adapter
-        // TODO fix, currently assuming parquet as underlying format
-        HoodieDataSourceHelper.splitFiles(
-          sparkSession = sparkSession,
-          file = file,
-          // TODO clarify why this is required
-          partitionValues = getPartitionColumnsAsInternalRow(file)
-        )
+    val fileSplits = partitions.values.toSeq
+      .flatMap { files =>
+        files.flatMap { file =>
+          // TODO fix, currently assuming parquet as underlying format
+          HoodieDataSourceHelper.splitFiles(
+            sparkSession = sparkSession,
+            file = file,
+            partitionValues = getPartitionColumnsAsInternalRow(file)
+          )
+        }
       }
-    }
+      // NOTE: It's important to order the splits in the reverse order of their
+      //       size so that we can subsequently bucket them in an efficient 
manner
+      .sortBy(_.length)(implicitly[Ordering[Long]].reverse)

Review Comment:
   @alexeykudinkin we still need to keep the following TODOS, to make sure we 
don't lose context?
   ```
   // TODO move to adapter
   // TODO clarify why this is required
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [hudi] yihua commented on a diff in pull request #5337: [HUDI-3895] Fixing files partitioning sequence for `BaseFileOnlyRelation`

Reply via email to