cloud-fan commented on a change in pull request #33310:
URL: https://github.com/apache/spark/pull/33310#discussion_r669399284



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeLocalShuffleReader.scala
##########
@@ -85,12 +85,24 @@ object OptimizeLocalShuffleReader extends CustomShuffleReaderRule {
     val expectedParallelism = advisoryParallelism.getOrElse(numReducers)
     val splitPoints = if (numMappers == 0) {
       Seq.empty
-    } else {
-      equallyDivide(numReducers, math.max(1, expectedParallelism / numMappers))
+    } else if (expectedParallelism >= numMappers) {
+      equallyDivide(numReducers, expectedParallelism / numMappers)
+    }
+    else {
+      equallyDivide(numMappers, expectedParallelism)
+    }
+    if (expectedParallelism >= numMappers) {
+      (0 until numMappers).flatMap { mapIndex =>
+        (splitPoints :+ numReducers).sliding(2).map {
+          case Seq(start, end) => PartialMapperPartitionSpec(mapIndex, start, end)
+        }
+      }
     }
-    (0 until numMappers).flatMap { mapIndex =>
-      (splitPoints :+ numReducers).sliding(2).map {
-        case Seq(start, end) => PartialMapperPartitionSpec(mapIndex, start, end)
+    else {
+      (0 until 1).flatMap { _ =>
+        (splitPoints :+ numMappers).sliding(2).map {
+          case Seq(start, end) => CoalescedMapperPartitionSpec(start, end, numReducers)

Review comment:
       I'm wondering if we should have a more meticulous algorithm.
   
   Let's say there are 3 mappers and 2 reducers, so 6 shuffle blocks in total: `(M0, R0), (M0, R1), (M1, R0), (M1, R1), (M2, R0), (M2, R1)`. If the expected parallelism is 2, I think each task should read 3 blocks:
   task 0: `(M0, R0), (M0, R1), (M1, R0)`
   task 1: `(M1, R1), (M2, R0), (M2, R1)`
   
   So one task can read some entire mappers and part of one mapper.
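
   A minimal sketch of the idea above (hypothetical illustration, not the PR's code; `assignBlocks` and `BlockSplitSketch` are made-up names): enumerate the `numMappers * numReducers` blocks in mapper-major order and split them as evenly as possible across tasks, so one task can cover some whole mappers plus part of a neighboring mapper.

```scala
// Hypothetical sketch: evenly divide shuffle blocks across tasks in
// mapper-major order, so a task may span mapper boundaries.
object BlockSplitSketch {
  // Returns, for each task, the list of (mapperIndex, reducerIndex) blocks it reads.
  def assignBlocks(
      numMappers: Int,
      numReducers: Int,
      parallelism: Int): Seq[Seq[(Int, Int)]] = {
    // All blocks in mapper-major order: (M0,R0), (M0,R1), (M1,R0), ...
    val blocks = for {
      m <- 0 until numMappers
      r <- 0 until numReducers
    } yield (m, r)
    val total = blocks.length
    // Each task gets floor(total / parallelism) blocks; the first
    // (total % parallelism) tasks get one extra.
    val base = total / parallelism
    val extra = total % parallelism
    val sizes = Seq.fill(extra)(base + 1) ++ Seq.fill(parallelism - extra)(base)
    var rest = blocks
    sizes.map { n =>
      val (head, tail) = rest.splitAt(n)
      rest = tail
      head
    }
  }

  def main(args: Array[String]): Unit = {
    // 3 mappers x 2 reducers, parallelism 2: each task reads 3 blocks,
    // and task 0 reads all of M0 plus part of M1.
    assignBlocks(3, 2, 2).foreach(println)
  }
}
```

   With the example from the comment (3 mappers, 2 reducers, parallelism 2), this yields exactly the split described: task 0 gets `(0,0), (0,1), (1,0)` and task 1 gets `(1,1), (2,0), (2,1)`.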




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


