[ https://issues.apache.org/jira/browse/SPARK-36105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379452#comment-17379452 ]
Michael Zhang commented on SPARK-36105:
---------------------------------------

[~cloud_fan] I'm working on this issue; could you assign it to me?

> OptimizeLocalShuffleReader support reading data of multiple mappers in one task
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-36105
>                 URL: https://issues.apache.org/jira/browse/SPARK-36105
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.2
>            Reporter: Michael Zhang
>            Priority: Minor
>
> Right now OptimizeLocalShuffleReader tries to match the parallelism of the local shuffle reader against the original shuffle partition number if there is no coalescing (i.e., a shuffle stage without CustomShuffleReaderExec), or against the coalesced partition number if there is coalescing (i.e., a shuffle stage with CustomShuffleReaderExec on top), by calling equallyDivide.
> This is based on the assumption that the target parallelism is bigger than the number of mappers, so equallyDivide assigns a range of reducer ids of the same mapper to each downstream task; that is why PartialMapperPartitionSpec has a mapIndex together with a reducerStartIndex and a reducerEndIndex.
> However, it is also possible that the target parallelism is smaller than the number of mappers. In that case, we need to “coalesce” the mappers by assigning a range of mapper ids to each downstream task. For that purpose, we might need to introduce a new type of ShufflePartitionSpec, which has a mapStartIndex and a mapEndIndex, with the implication that each task reads all reducer outputs from mapStartIndex (inclusive) to mapEndIndex (exclusive). Note that this is different from CoalescedPartitionSpec, which reads all mapper outputs from reduceStartIndex to reduceEndIndex.
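The mapper-coalescing idea above could be sketched roughly as follows. This is a simplified illustration, not Spark's actual implementation: the spec name CoalescedMapperPartitionSpec is the hypothetical new ShufflePartitionSpec proposed in the description, and equallyDivide here is a standalone helper mimicking the contiguous-range splitting the description attributes to Spark's utility of the same name.

```scala
// Illustrative sketch only; names below are assumptions, not Spark internals.
object LocalShuffleSketch {
  // Existing-style spec: one mapper, a contiguous range of reducer ids
  // (mirrors the PartialMapperPartitionSpec described in the issue).
  case class PartialMapperPartitionSpec(mapIndex: Int,
                                        reducerStartIndex: Int,
                                        reducerEndIndex: Int)

  // Proposed spec: each task reads ALL reducer outputs of the mappers in
  // [mapStartIndex, mapEndIndex).
  case class CoalescedMapperPartitionSpec(mapStartIndex: Int,
                                          mapEndIndex: Int)

  // Split `numElements` ids into `numBuckets` contiguous, nearly equal
  // (start, end) ranges, end exclusive.
  def equallyDivide(numElements: Int, numBuckets: Int): Seq[(Int, Int)] = {
    val base  = numElements / numBuckets
    val extra = numElements % numBuckets
    // Running start offsets: the first `extra` buckets get one extra element.
    val starts = (0 until numBuckets).scanLeft(0) { (acc, i) =>
      acc + base + (if (i < extra) 1 else 0)
    }
    starts.sliding(2).map { case Seq(s, e) => (s, e) }.toSeq
  }

  // When the target parallelism is smaller than the number of mappers,
  // assign a mapper-id range to each downstream task.
  def coalesceMappers(numMappers: Int,
                      targetParallelism: Int): Seq[CoalescedMapperPartitionSpec] =
    equallyDivide(numMappers, targetParallelism).map { case (s, e) =>
      CoalescedMapperPartitionSpec(s, e)
    }
}
```

For example, coalescing 10 mappers down to 3 tasks would yield mapper ranges [0, 4), [4, 7), and [7, 10), so each task scans every reducer output of its assigned mappers.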
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org