[jira] [Commented] (SPARK-36105) OptimizeLocalShuffleReader support reading data of multiple mappers in one task

Apache Spark (Jira) Mon, 12 Jul 2021 17:28:06 -0700


    [ https://issues.apache.org/jira/browse/SPARK-36105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379480#comment-17379480 ]


Apache Spark commented on SPARK-36105:
--------------------------------------

User 'michaelzhang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/33310

> OptimizeLocalShuffleReader support reading data of multiple mappers in one 
> task
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-36105
>                 URL: https://issues.apache.org/jira/browse/SPARK-36105
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.2
>            Reporter: Michael Zhang
>            Priority: Minor
>
> Right now, OptimizeLocalShuffleReader calls equallyDivide to try to match 
> the total parallelism of the local shuffle reader to the original 
> parallelism: the shuffle partition number when there is no coalescing 
> (i.e., a shuffle stage without CustomShuffleReaderExec), or the coalesced 
> partition number when there is coalescing (i.e., a shuffle stage with 
> CustomShuffleReaderExec on top).
> This is based on the assumption that the target parallelism is bigger than 
> the number of mappers, so equallyDivide will assign a range of reducer ids of 
> the same mapper to each downstream task, and that is why 
> PartialMapperPartitionSpec has a mapIndex together with a reducerStartIndex 
> and a reducerEndIndex.
> However, the target parallelism can also be smaller than the number of 
> mappers. In that case, we need to “coalesce” the mappers by assigning a 
> range of mapper ids to each downstream task. For that purpose, we might need 
> to introduce a new type of ShufflePartitionSpec with a mapStartIndex and a 
> mapEndIndex, meaning that each task reads all reducer outputs of the mappers 
> from mapStartIndex (inclusive) to mapEndIndex (exclusive). Note that this is 
> different from CoalescedPartitionSpec, which reads all mapper outputs for 
> the reducers from reduceStartIndex to reduceEndIndex.
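
The mapper-coalescing idea in the description can be illustrated with a small standalone sketch. It is written in Python for brevity and is not Spark code: MapperRangeSpec, equally_divide and coalesce_mappers are hypothetical names chosen for this example, and the even split is only loosely modelled on the equallyDivide helper named above, whose actual implementation may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MapperRangeSpec:
    """Hypothetical spec: one task reads all reducer outputs of the
    mappers in [map_start_index, map_end_index)."""
    map_start_index: int  # inclusive
    map_end_index: int    # exclusive


def equally_divide(num_elements: int, num_buckets: int) -> List[int]:
    """Return the start index of each bucket when num_elements are split
    as evenly as possible; earlier buckets absorb the remainder."""
    base, extra = divmod(num_elements, num_buckets)
    starts, pos = [], 0
    for i in range(num_buckets):
        starts.append(pos)
        pos += base + (1 if i < extra else 0)
    return starts


def coalesce_mappers(num_mappers: int, target_parallelism: int) -> List[MapperRangeSpec]:
    """Assign a contiguous range of mapper ids to each downstream task."""
    if num_mappers <= 0 or target_parallelism <= 0:
        return []
    # Never create more tasks than there are mappers to read from.
    num_tasks = min(num_mappers, target_parallelism)
    starts = equally_divide(num_mappers, num_tasks)
    ends = starts[1:] + [num_mappers]
    return [MapperRangeSpec(s, e) for s, e in zip(starts, ends)]
```

With 10 mappers and a target parallelism of 3, this sketch yields the mapper ranges [0, 4), [4, 7) and [7, 10), so each downstream task reads the complete output of several mappers rather than a slice of one mapper's reducers.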



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
