[ 
https://issues.apache.org/jira/browse/SPARK-34840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-34840.
-----------------------------------------
    Fix Version/s: 3.1.2
                   3.2.0
       Resolution: Fixed

Issue resolved by pull request 31934
[https://github.com/apache/spark/pull/31934]

> Fix cases of corruption in merged shuffle blocks that are pushed
> ----------------------------------------------------------------
>
>                 Key: SPARK-34840
>                 URL: https://issues.apache.org/jira/browse/SPARK-34840
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 3.1.0
>            Reporter: Chandni Singh
>            Assignee: Chandni Singh
>            Priority: Major
>             Fix For: 3.2.0, 3.1.2
>
>
> The {{RemoteBlockPushResolver}} which handles the shuffle push blocks and 
> merges them was introduced in 
> [#30062|https://github.com/apache/spark/pull/30062]. We have identified 2 
> scenarios where the merged blocks get corrupted:
>  # {{StreamCallback.onFailure()}} is called more than once. Initially we 
> assumed that the onFailure callback will be called just once per stream. 
> However, we observed that this is called twice when a client connection is 
> reset. When the client connection is reset then there are 2 events that get 
> triggered in this order.
>  * {{exceptionCaught}}. This event is propagated to {{StreamInterceptor}}. 
> {{StreamInterceptor.exceptionCaught()}} invokes 
> {{callback.onFailure(streamId, cause)}}. This is the first time 
> StreamCallback.onFailure() will be invoked.
>  * {{channelInactive}}. Since the channel closes, the {{channelInactive}} 
> event gets triggered which again is propagated to {{StreamInterceptor}}. 
> {{StreamInterceptor.channelInactive()}} invokes 
> {{callback.onFailure(streamId, new ClosedChannelException())}}. This is the 
> second time StreamCallback.onFailure() will be invoked.
>  # The flag {{isWriting}} is set prematurely to true. This introduces an edge 
> case where a stream that is trying to merge a duplicate block (created 
> because of a speculative task) may interfere with an active stream if the 
> duplicate stream fails.
> Also adding additional changes that improve the code.
>  # Using positional writes all the time because this simplifies the code and 
> with microbenchmarking haven't seen any performance impact.
>  # Additional minor changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to