[ 
https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haiyangyu updated SPARK-34534:
------------------------------
    Description: 
`OneForOneBlockFetcher` now builds a new RPC message, `FetchShuffleBlocks`, 
during initialization in place of `OpenBlocks` in order to use the adaptive 
feature. This introduces the following problem.

`OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk 
fetch successes. It uses the index into `blockIds` to fetch each chunk and, 
when chunk data returns, uses the same index to look up the matching blockId. 
The order of `blockIds` must therefore be consistent with the fetchChunk 
index, but the chunks returned for the new `FetchShuffleBlocks` message are 
not in the same order as `blockIds`.
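
The index mismatch described above can be illustrated with a small Python 
model. This is a sketch only, not Spark code: it assumes that the shuffle 
service groups the requested blocks by map id and returns the chunks map by 
map, and the helper names `parse_block_id` and `server_chunk_order` are 
hypothetical.

```python
def parse_block_id(block_id):
    # Block ids are assumed to look like "shuffle_<shuffleId>_<mapId>_<reduceId>".
    _, shuffle_id, map_id, reduce_id = block_id.split("_")
    return int(shuffle_id), int(map_id), int(reduce_id)


def server_chunk_order(block_ids):
    # ASSUMPTION for illustration: the server groups blocks by map id
    # (in first-seen order) and emits the chunks one map id at a time.
    by_map = {}
    for block_id in block_ids:
        _, map_id, _ = parse_block_id(block_id)
        by_map.setdefault(map_id, []).append(block_id)
    return [b for blocks in by_map.values() for b in blocks]


# The client's view: chunk i is assumed to hold the data of block_ids[i].
block_ids = ["shuffle_0_1_0", "shuffle_0_0_0", "shuffle_0_1_1"]

# The server's view: chunk i holds the data of chunks[i].
chunks = server_chunk_order(block_ids)

# Pairs of (label the client assigns, data actually in the chunk) that differ.
mismatches = [
    (block_ids[i], chunks[i])
    for i in range(len(block_ids))
    if block_ids[i] != chunks[i]
]

for label, actual in mismatches:
    print(f"client labels chunk as {label}, but it contains {actual}")
```

In this model, chunks 1 and 2 are swapped relative to `blockIds`, so the 
client would attach the wrong blockId to two of the three chunks.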

As a result, the returned data may not match the blockId, which can cause 
data correctness problems when the fetch is retried after a chunk fetch fails.

The code that determines the chunk fetch order and the code that matches the 
blockId when data returns are as follows: 

!image-2021-02-25-11-27-34-429.png|width=446,height=251!!image-2021-02-25-11-28-31-255.png|width=445,height=159!

However, the fetch order in the shuffle service is:

!image-2021-02-25-11-30-03-834.png|width=510,height=361!

So, when a chunk fetch fails, the wrong block data is fetched because the 
blocks are in the wrong order.

!image-2021-02-25-11-31-59-110.png|width=601,height=204!
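
The retry scenario can be modelled in the same way. The following Python 
sketch is a simplification under stated assumptions, not the actual Spark 
implementation: it assumes the server returns chunks grouped by map id, that 
the client labels chunk i as `blockIds[i]`, and that on a failure at chunk i 
the client retries `blockIds[i:]`. The names `server_chunk_order` and 
`fetch_once` are hypothetical.

```python
def server_chunk_order(block_ids):
    # ASSUMPTION for illustration: chunks are returned grouped by map id,
    # where a block id looks like "shuffle_<shuffleId>_<mapId>_<reduceId>".
    by_map = {}
    for block_id in block_ids:
        map_id = int(block_id.split("_")[2])
        by_map.setdefault(map_id, []).append(block_id)
    return [b for blocks in by_map.values() for b in blocks]


def fetch_once(block_ids, fail_at=None):
    """Simulate one fetch attempt.

    Returns (received, to_retry), where `received` maps the blockId label the
    client assigns to the block whose data was actually delivered, and
    `to_retry` lists the blockIds the client believes are still outstanding.
    """
    chunks = server_chunk_order(block_ids)
    received = {}
    for i, label in enumerate(block_ids):
        if fail_at is not None and i == fail_at:
            # The client treats everything from index i onward as failed.
            return received, block_ids[i:]
        received[label] = chunks[i]
    return received, []


block_ids = ["shuffle_0_1_0", "shuffle_0_0_0", "shuffle_0_1_1"]

# First attempt fails on the third chunk; the retry then succeeds.
first, to_retry = fetch_once(block_ids, fail_at=2)
second, _ = fetch_once(to_retry)

delivered = sorted(list(first.values()) + list(second.values()))
missing = sorted(set(block_ids) - set(delivered))

print("delivered data:", delivered)
print("never delivered:", missing)
```

Under these assumptions, the data of one block is delivered twice and the 
data of another block is never delivered, which matches the data loss and 
correctness symptoms reported in this issue.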

 

 

  was:
`OneForOneBlockFetcher` now builds a new RPC message, `FetchShuffleBlocks`, 
during initialization in place of `OpenBlocks` in order to use the adaptive 
feature. This introduces the following problem.

`OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk 
fetch successes. It uses the index into `blockIds` to fetch each chunk and, 
when chunk data returns, uses the same index to look up the matching blockId. 
The order of `blockIds` must therefore be consistent with the fetchChunk 
index.

!image-2021-02-25-11-17-12-714.png|width=875,height=502!


> New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or 
> correctness
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-34534
>                 URL: https://issues.apache.org/jira/browse/SPARK-34534
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 3.0.0, 3.0.1, 3.0.2
>            Reporter: haiyangyu
>            Priority: Major
>              Labels: Correctness, data-loss
>         Attachments: image-2021-02-25-11-17-12-714.png, 
> image-2021-02-25-11-27-34-429.png, image-2021-02-25-11-28-31-255.png, 
> image-2021-02-25-11-30-03-834.png, image-2021-02-25-11-31-59-110.png
>
>


