[ https://issues.apache.org/jira/browse/SPARK-26907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776150#comment-16776150 ]

Han Altae-Tran commented on SPARK-26907:
----------------------------------------

Ok, thank you. I will try the mailing list.

I think my main point is that Spark could be improved for use with preemptible 
virtual machines if shuffle files could be replicated across the cluster. In my 
experience, whenever a stage depends on shuffle map output, a single preempted 
node can cause the entire stage to be retried, which costs a huge amount of 
uptime because tasks keep failing until the retry is initiated. Using persist 
with replication doesn't seem to help here, so I figured there is an 
optimization around shuffle files that could be made for this use case.
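To make the pattern concrete, here is a minimal sketch (Scala, toy data, made-up 
names) of the kind of job I mean: a shuffle followed by persist with a replicated 
storage level. The 10x factor just mirrors the issue description and is not a 
recommendation.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Minimal sketch: a shuffle (groupByKey) followed by persist with a
// replicated storage level. Dataset and names are made up for illustration.
object ShuffleReplicationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shuffle-replication-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Custom storage level: disk + memory, deserialized, replication factor 10
    // (the factor is taken from the issue description, not a recommended value).
    val replicated10x = StorageLevel(true, true, false, true, 10)

    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, i))

    // The persist replicates the blocks of the shuffled RDD once they are
    // computed, but the map-side shuffle files that feed the reduce tasks live
    // only where they were written, so losing that node can still surface as a
    // FetchFailedException and a stage retry.
    val grouped = pairs.groupByKey().persist(replicated10x)
    grouped.count() // materialize so the replicated blocks actually exist

    spark.stop()
  }
}
{code}

Even when the persisted RDD is replicated, the fetch of the map-side shuffle 
files has no such redundancy, which is the gap I am pointing at.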

> Does ShuffledRDD Replication Work With External Shuffle Service
> ---------------------------------------------------------------
>
>                 Key: SPARK-26907
>                 URL: https://issues.apache.org/jira/browse/SPARK-26907
>             Project: Spark
>          Issue Type: Question
>          Components: Block Manager, YARN
>    Affects Versions: 2.3.2
>            Reporter: Han Altae-Tran
>            Priority: Major
>
> I am interested in working with high-replication environments for extreme 
> fault tolerance (e.g. 10x replication), but have noticed that when using 
> groupBy or groupWith followed by persist (with 10x replication), even if only 
> one node fails, the entire stage can fail with FetchFailedException.
>  
> Is this because the External Shuffle Service writes and serves intermediate 
> shuffle data only to/from the local disk attached to the executor that 
> generated it, causing Spark to ignore possibly replicated shuffle data (from 
> the persist) that may be available elsewhere? If so, is there any way to 
> increase the replication factor of the External Shuffle Service to make it 
> fault tolerant?
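For context, a minimal sketch of the configuration referred to above, assuming a 
YARN deployment (the app name is made up). These are standard properties that 
enable the external shuffle service; none of them controls replication of the 
shuffle data itself:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object ExternalShuffleServiceQuestion {
  def main(args: Array[String]): Unit = {
    // Standard settings for running with the external shuffle service on YARN.
    // These only enable the service; nothing here replicates the shuffle files
    // themselves, which is exactly what the question above asks about.
    val conf = new SparkConf()
      .setAppName("external-shuffle-service-question") // made-up app name
      .set("spark.shuffle.service.enabled", "true")    // reducers fetch map output via the NodeManager-hosted service
      .set("spark.dynamicAllocation.enabled", "true")  // a common reason the service is enabled in the first place
      .set("spark.task.maxFailures", "8")              // extra task retries, but a lost node still forces a stage retry

    val spark = SparkSession.builder().config(conf).getOrCreate()
    // ... job with groupBy / groupWith and persist, as in the description above ...
    spark.stop()
  }
}
{code}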


