Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/21698

Actually I have been thinking about the issue, and I feel we can still go further with the current patch to fix the issue with child stages that @mridulm mentioned above. That requires tracking, for each active/finished stage, all the preceding stages that can generate non-deterministic output, and also tracking the stages that have only partially finished (some tasks in the stage have finished, while the rest are still running or pending).

In case of a FetchFailure (or ExecutorLost/WorkerLost), if the affected stage can generate non-deterministic output, then clear all the shuffle files from that stage (to make sure we'll retry all tasks for that stage), and also check all partially finished stages that are succeeding stages of the affected stage and clear all of their shuffle files as well. (If a stage is a succeeding stage of the affected stage but it has already finished successfully and none of its shuffle files are lost, then it's fine; we don't need to retry it.)

The above fix proposal requires more refactoring of DAGScheduler, and it will consume some memory to store the additional information (assuming you have M active/finished stages and N stages that may generate non-deterministic output, it takes at most M * N Int values to store the stage ids). Normally N is far smaller than the total number of stages, so I think the extra cost should be acceptable. Also, you always have to pay a lot of extra effort on a FetchFailure; this may increase that effort by several times, but it should be statistically fine considering that most Spark jobs are short-running and don't hit FetchFailure often. (The major advantage of this approach is that you don't pay any penalty if you don't hit a FetchFailure.)
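The rollback rule described above can be sketched roughly as follows. This is a minimal toy model, not Spark's actual DAGScheduler code; `StageInfo`, `ancestors`, and `stagesToRollBack` are hypothetical names invented for illustration:

```scala
// Minimal model (assumption, not Spark's real API): each stage records its id,
// whether it may generate non-deterministic output, its preceding stage ids,
// and whether it has only partially finished.
case class StageInfo(
    id: Int,
    indeterminate: Boolean,      // may generate non-deterministic output
    parents: Set[Int],           // preceding (parent) stages
    partiallyFinished: Boolean)  // some tasks done, the rest running/pending

// All preceding stages of `id`, computed transitively over the DAG.
def ancestors(stages: Map[Int, StageInfo], id: Int): Set[Int] = {
  val ps = stages(id).parents
  ps ++ ps.flatMap(p => ancestors(stages, p))
}

// On a FetchFailure affecting `failedId`: if that stage is indeterminate, it
// must be fully rerun (all its shuffle files cleared), and every partially
// finished succeeding stage must be rolled back too. Fully finished succeeding
// stages whose shuffle files are intact keep their output.
def stagesToRollBack(stages: Map[Int, StageInfo], failedId: Int): Set[Int] = {
  if (!stages(failedId).indeterminate) Set(failedId)
  else Set(failedId) ++ stages.values.collect {
    case s if ancestors(stages, s.id).contains(failedId) && s.partiallyFinished => s.id
  }
}

// Example DAG: stage 0 is indeterminate; stages 1 and 3 are partially finished
// descendants of it; stage 2 is a fully finished descendant.
val stages = Map(
  0 -> StageInfo(0, indeterminate = true,  parents = Set.empty, partiallyFinished = false),
  1 -> StageInfo(1, indeterminate = false, parents = Set(0),    partiallyFinished = true),
  2 -> StageInfo(2, indeterminate = false, parents = Set(0),    partiallyFinished = false),
  3 -> StageInfo(3, indeterminate = false, parents = Set(1),    partiallyFinished = true))

stagesToRollBack(stages, 0)  // → Set(0, 1, 3): stage 2 is finished and kept
stagesToRollBack(stages, 2)  // → Set(2): a deterministic stage retries alone
```

Note the M * N cost mentioned above shows up here as the per-stage ancestor sets; the sketch recomputes them on demand, whereas the proposal would cache them per active/finished stage.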