Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22112
  
    Sorry to be late, as this bug is really hard to reproduce. We need fetch 
failure to happen after an indeterminate map stage, we also need a large 
cluster, so that a fetch failure doesn't lose all the executors and retry the 
entire job.
    
    I created 2 test cases to verify the fix: one is having fetch failure 
happen in the result stage, so with this fix the job should fail. one is having 
fetch failure happen in a map stage, so this fix can properly retry the stages 
and get the correct result.
    
    The tests are run in Databricks cloud with a 20-nodes Spark cluster. The 
following is the result for the master branch without this PR:
    
![image](https://user-images.githubusercontent.com/3182036/44943136-f1c9ea00-adf3-11e8-89d6-f4249b549b4b.png)
    
![image](https://user-images.githubusercontent.com/3182036/44943143-16be5d00-adf4-11e8-8753-2eae137ce681.png)
    
    In the tests, we first do a shuffle to produce unordered data, then do a 
repartition to trigger the bug, finally collect the result and distinct it to 
see if it's corrected, or call `RDD#distinct` to add another shuffle.
    
    The result for the master branch with this PR:
    
![image](https://user-images.githubusercontent.com/3182036/44943192-c267ad00-adf4-11e8-85c2-acec916ccd62.png)
    
![image](https://user-images.githubusercontent.com/3182036/44943194-cc89ab80-adf4-11e8-8f42-dd9786cc7d0d.png)
    
    The first job fails, because we detect this bug but are not able to 
rollback the result stage currently. The second job finishes with corrected 
result.
    
    If you look into the number of tasks of the stages, you can see that with 
this fix, the stage hitting fetch failure is entirely retried(200 tasks), while 
without this fix, only the failed tasks are retried and produce wrong answer.
    
    The last thing: I have to revert https://github.com/apache/spark/pull/20422 
to make this fix work. It seems like a dangerous optimization to me: we skip 
the shuffle writing if the size of the existing shuffle file is same with the 
size of data we are writing. Same size doesn't mean same data, and my tests 
exposed it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to