Mridul Muralidharan created SPARK-2532:
------------------------------------------

             Summary: Fix issues with consolidated shuffle
                 Key: SPARK-2532
                 URL: https://issues.apache.org/jira/browse/SPARK-2532
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.1.0
         Environment: All
            Reporter: Mridul Muralidharan
            Assignee: Mridul Muralidharan
            Priority: Critical
             Fix For: 1.1.0



I will file a PR with the changes as soon as the merge is done (the earlier 
merge became outdated within 2 weeks, unfortunately :) ).

Consolidated shuffle is broken in multiple ways in Spark:

a) Task failure(s) can cause the state to become inconsistent.

b) Multiple reverts, or a combination of close/revert/close (as part of 
exception/error handling), can cause the state to become inconsistent.

c) Some of the API in the block writer causes implementation issues. For 
example, a revert is always followed by a close, but the implementation tries 
to keep them separate, which increases the surface for errors.

d) Fetching data from consolidated shuffle files can go badly wrong if the file 
is being actively written to: it computes the length by subtracting the current 
offset from the next offset (or from the file length if this is the last 
offset). The latter fails when a fetch happens in parallel with a write.
Note that this happens even if there are no task failures of any kind!
It usually results in stream corruption or decompression errors.
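
To illustrate (d), here is a minimal sketch of the length computation and how 
it can go wrong. This is not Spark's actual code: the function names, the 
offsets list and the byte counts are all hypothetical, and the mitigation at 
the end is only one possible approach, not necessarily what the PR will do.

```python
# Hypothetical model of one consolidated shuffle file: several map tasks
# append their output, and `offsets` records where each segment starts.

def segment_length(offsets, index, file_length):
    """Length of segment `index`: next offset minus current offset, or
    file length minus current offset for the last recorded segment."""
    start = offsets[index]
    if index + 1 < len(offsets):
        return offsets[index + 1] - start
    return file_length - start


# Two completed segments: bytes [0, 100) and [100, 250).
offsets = [0, 100]
assert segment_length(offsets, 0, 250) == 100
assert segment_length(offsets, 1, 250) == 150

# A third writer has appended 40 bytes but has not recorded its offset yet.
# The file is now 290 bytes long, so the "last" segment appears to be
# 40 bytes longer than it really is.
racy_length = segment_length(offsets, 1, 290)


# One possible mitigation (an assumption, not the actual fix): record an
# explicit end offset when each segment is committed, and never consult
# the live file length.
def committed_length(starts, ends, index):
    return ends[index] - starts[index]
```

In this sketch the reader would consume 190 bytes for a segment that is really 
150 bytes long, pulling in partially written data from the next map output, 
which is consistent with the corruption and decompression errors noted above.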




--
This message was sent by Atlassian JIRA
(v6.2#6252)
