GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/15085

    [SPARK-17484] Prevent invalid block locations from being reported after 
put() exceptions

    ## What changes were proposed in this pull request?
    
    If a BlockManager `put()` call failed after the BlockManagerMaster was 
notified of a block's availability, then incomplete cleanup logic in a `finally` 
block would never send a second block status message to inform the master of the 
block's unavailability. This, in turn, led to fetch failures and, before #15037 
was fixed, could cause complete job failures.
    
    This patch addresses this issue via multiple small changes:
    
    - The `finally` block now calls `removeBlockInternal` when cleaning up from 
a failed `put()`; in addition to removing the `BlockInfo` entry (which was 
_all_ that the old cleanup logic did), this code (redundantly) tries to remove 
the block from the memory and disk stores (as an added layer of defense against 
bugs lower down in the stack) and optionally notifies the master of block 
removal (which now happens during exception-triggered cleanup).
    - When a BlockManager receives a request for a block that it does not have, 
it will now notify the master to update its block locations. This ensures that 
bad metadata pointing to non-existent blocks will eventually be fixed. Note 
that I could have implemented this logic in the block manager client (rather 
than in the remote server), but that would introduce the problem of 
distinguishing between transient and permanent failures; on the server, 
however, we know definitively that the block isn't present.
    - Catch `NonFatal` instead of `Exception` to avoid swallowing 
`InterruptedException`s thrown from synchronous block replication calls.
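
    Taken together, the changes above amount to a "clean up and tell the 
master" pattern around `put()`. The following is a minimal Java analogue of 
that pattern; the names `doPut` and `removeBlockInternal` are simplified 
stand-ins for illustration, not the real BlockManager API, and catching only 
`RuntimeException` here mirrors the effect of Scala's `NonFatal` (which 
excludes `InterruptedException`):

```java
public class PutCleanupSketch {
    static boolean masterNotifiedOfRemoval = false;

    // Stand-in for the real removeBlockInternal: defensively clears the
    // memory and disk stores (elided here) and optionally reports the
    // removal so the master drops its stale location for the block.
    static void removeBlockInternal(String blockId, boolean tellMaster) {
        if (tellMaster) {
            masterNotifiedOfRemoval = true;
        }
    }

    static boolean doPut(String blockId, Runnable body) {
        boolean succeeded = false;
        try {
            body.run();       // attempt the actual store write
            succeeded = true;
            return true;
        } catch (RuntimeException e) {
            // Catch only ordinary failures, mirroring Scala's NonFatal:
            // an interrupt must not be swallowed at this point.
            return false;
        } finally {
            if (!succeeded) {
                // Full cleanup on failure: remove the block locally AND
                // notify the master, rather than only dropping the local
                // BlockInfo entry as the old cleanup logic did.
                removeBlockInternal(blockId, true);
            }
        }
    }
}
```

    A successful `doPut` leaves the master untouched; a failing one triggers 
`removeBlockInternal` with `tellMaster = true`, so the master never retains a 
location for a block that was never fully stored.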
    
    This patch depends upon the refactorings in #15036, so that other patch 
will also have to be backported when backporting this fix.
    
    For more background on this issue, including example logs from a real 
production failure, see 
[SPARK-17484](https://issues.apache.org/jira/browse/SPARK-17484).
    
    ## How was this patch tested?
    
    Two new regression tests in BlockManagerSuite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark SPARK-17484

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15085.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15085
    
----
commit ecc81f70a29c6568449c755f5dd6408610d6e56b
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-13T19:45:59Z

    Add regression test for cleanup after put() exception

commit ba3a9bd68ccc6e360170010deb11bb719819d54a
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-13T19:48:13Z

    Don't swallow InterruptedException in block replication code.

commit b39531323aa68510f7146900fe2f3241110d492e
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-13T19:46:52Z

    Add regression test for update of invalid block locations after fetch 
failure

commit aa75e2d7cca5820893fd059462e236a5ccdb77d3
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-13T18:37:07Z

    Perform more complete cleanup in put() finally block.

commit 6609c2a9595a880a77e9356abd71a1688ff67615
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-09T18:20:25Z

    Unconditionally update master's block status when handling invalid request.

----


