GitHub user JoshRosen opened a pull request: https://github.com/apache/spark/pull/15085
[SPARK-17484] Prevent invalid block locations from being reported after put() exceptions

## What changes were proposed in this pull request?

If a BlockManager `put()` call failed after the BlockManagerMaster was notified of a block's availability, then incomplete cleanup logic in a `finally` block would never send a second block status message to inform the master of the block's unavailability. This, in turn, led to fetch failures and used to be capable of causing complete job failures before #15037 was fixed.

This patch addresses the issue via several small changes:

- The `finally` block now calls `removeBlockInternal` when cleaning up from a failed `put()`; in addition to removing the `BlockInfo` entry (which was _all_ that the old cleanup logic did), this code (redundantly) tries to remove the block from the memory and disk stores (as an added layer of defense against bugs lower down in the stack) and optionally notifies the master of block removal (which now happens during exception-triggered cleanup).
- When a BlockManager receives a request for a block that it does not have, it will now notify the master to update its block locations. This ensures that bad metadata pointing to non-existent blocks will eventually be fixed. Note that I could have implemented this logic in the block manager client (rather than in the remote server), but that would introduce the problem of distinguishing between transient and permanent failures; on the server, however, we know definitively that the block isn't present.
- Catch `NonFatal` instead of `Exception` to avoid swallowing `InterruptedException`s thrown from synchronous block replication calls.

This patch depends upon the refactorings in #15036, so that other patch will also have to be backported when backporting this fix.

For more background on this issue, including example logs from a real production failure, see [SPARK-17484](https://issues.apache.org/jira/browse/SPARK-17484).

## How was this patch tested?
Two new regression tests in BlockManagerSuite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark SPARK-17484

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15085.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15085

----

commit ecc81f70a29c6568449c755f5dd6408610d6e56b
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-13T19:45:59Z

    Add regression test for cleanup after put() exception

commit ba3a9bd68ccc6e360170010deb11bb719819d54a
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-13T19:48:13Z

    Don't swallow InterruptedException in block replication code.

commit b39531323aa68510f7146900fe2f3241110d492e
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-13T19:46:52Z

    Add regression test for update of invalid block locations after fetch failure

commit aa75e2d7cca5820893fd059462e236a5ccdb77d3
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-13T18:37:07Z

    Perform more complete cleanup in put() finally block.

commit 6609c2a9595a880a77e9356abd71a1688ff67615
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-09T18:20:25Z

    Unconditionally update master's block status when handling invalid request.

----
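The error-handling pattern described in the PR body can be sketched as follows. This is an illustrative simplification, not Spark's actual `BlockManager` code; the names `doPutSketch`, `putBody`, and `removeBlockInternal`'s signature here are hypothetical stand-ins:

```scala
import scala.util.control.NonFatal

object BlockManagerSketch {

  // Sketch of the put() cleanup fix: on any failure, run the full removal
  // path (stores + master notification), not just BlockInfo-entry removal.
  def doPutSketch(
      blockId: String,
      putBody: () => Unit,
      removeBlockInternal: (String, Boolean) => Unit): Unit = {
    var succeeded = false
    try {
      putBody() // may have already told the master the block exists
      succeeded = true
    } finally {
      if (!succeeded) {
        // tellMaster = true: the master learns the block is gone, so no
        // stale location can cause fetch failures later.
        removeBlockInternal(blockId, true)
      }
    }
  }

  // Sketch of the NonFatal change: scala.util.control.NonFatal deliberately
  // does not match InterruptedException (or VirtualMachineError, etc.), so a
  // task-kill interrupt during synchronous replication propagates instead of
  // being swallowed, as it would be by `case e: Exception`.
  def replicateSketch(replicate: () => Unit): Unit = {
    try {
      replicate()
    } catch {
      case NonFatal(e) =>
        // Log and continue; the put itself can still succeed locally.
        println(s"Replication failed: $e")
    }
  }
}
```

Under this sketch, an `InterruptedException` thrown inside `replicate()` escapes `replicateSketch`, while an ordinary `IOException` is caught and logged.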