GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/15037

    [SPARK-17485] Prevent failed remote reads of cached blocks from failing entire job

    ## What changes were proposed in this pull request?
    
    In Spark's `RDD.getOrCompute` we first try to read a local copy of a cached 
RDD block, then a remote copy, and only fall back to recomputing the block if 
no cached copy (local or remote) can be read. This logic works correctly in the 
case where no remote copies of the block exist, but if there _are_ remote 
copies and reads of those copies fail (due to network issues or internal Spark 
bugs), then the BlockManager throws a `BlockFetchException` that fails the task 
(and can fail the whole job if the read failures keep occurring).
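
    As a rough illustration of that control flow (a minimal sketch with hypothetical 
helper names, not the actual `RDD.getOrCompute` code), the remote read sits between 
the local read and the recompute fallback, so an exception thrown there 
short-circuits the recovery path:

    ```scala
    // Hypothetical sketch only, not the actual RDD.getOrCompute / BlockManager
    // code: readLocal, readRemote, and recompute are illustrative stand-ins.
    object CacheReadSketch {
      def readLocal(blockId: String): Option[Array[Byte]] = None   // local cache miss
      // If a remote fetch failure threw here (as BlockFetchException did), the
      // fall-through to recompute() below would never run and the task would fail.
      def readRemote(blockId: String): Option[Array[Byte]] = None
      def recompute(blockId: String): Array[Byte] = Array.empty[Byte]

      def getOrCompute(blockId: String): Array[Byte] =
        readLocal(blockId)                   // 1. try a local cached copy
          .orElse(readRemote(blockId))       // 2. then a remote cached copy
          .getOrElse(recompute(blockId))     // 3. recompute only as a last resort
    }
    ```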
    
    In the cases of TorrentBroadcast and task result fetching we really do want 
to fail the entire job when no remote blocks can be fetched, but this logic is 
inappropriate for reads of cached RDD blocks, because those can (and should) be 
recomputed when the cached copies are unavailable.
    
    Therefore, I think that the `BlockManager.getRemoteBytes()` method should 
never throw on remote fetch errors and, instead, should handle failures by 
returning `None`.
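
    A minimal sketch of that proposed behavior (illustrative only; the helper names 
and location strings below are placeholders, not Spark APIs) would swallow 
per-location fetch failures and return `None` once every location has been tried:

    ```scala
    import scala.util.control.NonFatal

    // Illustrative sketch of the proposed behavior only, not the real BlockManager.
    object RemoteReadSketch {
      def getRemoteBytesSketch(
          locations: Seq[String],
          fetchFrom: String => Array[Byte]): Option[Array[Byte]] = {
        val it = locations.iterator
        while (it.hasNext) {
          val loc = it.next()
          try {
            return Some(fetchFrom(loc))   // success: hand the bytes back to the caller
          } catch {
            case NonFatal(_) =>
              // A failed fetch used to be re-thrown as BlockFetchException and
              // fail the task; instead we simply try the next location.
          }
        }
        None  // every location failed (or none existed): callers fall back to recompute
      }
    }
    ```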
    
    ## How was this patch tested?
    
    Block manager changes should be covered by modified tests in 
`BlockManagerSuite`: the old tests expected exceptions to be thrown on failed 
remote reads, while the modified tests now expect `None` to be returned from 
the `getRemote*` methods.
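
    As a rough sketch of that shift in expectations (the real assertions in 
`BlockManagerSuite` differ in detail; the stub below is purely illustrative):

    ```scala
    // Illustrative only: a stub standing in for a block manager whose remote
    // reads fail; the real tests drive an actual BlockManager in BlockManagerSuite.
    object TestExpectationSketch {
      class StubBlockManager {
        def getRemoteBytes(blockId: String): Option[Array[Byte]] = None
      }

      def main(args: Array[String]): Unit = {
        val bm = new StubBlockManager
        // New expectation: a failed remote read surfaces as None, not as a thrown
        // exception that the old tests had to intercept.
        assert(bm.getRemoteBytes("rdd_0_0").isEmpty)
      }
    }
    ```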
    
    I also manually inspected all usages of `BlockManager.getRemoteValues()`, 
`getRemoteBytes()`, and `get()` to verify that they correctly pattern-match on 
the result and handle `None`. Note that these `None` branches are already 
exercised because the old `getRemoteBytes` returned `None` when no remote 
locations for the block could be found (which could occur if an executor died 
and its block manager de-registered with the master).
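
    For completeness, a hedged sketch of that caller-side handling (names are 
hypothetical): pattern-matching on the returned `Option` sends both the existing 
"no remote locations" case and the new "failed fetch" case down the same 
recompute branch:

    ```scala
    // Illustrative caller-side handling: whether None means "no remote locations
    // were found" or (with this change) "all remote fetches failed", the caller
    // takes the same recompute branch.
    object CallerSketch {
      def readOrRecompute(
          remoteBytes: Option[Array[Byte]],
          recompute: () => Array[Byte]): Array[Byte] =
        remoteBytes match {
          case Some(bytes) => bytes        // a remote copy was read successfully
          case None        => recompute()  // unavailable or failed: recompute the block
        }
    }
    ```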

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark SPARK-17485

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15037.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15037
    
----
commit 35efe6a14ae148271ae674872cba3c8bad52834f
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-09T21:55:16Z

    Remove throw which caused the bug.

----

