Josh Rosen created SPARK-29471:
----------------------------------

             Summary: "TaskResultLost (result lost from block manager)" error 
message is misleading when the result fetch failure is caused by client-side issues
                 Key: SPARK-29471
                 URL: https://issues.apache.org/jira/browse/SPARK-29471
             Project: Spark
          Issue Type: Bug
          Components: Block Manager
    Affects Versions: 3.0.0
            Reporter: Josh Rosen


I recently encountered a problem where jobs non-deterministically failed with
{code:java}
TaskResultLost (result lost from block manager)
{code}
exceptions.

It turned out that this was due to some sort of networking issue where the 
Spark driver was unable to initiate outgoing connections to executors' block 
managers in order to fetch indirect task results.

In this situation, the error message was slightly misleading: the phrase 
"result lost from block manager" makes it sound like we received an error / 
block-not-found response from the remote host, whereas in my case the problem 
was actually a network connectivity issue where we weren't even able to connect 
in the first place.

If it's easy to do so, it might be nice to refine the error-handling / logging 
code so that we distinguish between the receipt of an error response vs. a 
lower-level networking / connectivity issue. 
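
As a rough illustration of the distinction suggested above, here is a minimal, self-contained sketch. It is not Spark's actual code: the class FetchFailureClassifier, the RemoteBlockNotFoundException type, and both message strings for the non-default cases are hypothetical names chosen for this example. It assumes that connectivity problems surface as standard java.net exceptions somewhere in the cause chain, which may not match how Spark's network layer reports them.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

public class FetchFailureClassifier {

    /** Hypothetical stand-in for an explicit "block not found" reply from the remote host. */
    public static class RemoteBlockNotFoundException extends IOException {
        public RemoteBlockNotFoundException(String message) {
            super(message);
        }
    }

    public enum FailureKind { REMOTE_ERROR_RESPONSE, CONNECTIVITY, UNKNOWN }

    // Bound the walk so a malformed (cyclic) cause chain cannot loop forever.
    private static final int MAX_CAUSE_DEPTH = 32;

    /** Walks the cause chain and reports the first recognizable kind of failure. */
    public static FailureKind classify(Throwable error) {
        Throwable current = error;
        for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
            if (current instanceof RemoteBlockNotFoundException) {
                // The remote side answered, but did not have the block.
                return FailureKind.REMOTE_ERROR_RESPONSE;
            }
            if (current instanceof ConnectException
                    || current instanceof NoRouteToHostException
                    || current instanceof UnknownHostException
                    || current instanceof SocketTimeoutException) {
                // We never got a usable connection to the remote side.
                return FailureKind.CONNECTIVITY;
            }
            current = current.getCause();
        }
        return FailureKind.UNKNOWN;
    }

    /** Maps each kind to an illustrative (hypothetical) error message. */
    public static String describe(FailureKind kind) {
        switch (kind) {
            case REMOTE_ERROR_RESPONSE:
                return "TaskResultLost (result lost from block manager)";
            case CONNECTIVITY:
                return "TaskResultLost (could not connect to block manager to fetch result)";
            default:
                return "TaskResultLost (result fetch failed for an unknown reason)";
        }
    }
}
```

With something along these lines, a caller could log describe(classify(e)) when a fetch fails, so that a connection-level failure on the driver side is no longer reported as if the remote block manager had lost the result.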



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
