[ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280980#comment-15280980 ]

Marcelo Vanzin commented on SPARK-14209:
----------------------------------------

Hi [~milesc], I looked at your updates since my last comment, but I don't see 
logs from an actual reproduction of the issue.

The file at https://www.dropbox.com/s/78u1y8ydsi3jpnb/default-driver.log 
contains logs from a Spark application, but not from one that failed in the way 
you describe here. Similarly, 
https://www.dropbox.com/s/och5o15sxr7dxv6/host-lost-failure.gz?dl=0 shows a 
different issue.

If you look at my previous comments, the log message that points at the problem 
in your original logs looks like this:

{noformat}
2016-03-25 23:28:41,722 ERROR o.a.s.s.cluster.YarnClusterScheduler: Lost an executor 51 (already removed): Pending loss reason.
{noformat}

The logger that generates that message belongs to the {{TaskSchedulerImpl}} 
class, so if you see the message coming from {{YarnClusterScheduler}}, something 
is wrong. There are also plenty of log messages that you can see in Spark's code 
but that never show up in your logs, which is a further hint that something is 
off with how your logs are being captured.
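
To make the logger point concrete, here is a generic per-class logging sketch 
(plain slf4j, with made-up class and method names; this is not Spark's actual 
Logging code): the logger name in the output normally identifies the class the 
logger was created for, which is why the class name in that log line's prefix 
matters.

{noformat}
import org.slf4j.{Logger, LoggerFactory}

// Generic illustration; SchedulerExample and reportLostExecutor are hypothetical names.
class SchedulerExample {
  // The logger is named after this class, so its output is prefixed with "SchedulerExample".
  private val log: Logger = LoggerFactory.getLogger(classOf[SchedulerExample])

  def reportLostExecutor(executorId: String): Unit = {
    // Shows up in the log as "... ERROR ...SchedulerExample: Lost an executor 51 ..."
    log.error(s"Lost an executor $executorId (already removed): Pending loss reason.")
  }
}
{noformat}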

None of the latest logs show any of those messages, which tells me that they're 
from applications that did not hit this issue.

> Application failure during preemption.
> --------------------------------------
>
>                 Key: SPARK-14209
>                 URL: https://issues.apache.org/jira/browse/SPARK-14209
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager
>    Affects Versions: 1.6.1
>         Environment: Spark on YARN
>            Reporter: Miles Crawford
>
> We have a fair-sharing cluster set up, including the external shuffle 
> service.  When a new job arrives, existing jobs are successfully preempted 
> down to fit.
> A spate of these messages arrives:
>       ExecutorLostFailure (executor 48 exited unrelated to the running tasks) 
> Reason: Container container_1458935819920_0019_01_000143 on host: 
> ip-10-12-46-235.us-west-2.compute.internal was preempted.
> This seems fine - the problem is that soon thereafter, our whole application 
> fails because it is unable to fetch blocks from the preempted containers:
> org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 
> locations. Most recent failure cause:
>     Caused by: java.io.IOException: Failed to connect to 
> ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
>         Caused by: java.net.ConnectException: Connection refused: 
> ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
> Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf
> Spark does not attempt to recreate these blocks - the tasks simply fail over 
> and over until the maxTaskAttempts value is reached.
> It appears to me that there is some fault in the way preempted containers are 
> being handled - shouldn't these blocks be recreated on demand?
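
For reference, the setup described in the report maps onto standard Spark 
configuration along these lines (a sketch with illustrative values, not taken 
from the reporter's cluster; {{spark.task.maxFailures}} is presumably the 
"maxTaskAttempts" limit mentioned above, and dynamic allocation is an assumption 
since the report only mentions fair sharing):

{noformat}
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative setup only; values are made up, not taken from the reporter's cluster.
object PreemptionReproSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("preemption-repro-sketch")
      // external shuffle service, as mentioned in the report
      .set("spark.shuffle.service.enabled", "true")
      // assumption: dynamic allocation enabled alongside the shuffle service
      .set("spark.dynamicAllocation.enabled", "true")
      // per-task attempt limit; once it is exceeded, the whole job fails as described
      .set("spark.task.maxFailures", "4")

    val sc = new SparkContext(conf)
    // ... job body would go here ...
    sc.stop()
  }
}
{noformat}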



