Romi Kuntsman created SPARK-11228:
-------------------------------------

             Summary: Job stuck in Executor failure loop when NettyTransport 
failed to bind
                 Key: SPARK-11228
                 URL: https://issues.apache.org/jira/browse/SPARK-11228
             Project: Spark
          Issue Type: Bug
          Components: Scheduler
    Affects Versions: 1.5.1
         Environment: 14.04.1-Ubuntu SMP x86_64 GNU/Linux
            Reporter: Romi Kuntsman


I changed my network connection while a local Spark cluster was running. On port 
8080, I can see the master and worker running. 

I'm running Spark from Java in client mode, so the driver runs inside my IDE. 
When I try to start a job on the local Spark cluster, I get an endless loop of 
the errors listed below under #1. It only stops when I kill the application 
manually.

In the worker log, I see an endless loop of the errors listed below under #2.

Expected behaviour: the job should fail after a few failed retries or after a 
timeout, instead of relaunching executors indefinitely.
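
To make the expected behaviour concrete, here is a minimal sketch of a bounded 
retry policy. It is not Spark code: the class name ExecutorRetryBudget, its 
methods, and the limit of 3 are hypothetical and chosen only for illustration. 
The counter resets only on a clean exit (code 0), so an executor that briefly 
reaches RUNNING and then exits with code 1 still counts against the budget.

```java
// Hypothetical sketch, not part of Spark: counts consecutive abnormal
// executor exits and reports when the job should be failed.
final class ExecutorRetryBudget {
    private final int maxConsecutiveFailures;
    private int consecutiveFailures = 0;

    ExecutorRetryBudget(int maxConsecutiveFailures) {
        if (maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("limit must be at least 1");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /**
     * Records one executor exit.
     * Returns true when the failure budget is used up and the job should fail.
     */
    boolean recordExit(int exitCode) {
        if (exitCode == 0) {
            // A clean exit resets the streak.
            consecutiveFailures = 0;
            return false;
        }
        consecutiveFailures++;
        return consecutiveFailures >= maxConsecutiveFailures;
    }

    int consecutiveFailures() {
        return consecutiveFailures;
    }
}
```

With a limit of 3, the driver log below would end after executor 3 instead of 
continuing with executors 4, 5, and so on.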

(IP anonymized to 1.2.3.4)

1. Errors seen on driver:

2015-10-21 11:20:54,793 INFO  [org.apache.spark.scheduler.TaskSchedulerImpl] 
Adding task set 0.0 with 2 tasks
2015-10-21 11:20:55,847 INFO  
[org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
app-20151021112052-0005/1 is now EXITED (Command exited with code 1)
2015-10-21 11:20:55,847 INFO  
[org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor 
app-20151021112052-0005/1 removed: Command exited with code 1
2015-10-21 11:20:55,848 INFO  
[org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to 
remove non-existent executor 1
2015-10-21 11:20:55,848 INFO  
[org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor added: 
app-20151021112052-0005/2 on worker-20151021090623-1.2.3.4-57305 
(1.2.3.4:57305) with 1 cores
2015-10-21 11:20:55,848 INFO  
[org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Granted 
executor ID app-20151021112052-0005/2 on hostPort 1.2.3.4:57305 with 1 cores, 
4.9 GB RAM
2015-10-21 11:20:55,849 INFO  
[org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
app-20151021112052-0005/2 is now LOADING
2015-10-21 11:20:55,852 INFO  
[org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
app-20151021112052-0005/2 is now RUNNING
2015-10-21 11:20:57,165 INFO  
[org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
app-20151021112052-0005/2 is now EXITED (Command exited with code 1)
2015-10-21 11:20:57,165 INFO  
[org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor 
app-20151021112052-0005/2 removed: Command exited with code 1
2015-10-21 11:20:57,166 INFO  
[org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to 
remove non-existent executor 2
2015-10-21 11:20:57,166 INFO  
[org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor added: 
app-20151021112052-0005/3 on worker-20151021090623-1.2.3.4-57305 
(1.2.3.4:57305) with 1 cores
2015-10-21 11:20:57,167 INFO  
[org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Granted 
executor ID app-20151021112052-0005/3 on hostPort 1.2.3.4:57305 with 1 cores, 
4.9 GB RAM
2015-10-21 11:20:57,167 INFO  
[org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
app-20151021112052-0005/3 is now LOADING
2015-10-21 11:20:57,169 INFO  
[org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
app-20151021112052-0005/3 is now RUNNING
2015-10-21 11:20:58,531 INFO  
[org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
app-20151021112052-0005/3 is now EXITED (Command exited with code 1)
2015-10-21 11:20:58,531 INFO  
[org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor 
app-20151021112052-0005/3 removed: Command exited with code 1
2015-10-21 11:20:58,532 INFO  
[org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to 
remove non-existent executor 3
2015-10-21 11:20:58,532 INFO  
[org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor added: 
app-20151021112052-0005/4 on worker-20151021090623-1.2.3.4-57305 
(1.2.3.4:57305) with 1 cores
2015-10-21 11:20:58,532 INFO  
[org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Granted 
executor ID app-20151021112052-0005/4 on hostPort 1.2.3.4:57305 with 1 cores, 
4.9 GB RAM
2015-10-21 11:20:58,533 INFO  
[org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
app-20151021112052-0005/4 is now LOADING
2015-10-21 11:20:58,535 INFO  
[org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
app-20151021112052-0005/4 is now RUNNING
2015-10-21 11:20:59,932 INFO  
[org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
app-20151021112052-0005/4 is now EXITED (Command exited with code 1)
2015-10-21 11:20:59,933 INFO  
[org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor 
app-20151021112052-0005/4 removed: Command exited with code 1
2015-10-21 11:20:59,933 INFO  
[org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to 
remove non-existent executor 4
2015-10-21 11:20:59,933 INFO  
[org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor added: 
app-20151021112052-0005/5 on worker-20151021090623-1.2.3.4-57305 
(1.2.3.4:57305) with 1 cores
2015-10-21 11:20:59,934 INFO  
[org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Granted 
executor ID app-20151021112052-0005/5 on hostPort 1.2.3.4:57305 with 1 cores, 
4.9 GB RAM
2015-10-21 11:20:59,935 INFO  
[org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
app-20151021112052-0005/5 is now LOADING
2015-10-21 11:20:59,937 INFO  
[org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
app-20151021112052-0005/5 is now RUNNING
2015-10-21 11:21:01,338 INFO  
[org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
app-20151021112052-0005/5 is now EXITED (Command exited with code 1)
2015-10-21 11:21:01,338 INFO  
[org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor 
app-20151021112052-0005/5 removed: Command exited with code 1
2015-10-21 11:21:01,339 INFO  
[org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to 
remove non-existent executor 5


2. Errors seen on workers:

15/10/21 11:20:53 INFO Remoting: Starting remoting
15/10/21 11:20:53 ERROR NettyTransport: failed to bind to /1.2.3.4:0, shutting 
down Netty transport
15/10/21 11:20:53 WARN Utils: Service 'driverPropsFetcher' could not bind on 
port 0. Attempting port 1.
15/10/21 11:20:53 INFO Slf4jLogger: Slf4jLogger started
15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down 
remote daemon.
15/10/21 11:20:53 INFO Remoting: Starting remoting
15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon 
shut down; proceeding with flushing remote transports.
15/10/21 11:20:53 ERROR Remoting: Remoting system has been terminated abrubtly. 
Attempting to shut down transports
15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut 
down.
15/10/21 11:20:53 ERROR NettyTransport: failed to bind to /1.2.3.4:0, shutting 
down Netty transport
15/10/21 11:20:53 WARN Utils: Service 'driverPropsFetcher' could not bind on 
port 0. Attempting port 1.
15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down 
remote daemon.
15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon 
shut down; proceeding with flushing remote transports.
15/10/21 11:20:53 INFO Slf4jLogger: Slf4jLogger started
15/10/21 11:20:53 ERROR Remoting: Remoting system has been terminated abrubtly. 
Attempting to shut down transports
15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut 
down.
15/10/21 11:20:53 INFO Remoting: Starting remoting
15/10/21 11:20:54 ERROR NettyTransport: failed to bind to /1.2.3.4:0, shutting 
down Netty transport
15/10/21 11:20:54 WARN Utils: Service 'driverPropsFetcher' could not bind on 
port 0. Attempting port 1.
15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down 
remote daemon.
15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon 
shut down; proceeding with flushing remote transports.
15/10/21 11:20:54 INFO Slf4jLogger: Slf4jLogger started
15/10/21 11:20:54 INFO Remoting: Starting remoting
15/10/21 11:20:54 ERROR NettyTransport: failed to bind to /1.2.3.4:0, shutting 
down Netty transport
15/10/21 11:20:54 WARN Utils: Service 'driverPropsFetcher' could not bind on 
port 0. Attempting port 1.
15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down 
remote daemon.
15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon 
shut down; proceeding with flushing remote transports.
15/10/21 11:20:54 ERROR Remoting: Remoting system has been terminated abrubtly. 
Attempting to shut down transports
15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut 
down.
15/10/21 11:20:54 INFO Slf4jLogger: Slf4jLogger started
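
A possible way to check the bind failure outside Spark, offered as an 
assumption rather than a confirmed diagnosis: port 0 asks the OS for any free 
port, so "failed to bind to /1.2.3.4:0" may mean the address is no longer 
assigned to a local interface after the network change. The sketch below uses 
only the standard java.net API; the class name BindCheck and its methods are 
hypothetical helpers, not part of Spark.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NetworkInterface;
import java.net.ServerSocket;
import java.net.SocketException;

// Hypothetical helper, not part of Spark.
final class BindCheck {
    /** True if the address is loopback, wildcard, or assigned to a local interface. */
    static boolean isLocalAddress(InetAddress addr) {
        try {
            return addr.isLoopbackAddress()
                    || addr.isAnyLocalAddress()
                    || NetworkInterface.getByInetAddress(addr) != null;
        } catch (SocketException e) {
            // Interface lookup failed; treat the address as not local.
            return false;
        }
    }

    /** Tries to bind an ephemeral port (port 0) on the given address. */
    static boolean canBind(InetAddress addr) {
        try (ServerSocket socket = new ServerSocket()) {
            socket.bind(new InetSocketAddress(addr, 0));
            return true;
        } catch (IOException e) {
            // Includes BindException when the address cannot be assigned.
            return false;
        }
    }
}
```

Running both checks on the worker host against the address the worker 
registered with would show whether that address is still usable there.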




