Romi Kuntsman created SPARK-11228:
-------------------------------------

             Summary: Job stuck in Executor failure loop when NettyTransport failed to bind
                 Key: SPARK-11228
                 URL: https://issues.apache.org/jira/browse/SPARK-11228
             Project: Spark
          Issue Type: Bug
          Components: Scheduler
    Affects Versions: 1.5.1
         Environment: 14.04.1-Ubuntu SMP x86_64 GNU/Linux
            Reporter: Romi Kuntsman

I changed my network connection while a local Spark cluster was running. On port 8080, I can see the master and worker running.
I'm running Spark from Java in client mode, so the driver runs inside my IDE.
When I try to start a job on the local Spark cluster, I get an endless loop of the errors listed under #1 below. It only stops when I kill the application manually.
In the worker log, I see an endless loop of the errors listed under #2 below.

Expected behaviour: the job should fail after a few failed retries or a timeout.

(IP anonymized to 1.2.3.4)

1. Errors seen on the driver:

2015-10-21 11:20:54,793 INFO [org.apache.spark.scheduler.TaskSchedulerImpl] Adding task set 0.0 with 2 tasks
2015-10-21 11:20:55,847 INFO [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/1 is now EXITED (Command exited with code 1)
2015-10-21 11:20:55,847 INFO [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor app-20151021112052-0005/1 removed: Command exited with code 1
2015-10-21 11:20:55,848 INFO [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to remove non-existent executor 1
2015-10-21 11:20:55,848 INFO [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor added: app-20151021112052-0005/2 on worker-20151021090623-1.2.3.4-57305 (1.2.3.4:57305) with 1 cores
2015-10-21 11:20:55,848 INFO [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Granted executor ID app-20151021112052-0005/2 on hostPort 1.2.3.4:57305 with 1 cores, 4.9 GB RAM
2015-10-21 11:20:55,849 INFO [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/2 is now LOADING
2015-10-21 11:20:55,852 INFO [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/2 is now RUNNING
2015-10-21 11:20:57,165 INFO [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/2 is now EXITED (Command exited with code 1)
2015-10-21 11:20:57,165 INFO [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor app-20151021112052-0005/2 removed: Command exited with code 1
2015-10-21 11:20:57,166 INFO [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to remove non-existent executor 2
2015-10-21 11:20:57,166 INFO [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor added: app-20151021112052-0005/3 on worker-20151021090623-1.2.3.4-57305 (1.2.3.4:57305) with 1 cores
2015-10-21 11:20:57,167 INFO [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Granted executor ID app-20151021112052-0005/3 on hostPort 1.2.3.4:57305 with 1 cores, 4.9 GB RAM
2015-10-21 11:20:57,167 INFO [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/3 is now LOADING
2015-10-21 11:20:57,169 INFO [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/3 is now RUNNING
2015-10-21 11:20:58,531 INFO [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/3 is now EXITED (Command exited with code 1)
2015-10-21 11:20:58,531 INFO [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor app-20151021112052-0005/3 removed: Command exited with code 1
2015-10-21 11:20:58,532 INFO [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to remove non-existent executor 3
2015-10-21 11:20:58,532 INFO [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor added: app-20151021112052-0005/4 on worker-20151021090623-1.2.3.4-57305 (1.2.3.4:57305) with 1 cores
2015-10-21 11:20:58,532 INFO [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Granted executor ID app-20151021112052-0005/4 on hostPort 1.2.3.4:57305 with 1 cores, 4.9 GB RAM
2015-10-21 11:20:58,533 INFO [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/4 is now LOADING
2015-10-21 11:20:58,535 INFO [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/4 is now RUNNING
2015-10-21 11:20:59,932 INFO [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/4 is now EXITED (Command exited with code 1)
2015-10-21 11:20:59,933 INFO [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor app-20151021112052-0005/4 removed: Command exited with code 1
2015-10-21 11:20:59,933 INFO [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to remove non-existent executor 4
2015-10-21 11:20:59,933 INFO [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor added: app-20151021112052-0005/5 on worker-20151021090623-1.2.3.4-57305 (1.2.3.4:57305) with 1 cores
2015-10-21 11:20:59,934 INFO [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Granted executor ID app-20151021112052-0005/5 on hostPort 1.2.3.4:57305 with 1 cores, 4.9 GB RAM
2015-10-21 11:20:59,935 INFO [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/5 is now LOADING
2015-10-21 11:20:59,937 INFO [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/5 is now RUNNING
2015-10-21 11:21:01,338 INFO [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/5 is now EXITED (Command exited with code 1)
2015-10-21 11:21:01,338 INFO [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor app-20151021112052-0005/5 removed: Command exited with code 1
2015-10-21 11:21:01,339 INFO [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to remove non-existent executor 5

2. Errors seen on workers:

15/10/21 11:20:53 INFO Remoting: Starting remoting
15/10/21 11:20:53 ERROR NettyTransport: failed to bind to /1.2.3.4:0, shutting down Netty transport
15/10/21 11:20:53 WARN Utils: Service 'driverPropsFetcher' could not bind on port 0. Attempting port 1.
15/10/21 11:20:53 INFO Slf4jLogger: Slf4jLogger started
15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/10/21 11:20:53 INFO Remoting: Starting remoting
15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/10/21 11:20:53 ERROR Remoting: Remoting system has been terminated abrubtly. Attempting to shut down transports
15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
15/10/21 11:20:53 ERROR NettyTransport: failed to bind to /1.2.3.4:0, shutting down Netty transport
15/10/21 11:20:53 WARN Utils: Service 'driverPropsFetcher' could not bind on port 0. Attempting port 1.
15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/10/21 11:20:53 INFO Slf4jLogger: Slf4jLogger started
15/10/21 11:20:53 ERROR Remoting: Remoting system has been terminated abrubtly. Attempting to shut down transports
15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
15/10/21 11:20:53 INFO Remoting: Starting remoting
15/10/21 11:20:54 ERROR NettyTransport: failed to bind to /1.2.3.4:0, shutting down Netty transport
15/10/21 11:20:54 WARN Utils: Service 'driverPropsFetcher' could not bind on port 0. Attempting port 1.
15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/10/21 11:20:54 INFO Slf4jLogger: Slf4jLogger started
15/10/21 11:20:54 INFO Remoting: Starting remoting
15/10/21 11:20:54 ERROR NettyTransport: failed to bind to /1.2.3.4:0, shutting down Netty transport
15/10/21 11:20:54 WARN Utils: Service 'driverPropsFetcher' could not bind on port 0. Attempting port 1.
15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/10/21 11:20:54 ERROR Remoting: Remoting system has been terminated abrubtly. Attempting to shut down transports
15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
15/10/21 11:20:54 INFO Slf4jLogger: Slf4jLogger started
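
To make the expected behaviour in this report concrete (fail after a few failed retries or a timeout), here is a minimal sketch of a bounded-failure policy in plain Java. It is an illustration only: ExecutorFailureTracker, recordFailure and recordSuccess are hypothetical names, not Spark classes or settings, and the threshold and window values are assumptions. The idea is to count executor exits inside a sliding time window and signal "give up" once the count reaches a limit, instead of relaunching executors forever.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical helper (not part of Spark): decides when repeated executor
 * exits should fail the application instead of triggering another relaunch.
 */
public class ExecutorFailureTracker {
    private final int maxFailures;      // assumed limit on failures per window
    private final long windowMillis;    // assumed sliding-window length
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    public ExecutorFailureTracker(int maxFailures, long windowMillis) {
        if (maxFailures < 1 || windowMillis < 1) {
            throw new IllegalArgumentException("maxFailures and windowMillis must be positive");
        }
        this.maxFailures = maxFailures;
        this.windowMillis = windowMillis;
    }

    /** Records one executor exit; returns true once the limit is reached within the window. */
    public boolean recordFailure(long nowMillis) {
        failureTimes.addLast(nowMillis);
        // Drop failures that are older than the sliding window.
        while (!failureTimes.isEmpty() && nowMillis - failureTimes.peekFirst() > windowMillis) {
            failureTimes.removeFirst();
        }
        return failureTimes.size() >= maxFailures;
    }

    /** Clears the history, e.g. once an executor has stayed up and done useful work. */
    public void recordSuccess() {
        failureTimes.clear();
    }

    public static void main(String[] args) {
        // Allow at most 3 executor exits within 10 seconds (illustrative values).
        ExecutorFailureTracker tracker = new ExecutorFailureTracker(3, 10_000L);
        // Offsets in milliseconds, loosely modelled on the driver log:
        // one executor exit roughly every 1.3 to 1.4 seconds.
        long[] exits = {55_847L, 57_165L, 58_531L, 59_932L};
        for (long t : exits) {
            if (tracker.recordFailure(t)) {
                System.out.println("Giving up: too many executor failures by t=" + t + " ms");
                break;
            }
        }
    }
}
```

With a policy of this kind, the loop in the driver log would end after a bounded number of executor exits rather than running until the application is killed manually. Where such a check would belong inside Spark (master, scheduler backend, or application client) is left open here.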