[jira] [Issue Comment Deleted] (SPARK-1499) Workers continuously produce failing executors
[ https://issues.apache.org/jira/browse/SPARK-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

XeonZhao updated SPARK-1499:
    Comment: was deleted (was: How to solve this problem ?I have encountered this issue.)

> Workers continuously produce failing executors
> ----------------------------------------------
>
>                 Key: SPARK-1499
>                 URL: https://issues.apache.org/jira/browse/SPARK-1499
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy, Spark Core
>    Affects Versions: 0.9.1, 1.0.0
>            Reporter: Aaron Davidson
>
> If a node is in a bad state, such that newly started executors fail on
> startup or first use, the Standalone Cluster Worker will happily keep
> spawning new ones. A better behavior would be for a Worker to mark itself as
> dead if it has had a history of continuously producing erroneous executors,
> or else to somehow prevent a driver from re-registering executors from the
> same machine repeatedly.
>
> Reported on mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccal8t0bqjfgtf-vbzjq6yj7ckbl_9p9s0trvew2mvg6zbngx...@mail.gmail.com%3E
>
> Relevant logs:
> {noformat}
> 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor updated: app-20140411190649-0008/4 is now FAILED (Command exited with code 53)
> 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140411190649-0008/4 removed: Command exited with code 53
> 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Executor 4 disconnected, so removing it
> 14/04/11 19:06:52 ERROR scheduler.TaskSchedulerImpl: Lost an executor 4 (already removed): Failed to create local directory (bad spark.local.dir?)
> 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor added: app-20140411190649-0008/27 on worker-20140409212012-ip-172-31-19-11.us-west-1.compute.internal-58614 (ip-172-31-19-11.us-west-1.compute.internal:58614) with 8 cores
> 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20140411190649-0008/27 on hostPort ip-172-31-19-11.us-west-1.compute.internal:58614 with 8 cores, 56.9 GB RAM
> 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor updated: app-20140411190649-0008/27 is now RUNNING
> 14/04/11 19:06:52 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Registering block manager ip-172-31-24-76.us-west-1.compute.internal:50256 with 32.7 GB RAM
> 14/04/11 19:06:52 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=wikistats_pd
> 14/04/11 19:06:52 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=wikistats_pd
> 14/04/11 19:06:53 DEBUG hive.log: DDL: struct wikistats_pd { string projectcode, string pagename, i32 pageviews, i32 bytes}
> 14/04/11 19:06:53 DEBUG lazy.LazySimpleSerDe: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe initialized with: columnNames=[projectcode, pagename, pageviews, bytes] columnTypes=[string, string, int, int] separator=[[B@29a81175] nullstring=\N lastColumnTakesRest=false
> shark> 14/04/11 19:06:55 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkexecu...@ip-172-31-19-11.us-west-1.compute.internal:45248/user/Executor#-1002203295] with ID 27
> show 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Executor 27 disconnected, so removing it
> 14/04/11 19:06:56 ERROR scheduler.TaskSchedulerImpl: Lost an executor 27 (already removed): remote Akka client disassociated
> 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor updated: app-20140411190649-0008/27 is now FAILED (Command exited with code 53)
> 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140411190649-0008/27 removed: Command exited with code 53
> 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor added: app-20140411190649-0008/28 on worker-20140409212012-ip-172-31-19-11.us-west-1.compute.internal-58614 (ip-172-31-19-11.us-west-1.compute.internal:58614) with 8 cores
> 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20140411190649-0008/28 on hostPort ip-172-31-19-11.us-west-1.compute.internal:58614 with 8 cores, 56.9 GB RAM
> 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor updated: app-20140411190649-0008/28 is now RUNNING
> tables;
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
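The mitigation the report proposes (a Worker marking itself dead after a run of continuously failing executors) could be sketched as follows. This is a minimal illustration of the idea only, not Spark code (Spark's Worker is written in Scala, and the class and threshold names here are hypothetical): count consecutive non-zero executor exits per worker, reset the count on any clean exit, and refuse to launch further executors once the threshold is crossed.

```python
# Hypothetical sketch of per-worker failure throttling for SPARK-1499.
# WorkerHealth and MAX_CONSECUTIVE_FAILURES are illustrative names,
# not part of Spark's actual API.

MAX_CONSECUTIVE_FAILURES = 10


class WorkerHealth:
    def __init__(self):
        self.consecutive_failures = 0
        self.alive = True

    def on_executor_exit(self, exit_code):
        """Record an executor exit reported by this worker."""
        if exit_code == 0:
            # A successful run shows the node can host executors;
            # reset the failure streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
                # Stop offering this worker to drivers instead of
                # spawning executor after executor that exits with
                # the same error (e.g. code 53 in the logs above).
                self.alive = False

    def can_launch_executor(self):
        return self.alive
```

With this in place, a node whose executors keep dying with exit code 53 would be taken out of rotation after ten consecutive failures rather than cycling through executor IDs 4, 27, 28, ... indefinitely.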