[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520450#comment-14520450 ]

Apache Spark commented on SPARK-5529:

User 'alexrovner' has created a pull request for this issue: https://github.com/apache/spark/pull/5793

BlockManager heartbeat expiration does not kill executor

Key: SPARK-5529
URL: https://issues.apache.org/jira/browse/SPARK-5529
Project: Spark
Issue Type: Bug
Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Hong Shen
Assignee: Hong Shen
Fix For: 1.4.0
Attachments: SPARK-5529.patch

When I run a Spark job, one executor hangs. After 120s the driver removes its BlockManager, but it takes about another half an hour before the driver removes the executor itself. Here is the log:

{code}
15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 12ms
15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated
15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0
15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost)
15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1
15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster.
15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
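The behavior the issue title asks for can be illustrated with a small, self-contained sketch. This is not the code from SPARK-5529.patch or from the pull requests, and it is not Spark's API: the class `HeartbeatMonitor` and the `kill_executor` callback are hypothetical names chosen for illustration. The 120 s timeout comes from the description and the 147198 ms silence from the first log line. The sketch shows a driver-side monitor that, on heartbeat expiry, both drops its bookkeeping entry and asks for the executor to be killed, rather than only doing the former.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HeartbeatMonitor:
    """Toy driver-side heartbeat tracker (hypothetical, not Spark's implementation)."""
    timeout_ms: int
    kill_executor: Callable[[str], None]  # hypothetical hook into the scheduler
    last_seen_ms: Dict[str, int] = field(default_factory=dict)

    def heartbeat(self, executor_id: str, now_ms: int) -> None:
        # Record the time of the most recent heartbeat from this executor.
        self.last_seen_ms[executor_id] = now_ms

    def expire(self, now_ms: int) -> List[str]:
        # Find executors whose last heartbeat is older than the timeout.
        expired = [e for e, t in self.last_seen_ms.items()
                   if now_ms - t > self.timeout_ms]
        for executor_id in expired:
            # Drop the bookkeeping entry (the step the log already shows happening).
            del self.last_seen_ms[executor_id]
            # Also request that the executor be killed (the step the issue says is missing).
            self.kill_executor(executor_id)
        return expired


killed: List[str] = []
monitor = HeartbeatMonitor(timeout_ms=120_000, kill_executor=killed.append)
monitor.heartbeat("1", now_ms=0)        # executor 1 then goes silent
monitor.heartbeat("2", now_ms=100_000)  # executor 2 is still healthy
expired = monitor.expire(now_ms=147_198)
print(expired, killed)
```

In this toy run only executor 1 has been silent for longer than the timeout, so it is the only one removed and killed; executor 2 stays registered.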
[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517280#comment-14517280 ]

Sean Owen commented on SPARK-5529:

[~arov] CDH always ships the latest upstream minor release in its minor releases, and back-ports maintenance-release fixes into its maintenance releases. This is on about the same 3-4 month cycle as Spark, so it's about as fast as one could expect; CDH 5.4 = 1.3.x already. This change isn't even in a Spark release yet, so yes, you probably want it back-ported to 1.3. That has to happen before it can end up in CDH, though.
[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517281#comment-14517281 ]

Alex Rovner commented on SPARK-5529:

Applied patch to 1.3: https://github.com/apache/spark/pull/5745
[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517285#comment-14517285 ]

Apache Spark commented on SPARK-5529:

User 'alexrovner' has created a pull request for this issue: https://github.com/apache/spark/pull/5745
[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517257#comment-14517257 ]

Alex Rovner commented on SPARK-5529:

CDH is usually somewhat slow to pick up the latest changes, though. Would it be possible to backport this fix into 1.3?
[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517298#comment-14517298 ]

Alex Rovner commented on SPARK-5529:

Sorry, I pulled the trigger too quickly... I need to resolve some compilation errors.
[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517498#comment-14517498 ]

Apache Spark commented on SPARK-5529:

User 'alexrovner' has created a pull request for this issue: https://github.com/apache/spark/pull/5747
[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517499#comment-14517499 ]

Alex Rovner commented on SPARK-5529:

Sorry about all the pull requests. Here is one rebased against the right branch and without any compilation issues: https://github.com/apache/spark/pull/5747
[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512244#comment-14512244 ]

Hong Shen commented on SPARK-5529:

Version 1.4.0 should be released in June.
[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14511374#comment-14511374 ]

Alex Rovner commented on SPARK-5529:

We are facing this issue on cdh5.3.2 with Spark 1.2.0-SNAPSHOT. Is there any workaround other than upgrading to Spark 1.4?