[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor

2015-04-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520450#comment-14520450
 ] 

Apache Spark commented on SPARK-5529:
-

User 'alexrovner' has created a pull request for this issue:
https://github.com/apache/spark/pull/5793

 BlockManager heartbeat expiration does not kill executor
 

 Key: SPARK-5529
 URL: https://issues.apache.org/jira/browse/SPARK-5529
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Hong Shen
Assignee: Hong Shen
 Fix For: 1.4.0

 Attachments: SPARK-5529.patch


 When I run a Spark job and one executor hangs, the driver removes its 
 BlockManager after 120s, but it takes about half an hour before the driver 
 removes the executor itself. Here is the log:
 {code}
 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms 
 exceeds 120000ms
 
 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 
 10.215.143.14: remote Akka client disassociated
 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote 
 system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is 
 now gated for [5000] ms. Reason is: [Disassociated].
 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 
 0.0
 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 
 10.215.143.14): ExecutorLostFailure (executor 1 lost)
 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove 
 non-existent executor 1
 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 
 from BlockManagerMaster.
 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in 
 removeExecutor
 {code}
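
To make the "half an hour" concrete, the gap between the BlockManager removal and the executor loss can be computed from the two timestamps in the log above. This is only a minimal sketch: the helper name `gap_seconds` is hypothetical and not part of Spark, and the timestamp format is assumed from the log lines.

```python
from datetime import datetime

# Timestamp layout assumed from the log lines above: yy/MM/dd HH:mm:ss.
LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"


def gap_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps (hypothetical helper)."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


# Timestamps copied from the log: BlockManager removed, then executor lost.
block_manager_removed = "15/02/02 14:58:43"
executor_lost = "15/02/02 15:26:55"

gap = gap_seconds(block_manager_removed, executor_lost)
minutes, seconds = divmod(int(gap), 60)
print(f"{minutes} min {seconds} s between BlockManager removal and executor loss")
```

Under these assumptions the gap is 28 minutes 12 seconds, during which the executor was still registered with the driver even though its BlockManager had already been removed.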



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor

2015-04-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517280#comment-14517280
 ] 

Sean Owen commented on SPARK-5529:
--

[~arov] CDH always has the latest upstream minor release in minor releases, and 
back-ports maintenance release fixes into maintenance releases. This is on 
about the same 3-4 month cycle as Spark, so it's about as fast as one could 
expect; CDH 5.4 = 1.3.x already. This change isn't even in a Spark release yet, 
so yes you want it to be back-ported to 1.3, probably. That has to precede 
ending up in CDH though.




[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor

2015-04-28 Thread Alex Rovner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517281#comment-14517281
 ] 

Alex Rovner commented on SPARK-5529:


Applied patch to 1.3: https://github.com/apache/spark/pull/5745




[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor

2015-04-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517285#comment-14517285
 ] 

Apache Spark commented on SPARK-5529:
-

User 'alexrovner' has created a pull request for this issue:
https://github.com/apache/spark/pull/5745




[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor

2015-04-28 Thread Alex Rovner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517257#comment-14517257
 ] 

Alex Rovner commented on SPARK-5529:


CDH is usually somewhat slow to pick up the latest changes though. Would it 
be possible to backport this fix into 1.3?




[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor

2015-04-28 Thread Alex Rovner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517298#comment-14517298
 ] 

Alex Rovner commented on SPARK-5529:


Sorry, I pulled the trigger too quickly... I need to resolve some compilation errors.




[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor

2015-04-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517498#comment-14517498
 ] 

Apache Spark commented on SPARK-5529:
-

User 'alexrovner' has created a pull request for this issue:
https://github.com/apache/spark/pull/5747




[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor

2015-04-28 Thread Alex Rovner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517499#comment-14517499
 ] 

Alex Rovner commented on SPARK-5529:


Sorry about all the pull requests. Here is one rebased against the right branch 
and without any compilation issues: https://github.com/apache/spark/pull/5747




[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor

2015-04-24 Thread Hong Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512244#comment-14512244
 ] 

Hong Shen commented on SPARK-5529:
--

Version 1.4.0 is expected to be released in June.




[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor

2015-04-24 Thread Alex Rovner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511374#comment-14511374
 ] 

Alex Rovner commented on SPARK-5529:


We are facing this issue on CDH 5.3.2 (Spark 1.2.0-SNAPSHOT).

Is there any workaround other than upgrading to Spark 1.4?
