Found the cause of the error above - the spark_shuffle setting was incomplete.
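
For reference, the usual registration of the external shuffle service in yarn-site.xml on the NodeManagers amounts to the two properties below (these are the standard names/values for the Spark-on-YARN shuffle service; shown only as a reference, exact values may differ per cluster), plus the Spark YARN shuffle jar on the NodeManager classpath:

  yarn.nodemanager.aux-services = mapreduce_shuffle,spark_shuffle
  yarn.nodemanager.aux-services.spark_shuffle.class = org.apache.spark.network.yarn.YarnShuffleService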

Now it is able to request and get additional executors. The issue is that once
they are released, it cannot proceed with the next query.
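
The sequence is roughly the following in spark-shell (the query and table name are just placeholders, not the actual workload):

  // first query: dynamic allocation requests executors and the job completes
  val df = sqlContext.sql("SELECT count(*) FROM some_table")
  df.collect()

  // wait past spark.dynamicAllocation.executorIdleTimeout (60s here) so the idle executors are released
  Thread.sleep(90 * 1000)

  // second query: this one hangs after the executors have been killed
  df.collect()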

The environment is CDH 5.3.2 (Hadoop 2.5) with Kerberos & Spark 1.3

After the idle timeout, the release of the executors logs the stack trace below
at WARN level and then returns to the prompt (in spark-shell).

15/03/23 20:55:50 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 2
15/03/23 20:55:50 INFO ExecutorAllocationManager: Removing executor 2 because it has been idle for 60 seconds (new desired total will be 6)
15/03/23 20:55:50 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 5
15/03/23 20:55:50 INFO ExecutorAllocationManager: Removing executor 5 because it has been idle for 60 seconds (new desired total will be 5)
15/03/23 20:55:50 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 1
15/03/23 20:55:51 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@xxxl:47358] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/03/23 20:55:51 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@yyy:51807] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/03/23 20:55:52 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@zzz:54623] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/03/23 20:56:20 WARN AkkaUtils: Error sending message [message = KillExecutors(ArrayBuffer(1))] in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:171)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerActor$$anonfun$receive$1$$anonfun$applyOrElse$4.apply$mcV$sp(YarnSchedulerBackend.scala:136)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerActor$$anonfun$receive$1$$anonfun$applyOrElse$4.apply(YarnSchedulerBackend.scala:136)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerActor$$anonfun$receive$1$$anonfun$applyOrElse$4.apply(YarnSchedulerBackend.scala:136)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/03/23 20:56:20 WARN AkkaUtils: Error sending message [message = KillExecutors(ArrayBuffer(1))] in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:171)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doKillExecutors(YarnSchedulerBackend.scala:68)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.killExecutors(CoarseGrainedSchedulerBackend.scala:375)
at org.apache.spark.SparkContext.killExecutors(SparkContext.scala:1173)
at org.apache.spark.ExecutorAllocationClient$class.killExecutor(ExecutorAllocationClient.scala:49)
at org.apache.spark.SparkContext.killExecutor(SparkContext.scala:1186)
at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$removeExecutor(ExecutorAllocationManager.scala:353)
at org.apache.spark.ExecutorAllocationManager$$anonfun$org$apache$spark$ExecutorAllocationManager$$schedule$1.apply(ExecutorAllocationManager.scala:237)
at org.apache.spark.ExecutorAllocationManager$$anonfun$org$apache$spark$ExecutorAllocationManager$$schedule$1.apply(ExecutorAllocationManager.scala:234)
at scala.collection.mutable.MapLike$$anonfun$retain$2.apply(MapLike.scala:213)
at scala.collection.mutable.MapLike$$anonfun$retain$2.apply(MapLike.scala:212)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at scala.collection.mutable.MapLike$class.retain(MapLike.scala:212)
at scala.collection.mutable.AbstractMap.retain(Map.scala:91)
at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:234)
at org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:191)

The next query then hangs, with the following output:

......
15/03/23 20:57:27 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 4.5 KB, free 264.9 MB)
15/03/23 20:57:27 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on xxx:45903 (size: 4.5 KB, free: 265.1 MB)
15/03/23 20:57:27 INFO BlockManagerMaster: Updated info of block broadcast_4_piece0
15/03/23 20:57:27 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:839
15/03/23 20:57:27 INFO DAGScheduler: Submitting 10 missing tasks from Stage 2 (MapPartitionsRDD[11] at mapPartitions at Exchange.scala:64)
15/03/23 20:57:27 INFO YarnScheduler: Adding task set 2.0 with 10 tasks
15/03/23 20:57:56 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(2, yyy, 37231) with no recent heart beats: 131006ms exceeds 120000ms
15/03/23 20:57:56 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(5, zzz, 34437) with no recent heart beats: 131329ms exceeds 120000ms
15/03/23 20:57:56 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, bbb, 52747) with no recent heart beats: 128199ms exceeds 120000ms
15/03/23 20:57:56 INFO BlockManagerMasterActor: Removing block manager BlockManagerId(2, yyy, 37231)
15/03/23 20:57:56 INFO BlockManagerMasterActor: Removing block manager BlockManagerId(1, bbb, 52747)
15/03/23 20:57:56 INFO BlockManagerMasterActor: Removing block manager BlockManagerId(5, zzz, 34437)



On Sat, Mar 21, 2015 at 6:51 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> bq. Requesting 1 new executor(s) because tasks are backlogged
>
> 1 executor was requested.
>
> Which hadoop release are you using ?
>
> Can you check resource manager log to see if there is some clue ?
>
> Thanks
>
> On Fri, Mar 20, 2015 at 4:17 PM, Manoj Samel <manojsamelt...@gmail.com>
> wrote:
>
>> Forgot to add - the cluster is idle otherwise so there should be no
>> resource issues. Also the configuration works when not using Dynamic
>> allocation.
>>
>> On Fri, Mar 20, 2015 at 4:15 PM, Manoj Samel <manojsamelt...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Running Spark 1.3 with secured Hadoop.
>>>
>>> Spark-shell with Yarn client mode runs without issue when not using
>>> Dynamic Allocation.
>>>
>>> When Dynamic allocation is turned on, the shell comes up but same SQL
>>> etc. causes it to loop.
>>>
>>> spark.dynamicAllocation.enabled=true
>>> spark.dynamicAllocation.initialExecutors=1
>>> spark.dynamicAllocation.maxExecutors=10
>>> # Set IdleTime low for testing
>>> spark.dynamicAllocation.executorIdleTimeout=60
>>> spark.shuffle.service.enabled=true
>>>
>>> Following is the start of the messages and then it keeps looping with
>>> "Requesting 0 new executors"
>>>
>>> 15/03/20 22:52:42 INFO storage.BlockManagerMaster: Updated info of block
>>> broadcast_1_piece0
>>> 15/03/20 22:52:42 INFO spark.SparkContext: Created broadcast 1 from
>>> broadcast at DAGScheduler.scala:839
>>> 15/03/20 22:52:42 INFO scheduler.DAGScheduler: Submitting 1 missing
>>> tasks from Stage 0 (MapPartitionsRDD[3] at mapPartitions at
>>> Exchange.scala:100)
>>> 15/03/20 22:52:42 INFO cluster.YarnScheduler: Adding task set 0.0 with 1
>>> tasks
>>> 15/03/20 22:52:47 INFO spark.ExecutorAllocationManager: Requesting 1 new
>>> executor(s) because tasks are backlogged (new desired total will be 1)
>>> 15/03/20 22:52:52 INFO spark.ExecutorAllocationManager: Requesting 0 new
>>> executor(s) because tasks are backlogged (new desired total will be 1)
>>> 15/03/20 22:52:57 WARN cluster.YarnScheduler: Initial job has not
>>> accepted any resources; check your cluster UI to ensure that workers are
>>> registered and have sufficient resources
>>> 15/03/20 22:52:57 INFO spark.ExecutorAllocationManager: Requesting 0 new
>>> executor(s) because tasks are backlogged (new desired total will be 1)
>>> 15/03/20 22:53:02 INFO spark.ExecutorAllocationManager: Requesting 0 new
>>> executor(s) because tasks are backlogged (new desired total will be 1)
>>> 15/03/20 22:53:07 INFO spark.ExecutorAllocationManager: Requesting 0 new
>>> executor(s) because tasks are backlogged (new desired total will be 1)
>>> 15/03/20 22:53:12 INFO spark.ExecutorAllocationManager: Requesting 0 new
>>> executor(s) because tasks are backlogged (new desired total will be 1)
>>> 15/03/20 22:53:12 WARN cluster.YarnScheduler: Initial job has not
>>> accepted any resources; check your cluster UI to ensure that workers are
>>> registered and have sufficient resources
>>> 15/03/20 22:53:17 INFO spark.ExecutorAllocationManager: Requesting 0 new
>>> executor(s) because tasks are backlogged (new desired total will be 1)
>>> 15/03/20 22:53:22 INFO spark.ExecutorAllocationManager: Requesting 0 new
>>> executor(s) because tasks are backlogged (new desired total will be 1)
>>> 15/03/20 22:53:27 INFO spark.ExecutorAllocationManager: Requesting 0 new
>>> executor(s) because tasks are backlogged (new desired total will be 1)
>>>
>>
>>
>
