Hi Yash,

What is your total cluster memory and number of cores?
The problem might be the number of executors you are allocating. The logs show the driver requesting 168510 executors, which is very high. Try reducing the number of executors you request.
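
With dynamic allocation enabled you can also cap the request explicitly rather than letting it grow with the input. Here is a minimal sketch of what that could look like; the values are placeholders picked for illustration, not tuned for your cluster:

{code}
// Sketch only: bounding dynamic allocation with explicit limits (Scala).
// The numbers below are illustrative placeholders, not recommendations.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("bulk-job")                                 // placeholder app name
  .set("spark.shuffle.service.enabled", "true")           // required for dynamic allocation on YARN
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "200")     // hard cap on what the driver may request
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "6g")

val sc = new SparkContext(conf)
{code}

The same properties can also be passed as --conf options to spark-submit instead of being set in code.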

On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
Hi All,
I have a Spark job that runs over a huge bulk of data with dynamic allocation enabled.
The job takes some 15 minutes to start up and fails as soon as it starts*.

Is there anything I can check to debug this problem? There is not a lot of information in the logs about the exact cause, but here is a snapshot below.

Thanks All.

* - by starts, I mean when it shows something on the Spark web UI; before that it's just a blank page.

Logs here -

{code}
16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number of 168510 executor(s).
16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor containers, each with 2 cores and 6758 MB memory including 614 MB overhead
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 22
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 19
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 18
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 12
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 11
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 20
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 15
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 7
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 8
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 16
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 21
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 6
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 13
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 14
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 9
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 3
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 17
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 1
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 10
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 4
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 2
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 5
16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1 time(s) in a row.
java.lang.StackOverflowError
        at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
        at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
        at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
        at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
        at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
{code}

... <trimmed logs>

{code}
16/09/23 06:33:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to get executor loss reason for executor id 7 at RPC address , but got no response. Marking as slave lost.
org.apache.spark.SparkException: Fail to find loss reason for non-existent executor 7
        at org.apache.spark.deploy.yarn.YarnAllocator.enqueueGetLossReasonRequest(YarnAllocator.scala:554)
        at org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint$$anonfun$receiveAndReply$1.applyOrElse(ApplicationMaster.scala:632)
        at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{code}




