Hi All,

I have a Spark job which runs over a huge bulk of data with dynamic
allocation enabled. The job takes some 15 minutes to start up and fails
as soon as it starts*.

Is there anything I can check to debug this problem? There is not a lot
of information in the logs about the exact cause, but a snapshot is below.

Thanks All.

* - by "starts" I mean when it shows something on the Spark web UI;
before that it's just a blank page.
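For context, dynamic allocation is enabled roughly like this (the property names below are the standard Spark dynamic-allocation settings; the values shown are illustrative, not my exact config):

{code}
# spark-defaults.conf (illustrative)
spark.dynamicAllocation.enabled        true
spark.shuffle.service.enabled          true
# No explicit cap is set, so spark.dynamicAllocation.maxExecutors
# defaults to unlimited (Int.MaxValue) -- relevant to the huge
# executor request in the logs below.
{code}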

Logs here -

{code}
16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number of 168510 executor(s).
16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor containers, each with 2 cores and 6758 MB memory including 614 MB overhead
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 22
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 19
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 18
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 12
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 11
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 20
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 15
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 7
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 8
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 16
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 21
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 6
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 13
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 14
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 9
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 3
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 17
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 1
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 10
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 4
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 2
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 5
16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1 time(s) in a row.
java.lang.StackOverflowError
        at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
        at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
        at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
        at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
        at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
{code}
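For scale: the allocator request in the log above works out to an enormous cluster footprint. Quick arithmetic, using only the numbers from the log lines (I am not sure this is the cause of the failure, but the size of the request seems suspicious):

```python
# Figures taken from the YarnAllocator log lines above.
executors = 168510            # "Driver requested a total number of 168510 executor(s)"
cores_per_executor = 2        # "each with 2 cores"
mem_mb_per_executor = 6758    # "6758 MB memory including 614 MB overhead"

total_cores = executors * cores_per_executor
total_mem_tb = executors * mem_mb_per_executor / (1024 ** 2)

print(total_cores)             # 337020 cores requested
print(round(total_mem_tb, 1))  # ~1086.0 TB of memory requested
```

Over a petabyte of memory requested at once, which makes me wonder whether the uncapped executor count is overwhelming the allocator.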

... <trimmed logs>

{code}
16/09/23 06:33:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to get executor loss reason for executor id 7 at RPC address , but got no response. Marking as slave lost.
org.apache.spark.SparkException: Fail to find loss reason for non-existent executor 7
        at org.apache.spark.deploy.yarn.YarnAllocator.enqueueGetLossReasonRequest(YarnAllocator.scala:554)
        at org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint$$anonfun$receiveAndReply$1.applyOrElse(ApplicationMaster.scala:632)
        at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{code}
