Thanks Aditya, appreciate the help. I had the exact same thought about the huge number of executors requested. I am using dynamic allocation and not specifying the number of executors myself. Are you suggesting that I should cap the number of executors when the dynamic allocator requests more?
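As a quick sanity check on capping: a back-of-envelope sketch, assuming roughly 1 TB of total cluster memory (as on my cluster) and the 6758 MB-per-container sizing the allocator reports in the log below:

```python
# Back-of-envelope: how many executors can the cluster actually host at once?
# Assumed figures from this thread: ~1 TB total memory across 12 nodes, and
# 6758 MB per executor container (including the 614 MB overhead).

TOTAL_MEMORY_MB = 1 * 1024 * 1024   # ~1 TB, rough
EXECUTOR_MEMORY_MB = 6758           # per-container size from the YarnAllocator log

max_concurrent = TOTAL_MEMORY_MB // EXECUTOR_MEMORY_MB
print(max_concurrent)  # 155 -- nowhere near the 168510 requested
```

So rather than letting the allocator target run uncapped, setting something like `--conf spark.dynamicAllocation.maxExecutors=150` (the setting is standard Spark; the value 150 is just an illustration derived above) would keep the request within what the cluster can actually serve.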
It's a 12-node EMR cluster with more than a TB of memory.

On Fri, Sep 23, 2016 at 5:12 PM, Aditya <aditya.calangut...@augmentiq.co.in> wrote:

> Hi Yash,
>
> What is your total cluster memory and number of cores?
> The problem might be with the number of executors you are allocating. The
> logs show it as 168510, which is on the very high side. Try reducing your
> executors.
>
>
> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>
>> Hi All,
>> I have a Spark job which runs over a huge bulk of data with dynamic
>> allocation enabled.
>> The job takes some 15 minutes to start up and fails as soon as it starts*.
>>
>> Is there anything I can check to debug this problem? There is not a lot
>> of information in the logs about the exact cause, but here is a snapshot
>> below.
>>
>> Thanks All.
>>
>> * - by "starts" I mean when it shows something on the Spark web UI;
>> before that it's just a blank page.
>>
>> Logs here -
>>
>> {code}
>> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
>> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number of 168510 executor(s).
>> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor containers, each with 2 cores and 6758 MB memory including 614 MB overhead
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 22
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 19
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 18
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 12
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 11
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 20
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 15
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 7
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 8
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 16
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 21
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 6
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 13
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 14
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 9
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 3
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 17
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 1
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 10
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 4
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 2
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 5
>> 16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1 time(s) in a row.
>> java.lang.StackOverflowError
>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>> {code}
>>
>> ... <trimmed logs>
>>
>> {code}
>> 16/09/23 06:33:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to get executor loss reason for executor id 7 at RPC address , but got no response. Marking as slave lost.
>> org.apache.spark.SparkException: Fail to find loss reason for non-existent executor 7
>>         at org.apache.spark.deploy.yarn.YarnAllocator.enqueueGetLossReasonRequest(YarnAllocator.scala:554)
>>         at org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint$$anonfun$receiveAndReply$1.applyOrElse(ApplicationMaster.scala:632)
>>         at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>>         at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>         at java.lang.Thread.run(Thread.java:745)
>> {code}
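P.S. For context on where a figure like 168510 can come from: with dynamic allocation, the driver sizes its request from the task backlog divided by the task slots per executor. A rough model of that calculation, assuming the 2-cores-per-executor sizing from the log and the default spark.task.cpus=1 (the backlog number below is inferred for illustration, not taken from the log):

```python
import math

def target_executors(backlog_tasks: int, executor_cores: int, task_cpus: int = 1) -> int:
    """Approximate the dynamic allocator's executor target:
    each executor offers (executor_cores // task_cpus) task slots."""
    slots_per_executor = executor_cores // task_cpus
    return math.ceil(backlog_tasks / slots_per_executor)

# With 2-core executors, a backlog of ~337,020 tasks would produce
# exactly the 168510-executor request seen in the log:
print(target_executors(337020, executor_cores=2))  # 168510
```

So an enormous request like this usually just reflects an enormous number of pending tasks, which is why capping with spark.dynamicAllocation.maxExecutors (or reducing the task count via partitioning) is the usual fix.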