Do you have too many small files you are trying to read? The number of executors is very high.

On 24 Sep 2016 10:28, "Yash Sharma" <yash...@gmail.com> wrote:
> Have been playing around with configs to crack this. Adding them here
> where they might be helpful to others :)
> The number of executors and the timeouts seemed like the core issue.
>
> {code}
> --driver-memory 4G \
> --conf spark.dynamicAllocation.enabled=true \
> --conf spark.dynamicAllocation.maxExecutors=500 \
> --conf spark.core.connection.ack.wait.timeout=6000 \
> --conf spark.akka.heartbeat.interval=6000 \
> --conf spark.akka.frameSize=100 \
> --conf spark.akka.timeout=6000 \
> {code}
>
> Cheers !
>
> On Fri, Sep 23, 2016 at 7:50 PM, <aditya.calangut...@augmentiq.co.in> wrote:
>
>> For testing purposes, can you run with a fixed number of executors and
>> try? Maybe 12 executors for testing, and let us know the status.
>>
>> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" <yash...@gmail.com> wrote:
>>
>>> Thanks Aditya, appreciate the help.
>>>
>>> I had the exact same thought about the huge number of executors requested.
>>> I am going with dynamic executors and not specifying the number of
>>> executors. Are you suggesting that I should limit the number of executors
>>> when the dynamic allocator requests more of them?
>>>
>>> It's a 12-node EMR cluster and has more than a TB of memory.
>>>
>>> On Fri, Sep 23, 2016 at 5:12 PM, Aditya <aditya.calangutkar@augmentiq.co.in> wrote:
>>>
>>>> Hi Yash,
>>>>
>>>> What is your total cluster memory and number of cores?
>>>> The problem might be with the number of executors you are allocating. The
>>>> logs show it as 168510, which is very high. Try reducing your executors.
>>>>
>>>> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>>>>
>>>>> Hi All,
>>>>> I have a Spark job which runs over a huge bulk of data with dynamic
>>>>> allocation enabled.
>>>>> The job takes some 15 minutes to start up and fails as soon as it
>>>>> starts*.
>>>>>
>>>>> Is there anything I can check to debug this problem?
>>>>> There is not a lot of information in the logs about the exact cause,
>>>>> but here is a snapshot below.
>>>>>
>>>>> Thanks All.
>>>>>
>>>>> * - by starts I mean when it shows something on the Spark web UI;
>>>>> before that it's just a blank page.
>>>>>
>>>>> Logs here -
>>>>>
>>>>> {code}
>>>>> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number of 168510 executor(s).
>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor containers, each with 2 cores and 6758 MB memory including 614 MB overhead
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 22
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 19
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 18
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 12
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 11
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 20
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 15
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 7
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 8
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 16
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 21
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 6
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 13
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 14
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 9
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 3
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 17
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 1
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 10
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 4
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 2
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 5
>>>>> 16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1 time(s) in a row.
>>>>> java.lang.StackOverflowError
>>>>> at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>> at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>> at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>> at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>> at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>> at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>> at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>> at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>> at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>> at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>> at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>> {code}
>>>>>
>>>>> ... <trimmed logs>
>>>>>
>>>>> {code}
>>>>> 16/09/23 06:33:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to get executor loss reason for executor id 7 at RPC address , but got no response. Marking as slave lost.
>>>>> org.apache.spark.SparkException: Fail to find loss reason for non-existent executor 7
>>>>> at org.apache.spark.deploy.yarn.YarnAllocator.enqueueGetLossReasonRequest(YarnAllocator.scala:554)
>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint$$anonfun$receiveAndReply$1.applyOrElse(ApplicationMaster.scala:632)
>>>>> at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>>>>> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>>>>> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>>>>> at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>> {code}
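[Editor's note] The 168510-executor request in the log lines up with dynamic allocation sizing itself to the pending task count, and with the `spark.dynamicAllocation.maxExecutors=500` fix capping that request. A rough sketch of the arithmetic follows; the exact heuristic lives in Spark's ExecutorAllocationManager and may differ in detail, and the ~337,000-task figure is back-calculated from the log, not stated in the thread:

```python
import math

def target_executors(pending_tasks, cores_per_executor, cores_per_task=1):
    """Simplified sketch: dynamic allocation asks for enough executors
    to run every pending task concurrently."""
    tasks_per_executor = cores_per_executor // cores_per_task
    return math.ceil(pending_tasks / tasks_per_executor)

def capped(needed, max_executors):
    # spark.dynamicAllocation.maxExecutors bounds the allocator's target
    return min(needed, max_executors)

# 168510 executors x 2 cores each suggests a stage of ~337,000 tasks --
# e.g. one task per small input file/split (hypothetical back-calculation).
needed = target_executors(337020, cores_per_executor=2)
print(needed)                # 168510, matching the YarnAllocator log line
print(capped(needed, 500))   # 500, with the maxExecutors fix applied
```

This is why both suggestions in the thread point the same way: fewer, larger input splits mean fewer pending tasks, and the `maxExecutors` cap bounds the request regardless.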
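[Editor's note] On the StackOverflowError itself: the repeating `MapLike$MappedValues` / `TraversableLike$WithFilter` frames are the signature of Scala's lazy `mapValues`/`filterKeys` views stacked on top of one another. One plausible reading (an assumption, not confirmed anywhere in the thread) is that some executor-tracking map was re-wrapped in such a view on every reporter heartbeat, so each later traversal recursed once per layer until the stack was exhausted. A minimal Python analogue of that mechanism:

```python
class LazyMappedValues:
    """Lazy view over a mapping -- a stand-in for Scala's MapLike.mapValues,
    which wraps the underlying map instead of copying it."""
    def __init__(self, underlying, f):
        self.underlying = underlying
        self.f = f

    def items(self):
        # Every traversal walks down through all accumulated wrappers.
        for k, v in self.underlying.items():
            yield k, self.f(v)

m = {"executor-1": 0}
for _ in range(50):                 # 50 "heartbeats", each re-wrapping the view
    m = LazyMappedValues(m, lambda v: v)

print(dict(m.items()))              # same single entry, read through 50 nested frames
```

With tens of thousands of heartbeats the nesting depth eventually exceeds the stack, which would match the repeated frames in the trace; forcing a strict copy after each update (`.toMap` in Scala, `dict(...)` here) avoids the buildup.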