Hi Dhruve, thanks. I've solved the issue by adding max executors. I wanted to find some place in Spark where this behavior could be added so that users don't have to worry about setting max executors.
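For context, here is a minimal sketch of the kind of built-in cap I have in mind. The object and method names below are illustrative assumptions, not actual Spark internals:

{code}
// Purely illustrative: a default ceiling on executor requests so a
// runaway dynamic-allocation estimate cannot ask the cluster manager
// for an unbounded number of containers. All names are assumptions.
object ExecutorRequestCap {
  val DefaultMaxExecutors = 500

  def capRequest(requested: Int, maxExecutors: Int = DefaultMaxExecutors): Int = {
    val capped = math.min(requested, maxExecutors)
    if (capped < requested)
      println(s"WARN: driver requested $requested executors, capping at $capped")
    capped
  }

  def main(args: Array[String]): Unit = {
    println(capRequest(168510)) // the runaway request from the logs below; returns 500
  }
}
{code}

In practice the cap would have to interact with spark.dynamicAllocation.maxExecutors, which already provides this bound when the user sets it explicitly.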
Cheers -
Thanks, via mobile, excuse brevity.

On Sep 24, 2016 1:15 PM, "dhruve ashar" <dhruveas...@gmail.com> wrote:

> From your log, it's trying to launch every executor with approximately
> 6.6GB of memory. 168510 is an extremely large number of executors, and
> 168510 x 6.6GB is unrealistic for a 12 node cluster.
> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
> containers, each with 2 cores and 6758 MB memory including 614 MB overhead
>
> I don't know the size of the data that you are processing here.
>
> Here are some general choices that I would start with.
>
> Start with a smaller number of minimum executors and assign them
> reasonable memory. This could be around 48, assuming 12 nodes x 4 cores
> each. You could start by processing a subset of your data and see if you
> get decent performance. Then gradually increase the maximum number of
> executors for dynamic allocation and process the remaining data.
>
> On Fri, Sep 23, 2016 at 7:54 PM, Yash Sharma <yash...@gmail.com> wrote:
>
>> Is there anywhere I can help fix this?
>>
>> I can see the requests being made in the YARN allocator. What should be
>> the upper limit of the requests made?
>>
>> https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L222
>>
>> On Sat, Sep 24, 2016 at 10:27 AM, Yash Sharma <yash...@gmail.com> wrote:
>>
>>> Have been playing around with configs to crack this. Adding them here
>>> where they may be helpful to others :)
>>> The number of executors and the timeouts seemed like the core issue.
>>>
>>> {code}
>>> --driver-memory 4G \
>>> --conf spark.dynamicAllocation.enabled=true \
>>> --conf spark.dynamicAllocation.maxExecutors=500 \
>>> --conf spark.core.connection.ack.wait.timeout=6000 \
>>> --conf spark.akka.heartbeat.interval=6000 \
>>> --conf spark.akka.frameSize=100 \
>>> --conf spark.akka.timeout=6000 \
>>> {code}
>>>
>>> Cheers!
>>>
>>> On Fri, Sep 23, 2016 at 7:50 PM, <aditya.calangut...@augmentiq.co.in> wrote:
>>>
>>>> For testing purposes, can you run with a fixed number of executors?
>>>> Maybe 12 executors for testing, and let us know the status.
>>>>
>>>> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" <yash...@gmail.com> wrote:
>>>>
>>>>> Thanks Aditya, appreciate the help.
>>>>>
>>>>> I had the exact same thought about the huge number of executors
>>>>> requested. I am going with dynamic executors and not specifying the
>>>>> number of executors. Are you suggesting that I should limit the number
>>>>> of executors when the dynamic allocator requests more?
>>>>>
>>>>> It's a 12 node EMR cluster with more than a TB of memory.
>>>>>
>>>>> On Fri, Sep 23, 2016 at 5:12 PM, Aditya <aditya.calangutkar@augmentiq.co.in> wrote:
>>>>>
>>>>>> Hi Yash,
>>>>>>
>>>>>> What is your total cluster memory and number of cores?
>>>>>> The problem might be the number of executors you are allocating: the
>>>>>> log shows 168510, which is very high. Try reducing your executors.
>>>>>>
>>>>>> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>> I have a Spark job which runs over a huge bulk of data with dynamic
>>>>>>> allocation enabled.
>>>>>>> The job takes some 15 minutes to start up and fails as soon as it
>>>>>>> starts*.
>>>>>>>
>>>>>>> Is there anything I can check to debug this problem?
>>>>>>> There is not a lot of information in the logs about the exact cause,
>>>>>>> but here is a snapshot below.
>>>>>>>
>>>>>>> Thanks All.
>>>>>>>
>>>>>>> * - by starts I mean when it shows something on the Spark web UI;
>>>>>>> before that it's just a blank page.
>>>>>>>
>>>>>>> Logs here -
>>>>>>>
>>>>>>> {code}
>>>>>>> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
>>>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number of 168510 executor(s).
>>>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor containers, each with 2 cores and 6758 MB memory including 614 MB overhead
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 22
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 19
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 18
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 12
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 11
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 20
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 15
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 7
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 8
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 16
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 21
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 6
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 13
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 14
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 9
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 3
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 17
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 1
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 10
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 4
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 2
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 5
>>>>>>> 16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1 time(s) in a row.
>>>>>>> java.lang.StackOverflowError
>>>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>> {code}
>>>>>>>
>>>>>>> ... <trimmed logs>
>>>>>>>
>>>>>>> {code}
>>>>>>> 16/09/23 06:33:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to get executor loss reason for executor id 7 at RPC address , but got no response. Marking as slave lost.
>>>>>>> org.apache.spark.SparkException: Fail to find loss reason for non-existent executor 7
>>>>>>>         at org.apache.spark.deploy.yarn.YarnAllocator.enqueueGetLossReasonRequest(YarnAllocator.scala:554)
>>>>>>>         at org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint$$anonfun$receiveAndReply$1.applyOrElse(ApplicationMaster.scala:632)
>>>>>>>         at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>>>>>>>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>>>>>>>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>>>>>>>         at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>>>>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>>>> {code}
>
> --
> -Dhruve Ashar
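For anyone who lands on this thread later: a minimal sketch of bounding dynamic allocation programmatically, along the lines of the advice above. The min/max values are illustrative for a 12-node cluster and should be tuned to your node and core counts.

{code}
import org.apache.spark.SparkConf

object BoundedDynamicAllocation {
  // Bound dynamic allocation explicitly instead of relying on the
  // driver's estimate of how many executors the job needs.
  val conf: SparkConf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true") // required for dynamic allocation on YARN
    .set("spark.dynamicAllocation.minExecutors", "12")  // illustrative: ~1 per node
    .set("spark.dynamicAllocation.maxExecutors", "500") // the value that worked in this thread
}
{code}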