Hi Dhruve, thanks.
I've solved the issue by adding max executors.
I'd like to find a place in Spark where this behavior could be added, so that
users don't have to worry about setting max executors.
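Roughly the kind of cap I have in mind (a sketch only; apart from the node count, the numbers below are assumptions for illustration, not measured values from this cluster):

```shell
# Sketch: cap executor requests by what the cluster could plausibly host.
nodes=12                 # cluster size from this thread
mem_per_node_gb=85       # assumed: ~1 TB total memory spread over 12 nodes
exec_mem_gb=7            # assumed: ~6758 MB per executor, rounded up
requested=168510         # what the driver asked for in the log

# Executors that fit per node, times node count, gives a rough ceiling.
cap=$(( nodes * (mem_per_node_gb / exec_mem_gb) ))
granted=$(( requested < cap ? requested : cap ))
echo "capping ${requested} requested executors to ${granted}"
```

With these assumed numbers the cap works out to 144 executors instead of 168510.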

Cheers

- Thanks, via mobile,  excuse brevity.

On Sep 24, 2016 1:15 PM, "dhruve ashar" <dhruveas...@gmail.com> wrote:

> From your log, it's trying to launch every executor with approximately
> 6.6 GB of memory. 168510 is an extremely large number of executors, and
> 168510 x 6.6 GB is unrealistic for a 12-node cluster.
> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
> containers, each with 2 cores and 6758 MB memory including 614 MB overhead
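A quick back-of-envelope check of that request, using the 6758 MB per container from the log line:

```shell
# Total memory YARN would need to satisfy the logged request.
executors=168510
mb_per_executor=6758     # includes the 614 MB overhead
total_tb=$(( executors * mb_per_executor / 1024 / 1024 ))
echo "${total_tb} TB requested"   # ~1086 TB, vs ~1 TB available on 12 nodes
```

That is roughly a thousand times the memory a ~1 TB cluster actually has.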
>
> I don't know the size of the data that you are processing here.
>
> Here are some general choices that I would start with.
>
> Start with a smaller number of minimum executors and assign them reasonable
> memory. This could be around 48, assuming 12 nodes x 4 cores each. You could
> start by processing a subset of your data and see if you get
> decent performance, then gradually increase the maximum number of executors
> for dynamic allocation and process the remaining data.
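A possible starting point along those lines (the flag values are illustrative, not a tuned configuration, and `your-app.jar` is a placeholder):

```shell
# Illustrative spark-submit invocation: small minimum, bounded maximum.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=12 \
  --conf spark.dynamicAllocation.maxExecutors=48 \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=6g \
  your-app.jar   # placeholder for the actual application
```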
>
>
>
>
> On Fri, Sep 23, 2016 at 7:54 PM, Yash Sharma <yash...@gmail.com> wrote:
>
>> Is there anywhere I can help fix this?
>>
>> I can see the requests being made in the YarnAllocator. What should
>> the upper limit of the requests be?
>>
>> https://github.com/apache/spark/blob/master/yarn/src/main/
>> scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L222
>>
>> On Sat, Sep 24, 2016 at 10:27 AM, Yash Sharma <yash...@gmail.com> wrote:
>>
>>> I have been playing around with configs to crack this. Adding them here
>>> in case they are helpful to others :)
>>> The number of executors and the timeout seemed to be the core issues.
>>>
>>> {code}
>>> --driver-memory 4G \
>>> --conf spark.dynamicAllocation.enabled=true \
>>> --conf spark.dynamicAllocation.maxExecutors=500 \
>>> --conf spark.core.connection.ack.wait.timeout=6000 \
>>> --conf spark.akka.heartbeat.interval=6000 \
>>> --conf spark.akka.frameSize=100 \
>>> --conf spark.akka.timeout=6000 \
>>> {code}
>>>
>>> Cheers !
>>>
>>> On Fri, Sep 23, 2016 at 7:50 PM, <aditya.calangut...@augmentiq.co.in>
>>> wrote:
>>>
>>>> For testing purposes, can you run with a fixed number of executors and
>>>> try? Maybe 12 executors for testing, and let us know the status.
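Something along these lines, assuming spark-submit with placeholder resource values:

```shell
# Fixed allocation for the test run; the values are illustrative.
spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 12 \
  --executor-cores 2 \
  --executor-memory 6g \
  your-app.jar   # placeholder for the actual application
```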
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" <yash...@gmail.com
>>>> > wrote:
>>>>
>>>> Thanks Aditya, appreciate the help.
>>>>>
>>>>> I had the exact same thought about the huge number of executors being
>>>>> requested. I am going with dynamic executors and not specifying the
>>>>> number of executors. Are you suggesting that I should limit the number
>>>>> of executors when the dynamic allocator requests more?
>>>>>
>>>>> It's a 12-node EMR cluster with more than a TB of memory.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Sep 23, 2016 at 5:12 PM, Aditya <aditya.calangutkar@augmentiq.
>>>>> co.in> wrote:
>>>>>
>>>>>> Hi Yash,
>>>>>>
>>>>>> What is your total cluster memory and number of cores?
>>>>>> The problem might be with the number of executors you are allocating.
>>>>>> The logs show it as 168510, which is very high. Try reducing your
>>>>>> executors.
>>>>>>
>>>>>>
>>>>>> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>> I have a Spark job which runs over a huge bulk of data with dynamic
>>>>>>> allocation enabled.
>>>>>>> The job takes some 15 minutes to start up and fails as soon as it
>>>>>>> starts*.
>>>>>>>
>>>>>>> Is there anything I can check to debug this problem? There is not a
>>>>>>> lot of information in the logs about the exact cause, but here is a
>>>>>>> snapshot below.
>>>>>>>
>>>>>>> Thanks All.
>>>>>>>
>>>>>>> * - by "starts" I mean when it shows something on the Spark web UI;
>>>>>>> before that it's just a blank page.
>>>>>>>
>>>>>>> Logs here -
>>>>>>>
>>>>>>> {code}
>>>>>>> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter
>>>>>>> thread with (heartbeat : 3000, initial allocation : 200) intervals
>>>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total
>>>>>>> number of 168510 executor(s).
>>>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
>>>>>>> containers, each with 2 cores and 6758 MB memory including 614 MB 
>>>>>>> overhead
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 22
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 19
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 18
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 12
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 11
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 20
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 15
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 7
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 8
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 16
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 21
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 6
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 13
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 14
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 9
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 3
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 17
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 1
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 10
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 4
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 2
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 5
>>>>>>> 16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1
>>>>>>> time(s) in a row.
>>>>>>> java.lang.StackOverflowError
>>>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>> {code}
>>>>>>>
>>>>>>> ... <trimmed logs>
>>>>>>>
>>>>>>> {code}
>>>>>>> 16/09/23 06:33:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
>>>>>>> Attempted to get executor loss reason for executor id 7 at RPC address ,
>>>>>>> but got no response. Marking as slave lost.
>>>>>>> org.apache.spark.SparkException: Fail to find loss reason for
>>>>>>> non-existent executor 7
>>>>>>>         at org.apache.spark.deploy.yarn.YarnAllocator.enqueueGetLossReasonRequest(YarnAllocator.scala:554)
>>>>>>>         at org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint$$anonfun$receiveAndReply$1.applyOrElse(ApplicationMaster.scala:632)
>>>>>>>         at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>>>>>>>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>>>>>>>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>>>>>>>         at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>>>>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>>>> {code}
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> -Dhruve Ashar
>
>
