We have too many (large) files: about 30k partitions with about 4 years'
worth of data, and we need to process the entire history in a one-time
monolithic job.

I would like to know how Spark decides the number of executors to request.
I've seen test cases where the max executor count is Integer.MAX_VALUE, and
was wondering if we can compute an appropriate max executor count from the
cluster resources instead.
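
As a rough illustration, here is a back-of-envelope sketch (in Scala) of
capping the executor count by whichever cluster resource runs out first. The
per-node figures are assumptions about our EMR nodes, not measured values;
the per-executor sizes are taken from the YARN logs quoted below.

{code}
// Hedged sketch: derive an upper bound for spark.dynamicAllocation.maxExecutors
// from total cluster resources. Per-node figures are assumptions.
val nodes = 12
val coresPerNode = 16             // assumption
val memPerNodeMb = 120 * 1024     // assumption, ~1.4 TB cluster-wide

val executorCores = 2             // from the YARN logs quoted below
val executorMemMb = 6758          // container size incl. 614 MB overhead, from the logs

val byCores = (nodes * coresPerNode) / executorCores  // 96
val byMem   = (nodes * memPerNodeMb) / executorMemMb  // ~218
val maxExecutors = math.min(byCores, byMem)           // 96
// Anything requested above this bound would just sit queued in YARN.
{code}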

Would be happy to contribute back if I can get some info on how the executor
requests work.

Cheers


On Sat, Sep 24, 2016, 6:39 PM ayan guha <guha.a...@gmail.com> wrote:

> Do you have too many small files that you are trying to read? The number of
> executors is very high.
> On 24 Sep 2016 10:28, "Yash Sharma" <yash...@gmail.com> wrote:
>
>> I have been playing around with configs to crack this. Adding them here
>> in case they are helpful to others :)
>> The number of executors and the timeouts seemed to be the core issue.
>>
>> {code}
>> --driver-memory 4G \
>> --conf spark.dynamicAllocation.enabled=true \
>> --conf spark.dynamicAllocation.maxExecutors=500 \
>> --conf spark.core.connection.ack.wait.timeout=6000 \
>> --conf spark.akka.heartbeat.interval=6000 \
>> --conf spark.akka.frameSize=100 \
>> --conf spark.akka.timeout=6000 \
>> {code}
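>>
>> For what it's worth, spark.dynamicAllocation.maxExecutors=500 caps how many
>> executors the driver can ask YARN for, so the 168510-executor request seen
>> in the logs below can no longer happen.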
>>
>> Cheers!
>>
>> On Fri, Sep 23, 2016 at 7:50 PM, <aditya.calangut...@augmentiq.co.in>
>> wrote:
>>
>>> For testing purposes, can you run with a fixed number of executors and
>>> try? Maybe 12 executors for testing, and let us know the status.
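>>>
>>> For instance, something like this (a sketch, assuming YARN mode; these are
>>> standard spark-submit flags):
>>>
>>> {code}
>>> --conf spark.dynamicAllocation.enabled=false \
>>> --num-executors 12 \
>>> --executor-cores 2 \
>>> --executor-memory 6G \
>>> {code}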
>>>
>>>
>>>
>>>
>>> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" <yash...@gmail.com>
>>> wrote:
>>>
>>> Thanks Aditya, appreciate the help.
>>>>
>>>> I had the exact same thought about the huge number of executors requested.
>>>> I am going with dynamic executors and not specifying the number of
>>>> executors. Are you suggesting that I should limit the number of executors
>>>> when the dynamic allocator requests more?
>>>>
>>>> It's a 12-node EMR cluster and has more than a TB of memory.
>>>>
>>>>
>>>>
>>>> On Fri, Sep 23, 2016 at 5:12 PM, Aditya <
>>>> aditya.calangut...@augmentiq.co.in> wrote:
>>>>
>>>>> Hi Yash,
>>>>>
>>>>> What is your total cluster memory and number of cores?
>>>>> The problem might be with the number of executors you are allocating. The
>>>>> logs show it as 168510, which is on the very high side. Try reducing your
>>>>> executors.
>>>>>
>>>>>
>>>>> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>>>>>
>>>>>> Hi All,
>>>>>> I have a Spark job which runs over a huge bulk of data with dynamic
>>>>>> allocation enabled.
>>>>>> The job takes some 15 minutes to start up and fails as soon as it
>>>>>> starts*.
>>>>>>
>>>>>> Is there anything I can check to debug this problem? There is not a
>>>>>> lot of information in the logs about the exact cause, but here is a
>>>>>> snapshot below.
>>>>>>
>>>>>> Thanks All.
>>>>>>
>>>>>> * - by "starts" I mean when the job shows something on the Spark web
>>>>>> UI; before that it's just a blank page.
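>>>>>>
>>>>>> In case it helps, I am pulling the full container logs with the standard
>>>>>> YARN CLI (<application-id> is a placeholder):
>>>>>>
>>>>>> {code}
>>>>>> yarn logs -applicationId <application-id>
>>>>>> {code}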
>>>>>>
>>>>>> Logs here -
>>>>>>
>>>>>> {code}
>>>>>> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter
>>>>>> thread with (heartbeat : 3000, initial allocation : 200) intervals
>>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number
>>>>>> of 168510 executor(s).
>>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
>>>>>> containers, each with 2 cores and 6758 MB memory including 614 MB 
>>>>>> overhead
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 22
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 19
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 18
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 12
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 11
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 20
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 15
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 7
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 8
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 16
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 21
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 6
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 13
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 14
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 9
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 3
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 17
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 1
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 10
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 4
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 2
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 5
>>>>>> 16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1
>>>>>> time(s) in a row.
>>>>>> java.lang.StackOverflowError
>>>>>>         at
>>>>>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>         at
>>>>>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>         at
>>>>>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>>         at
>>>>>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>         at
>>>>>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>         at
>>>>>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>>         at
>>>>>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>         at
>>>>>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>         at
>>>>>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>>         at
>>>>>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>         at
>>>>>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>> {code}
>>>>>>
>>>>>> ... <trimmed logs>
>>>>>>
>>>>>> {code}
>>>>>> 16/09/23 06:33:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
>>>>>> Attempted to get executor loss reason for executor id 7 at RPC address ,
>>>>>> but got no response. Marking as slave lost.
>>>>>> org.apache.spark.SparkException: Fail to find loss reason for
>>>>>> non-existent executor 7
>>>>>>         at
>>>>>> org.apache.spark.deploy.yarn.YarnAllocator.enqueueGetLossReasonRequest(YarnAllocator.scala:554)
>>>>>>         at
>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint$$anonfun$receiveAndReply$1.applyOrElse(ApplicationMaster.scala:632)
>>>>>>         at
>>>>>> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>>>>>>         at
>>>>>> org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>>>>>>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>>>>>>         at
>>>>>> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>>>>>>         at
>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>         at
>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>>> {code}
>>>>>>