Is there anything I can do to help fix this?

I can see the requests being made in the YARN allocator. What should be the
upper limit on the number of executors requested?

https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L222
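
In case it helps frame the question: the only ceiling I'm aware of under dynamic allocation is spark.dynamicAllocation.maxExecutors, which defaults to Int.MaxValue when unset, so the allocator is free to keep ramping up. A minimal Scala sketch of the kind of cap I have in mind (an illustrative helper for discussion, not the actual YarnAllocator code):

{code}
// Hypothetical helper, for discussion only -- not the real allocator logic.
// The idea: never let the requested total exceed a configured ceiling such as
// spark.dynamicAllocation.maxExecutors (which is Int.MaxValue when unset).
def clampRequestedExecutors(requestedTotal: Int, maxExecutors: Int): Int =
  math.min(requestedTotal, maxExecutors)

// Example: clampRequestedExecutors(168510, 500) == 500
{code}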

On Sat, Sep 24, 2016 at 10:27 AM, Yash Sharma <yash...@gmail.com> wrote:

> I have been playing around with configs to crack this. Adding them here in
> case they are helpful to others :)
> The number of executors and the timeouts seemed to be the core issue; a
> fuller spark-submit sketch follows the flags below.
>
> {code}
> --driver-memory 4G \
> --conf spark.dynamicAllocation.enabled=true \
> --conf spark.dynamicAllocation.maxExecutors=500 \
> --conf spark.core.connection.ack.wait.timeout=6000 \
> --conf spark.akka.heartbeat.interval=6000 \
> --conf spark.akka.frameSize=100 \
> --conf spark.akka.timeout=6000 \
> {code}
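>
> For reference, a complete spark-submit invocation along these lines might
> look roughly like the sketch below. The class name and jar are placeholders,
> and --master yarn / --deploy-mode cluster is an assumption about how the job
> is launched on EMR:
>
> {code}
> spark-submit \
>   --master yarn \
>   --deploy-mode cluster \
>   --driver-memory 4G \
>   --conf spark.dynamicAllocation.enabled=true \
>   --conf spark.dynamicAllocation.maxExecutors=500 \
>   --conf spark.core.connection.ack.wait.timeout=6000 \
>   --conf spark.akka.heartbeat.interval=6000 \
>   --conf spark.akka.frameSize=100 \
>   --conf spark.akka.timeout=6000 \
>   --class com.example.MyJob \
>   my-job.jar
> {code}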
>
> Cheers !
>
> On Fri, Sep 23, 2016 at 7:50 PM, <aditya.calangut...@augmentiq.co.in>
> wrote:
>
>> For testing purposes, can you run with a fixed number of executors and see
>> how it goes? Maybe 12 executors for the test, and let us know the status.
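>>
>> Something along these lines should do it (disable dynamic allocation and
>> pin the executor count; the memory and core values below are placeholders
>> to adjust to your nodes):
>>
>> {code}
>> --conf spark.dynamicAllocation.enabled=false \
>> --num-executors 12 \
>> --executor-cores 2 \
>> --executor-memory 6G \
>> {code}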
>>
>>
>>
>>
>> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" <yash...@gmail.com>
>> wrote:
>>
>> Thanks Aditya, I appreciate the help.
>>>
>>> I had the same thought about the huge number of executors being requested.
>>> I am using dynamic allocation and not specifying the number of executors.
>>> Are you suggesting that I should cap the number of executors the dynamic
>>> allocator can request?
>>>
>>> It's a 12-node EMR cluster with more than a TB of memory.
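>>>
>>> As a rough sanity check (back-of-the-envelope, using the 6758 MB container
>>> size from the logs): ~1 TB of memory fits on the order of 1,000,000 MB /
>>> 6758 MB ≈ 150 containers, so the 168510 executors being requested is
>>> roughly a thousand times what the cluster could run at once.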
>>>
>>>
>>>
>>> On Fri, Sep 23, 2016 at 5:12 PM, Aditya <aditya.calangutkar@augmentiq.co.in> wrote:
>>>
>>>> Hi Yash,
>>>>
>>>> What is your total cluster memory and number of cores?
>>>> The problem might be the number of executors you are requesting. The logs
>>>> show 168510, which is extremely high. Try reducing the number of
>>>> executors.
>>>>
>>>>
>>>> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>>>>
>>>>> Hi All,
>>>>> I have a Spark job that runs over a huge amount of data with dynamic
>>>>> allocation enabled.
>>>>> The job takes about 15 minutes to start up and fails as soon as it
>>>>> starts*.
>>>>>
>>>>> Is there anything I can check to debug this problem? There is not a lot
>>>>> of information in the logs about the exact cause, but here is a snapshot
>>>>> below.
>>>>>
>>>>> Thanks All.
>>>>>
>>>>> * - by "starts" I mean when it shows something on the Spark web UI;
>>>>> before that it's just a blank page.
>>>>>
>>>>> Logs here -
>>>>>
>>>>> {code}
>>>>> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number of 168510 executor(s).
>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor containers, each with 2 cores and 6758 MB memory including 614 MB overhead
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 22
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 19
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 18
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 12
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 11
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 20
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 15
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 7
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 8
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 16
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 21
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 6
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 13
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 14
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 9
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 3
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 17
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 1
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 10
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 4
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 2
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 5
>>>>> 16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1 time(s) in a row.
>>>>> java.lang.StackOverflowError
>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>> {code}
>>>>>
>>>>> ... <trimmed logs>
>>>>>
>>>>> {code}
>>>>> 16/09/23 06:33:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to get executor loss reason for executor id 7 at RPC address , but got no response. Marking as slave lost.
>>>>> org.apache.spark.SparkException: Fail to find loss reason for non-existent executor 7
>>>>>         at org.apache.spark.deploy.yarn.YarnAllocator.enqueueGetLossReasonRequest(YarnAllocator.scala:554)
>>>>>         at org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint$$anonfun$receiveAndReply$1.applyOrElse(ApplicationMaster.scala:632)
>>>>>         at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>>>>>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>>>>>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>>>>>         at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>> {code}
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
