I've been playing around with configs to crack this, and am adding them here
in case they help others :)
The number of executors and the timeouts seemed to be the core issue.

{code}
--driver-memory 4G \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=500 \
--conf spark.core.connection.ack.wait.timeout=6000 \
--conf spark.akka.heartbeat.interval=6000 \
--conf spark.akka.frameSize=100 \
--conf spark.akka.timeout=6000 \
{code}
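For reference, a full submit with these settings might look like the sketch below. The master/deploy-mode flags and the script name are placeholders, not part of the original command; also note that the spark.akka.* settings only apply to Spark releases before 2.0, which still used Akka for RPC:

{code}
# Hypothetical full invocation; my_job.py and the master settings are placeholders
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4G \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.maxExecutors=500 \
  --conf spark.core.connection.ack.wait.timeout=6000 \
  --conf spark.akka.heartbeat.interval=6000 \
  --conf spark.akka.frameSize=100 \
  --conf spark.akka.timeout=6000 \
  my_job.py
{code}

The key line is spark.dynamicAllocation.maxExecutors=500, which caps how many containers the dynamic allocator can request from YARN.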

Cheers!

On Fri, Sep 23, 2016 at 7:50 PM, <aditya.calangut...@augmentiq.co.in> wrote:

> For testing purposes, can you run with a fixed number of executors and try?
> Maybe 12 executors for testing, and let me know the status.
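> A minimal sketch of that fixed-executor test run (these are standard
> spark-submit flags; the value 12 is just for the test):
>
> {code}
> --num-executors 12 \
> --conf spark.dynamicAllocation.enabled=false \
> {code}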
>
>
> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" <yash...@gmail.com>
> wrote:
>
> Thanks Aditya, appreciate the help.
>>
>> I had the exact same thought about the huge number of executors requested.
>> I am going with dynamic executors and not specifying the number of
>> executors. Are you suggesting that I should limit the number of executors
>> when the dynamic allocator requests more?
>>
>> It's a 12-node EMR cluster with more than a TB of memory.
>>
>>
>>
>> On Fri, Sep 23, 2016 at 5:12 PM, Aditya <aditya.calangutkar@augmentiq.co.in> wrote:
>>
>>> Hi Yash,
>>>
>>> What is your total cluster memory and number of cores?
>>> The problem might be the number of executors you are allocating. The
>>> logs show 168510, which is very high. Try reducing your executors.
>>>
>>>
>>> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>>>
>>>> Hi All,
>>>> I have a Spark job which runs over a huge bulk of data with dynamic
>>>> allocation enabled.
>>>> The job takes some 15 minutes to start up and fails as soon as it
>>>> starts*.
>>>>
>>>> Is there anything I can check to debug this problem? There is not a lot
>>>> of information in the logs about the exact cause, but here is a snapshot below.
>>>>
>>>> Thanks All.
>>>>
>>>> * - by "starts" I mean when it shows something on the Spark web UI;
>>>> before that it's just a blank page.
>>>>
>>>> Logs here -
>>>>
>>>> {code}
16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
>>>> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number of 168510 executor(s).
>>>> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor containers, each with 2 cores and 6758 MB memory including 614 MB overhead
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 22
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 19
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 18
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 12
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 11
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 20
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 15
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 7
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 8
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 16
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 21
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 6
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 13
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 14
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 9
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 3
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 17
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 1
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 10
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 4
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 2
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 5
>>>> 16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1 time(s) in a row.
>>>> java.lang.StackOverflowError
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>> {code}
>>>>
>>>> ... <trimmed logs>
>>>>
>>>> {code}
16/09/23 06:33:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to get executor loss reason for executor id 7 at RPC address , but got no response. Marking as slave lost.
>>>> org.apache.spark.SparkException: Fail to find loss reason for non-existent executor 7
>>>>         at org.apache.spark.deploy.yarn.YarnAllocator.enqueueGetLossReasonRequest(YarnAllocator.scala:554)
>>>>         at org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint$$anonfun$receiveAndReply$1.applyOrElse(ApplicationMaster.scala:632)
>>>>         at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>>>>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>>>>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>>>>         at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>         at java.lang.Thread.run(Thread.java:745)
>>>> {code}
>>>>
>>>
>>
>
