Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-24 Thread Yash Sharma
We have many large files: about 30k partitions covering roughly 4
years' worth of data, and we need to process the entire history in a one-time
monolithic job.

I would like to know how Spark decides the number of executors to request.
I've seen test cases where the max executor count is Integer.MAX_VALUE, and
was wondering if we could compute an appropriate max executor count from the
cluster resources instead.

Would be happy to contribute a fix back if I can get some info on how the
executor requests are computed.

Cheers
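To make the question above concrete, here is a rough sketch of how a
cluster-derived ceiling for max executors could be computed. The function name
and per-node figures are hypothetical, not Spark code:

```python
def max_executors(num_nodes, mem_per_node_mb, cores_per_node,
                  executor_mem_mb, executor_cores):
    # Executors that fit on one node, bounded by memory and by cores;
    # the cluster-wide ceiling is the per-node minimum times the node count.
    by_mem = mem_per_node_mb // executor_mem_mb
    by_cores = cores_per_node // executor_cores
    return num_nodes * min(by_mem, by_cores)

# Hypothetical 12-node cluster, ~96 GB and 8 cores per node, with
# executors sized as in the logs (6758 MB, 2 cores).
print(max_executors(12, 98304, 8, 6758, 2))  # -> 48
```

With executor sizes like those in the logs, a 12-node cluster caps out at a
few dozen executors, nowhere near 168510.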


On Sat, Sep 24, 2016, 6:39 PM ayan guha  wrote:

> Do you have too many small files you are trying to read? Number of
> executors are very high
> On 24 Sep 2016 10:28, "Yash Sharma"  wrote:
>
>> Have been playing around with configs to crack this. Adding them here
>> where it would be helpful to others :)
>> Number of executors and timeout seemed like the core issue.
>>
>> {code}
>> --driver-memory 4G \
>> --conf spark.dynamicAllocation.enabled=true \
>> --conf spark.dynamicAllocation.maxExecutors=500 \
>> --conf spark.core.connection.ack.wait.timeout=6000 \
>> --conf spark.akka.heartbeat.interval=6000 \
>> --conf spark.akka.frameSize=100 \
>> --conf spark.akka.timeout=6000 \
>> {code}
>>
>> Cheers !
>>
>> On Fri, Sep 23, 2016 at 7:50 PM, 
>> wrote:
>>
>>> For testing purpose can you run with fix number of executors and try.
>>> May be 12 executors for testing and let know the status.
>>>
>>> Get Outlook for Android 
>>>
>>>
>>>
>>> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" 
>>> wrote:
>>>
>>> Thanks Aditya, appreciate the help.

 I had the exact thought about the huge number of executors requested.
 I am going with the dynamic executors and not specifying the number of
 executors. Are you suggesting that I should limit the number of executors
 when the dynamic allocator requests for more number of executors.

 Its a 12 node EMR cluster and has more than a Tb of memory.



 On Fri, Sep 23, 2016 at 5:12 PM, Aditya <
 aditya.calangut...@augmentiq.co.in> wrote:

> Hi Yash,
>
> What is your total cluster memory and number of cores?
> Problem might be with the number of executors you are allocating. The
> logs shows it as 168510 which is on very high side. Try reducing your
> executors.
>
>
> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>
>> Hi All,
>> I have a spark job which runs over a huge bulk of data with Dynamic
>> allocation enabled.
>> The job takes some 15 minutes to start up and fails as soon as it
>> starts*.
>>
>> Is there anything I can check to debug this problem. There is not a
>> lot of information in logs for the exact cause but here is some snapshot
>> below.
>>
>> Thanks All.
>>
>> * - by starts I mean when it shows something on the spark web ui,
>> before that its just blank page.
>>
>> Logs here -
>>
>> {code}
>> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter
>> thread with (heartbeat : 3000, initial allocation : 200) intervals
>> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number
>> of 168510 executor(s).
>> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
>> containers, each with 2 cores and 6758 MB memory including 614 MB 
>> overhead
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>> for non-existent executor 22
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>> for non-existent executor 19
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>> for non-existent executor 18
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>> for non-existent executor 12
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>> for non-existent executor 11
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>> for non-existent executor 20
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>> for non-existent executor 15
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>> for non-existent executor 7
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>> for non-existent executor 8
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>> for non-existent executor 16
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>> for non-existent executor 21
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>> for non-existent executor 6
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>> for non-existent executor 13
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
Hi Dhruve, thanks.
I've solved the issue by adding a max executors limit.
I wanted to find a place in Spark where I could add this behavior, so that
users would not have to worry about setting max executors themselves.

Cheers

- Thanks, via mobile,  excuse brevity.
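For scale, the allocator request quoted in the log (168510 executors at 6758 MB
each) can be sanity-checked with quick arithmetic; this is an illustrative
sketch, not Spark code:

```python
def requested_memory_tb(executors, mem_per_executor_mb):
    # Total memory implied by the allocator's request, in TB.
    return executors * mem_per_executor_mb / (1024 * 1024)

# Figures straight from the log: 168510 executors, 6758 MB each
# (including the 614 MB overhead).
total_tb = requested_memory_tb(168510, 6758)
print(round(total_tb))  # -> 1086 (TB), vs roughly 1 TB on the 12-node cluster
```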

On Sep 24, 2016 1:15 PM, "dhruve ashar"  wrote:

> From your log, it's trying to launch every executor with approximately
> 6.6 GB of memory. 168510 is an extremely huge number of executors, and 168510 x
> 6.6 GB is unrealistic for a 12-node cluster.
> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
> containers, each with 2 cores and 6758 MB memory including 614 MB overhead
>
> I don't know the size of the data that you are processing here.
>
> Here are some general choices that I would start with.
>
> Start with a smaller number of minimum executors and assign them reasonable
> memory. This could be around 48, assuming 12 nodes x 4 cores each. You could
> start by processing a subset of your data and see if you are able to get
> decent performance. Then gradually increase the maximum number of executors
> for dynamic allocation and process the remaining data.
>
>
>
>
> On Fri, Sep 23, 2016 at 7:54 PM, Yash Sharma  wrote:
>
>> Is there anywhere I can help fix this ?
>>
>> I can see the requests being made in the yarn allocator. What should be
>> the upperlimit of the requests made ?
>>
>> https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L222
>>
>> On Sat, Sep 24, 2016 at 10:27 AM, Yash Sharma  wrote:
>>
>>> Have been playing around with configs to crack this. Adding them here
>>> where it would be helpful to others :)
>>> Number of executors and timeout seemed like the core issue.
>>>
>>> {code}
>>> --driver-memory 4G \
>>> --conf spark.dynamicAllocation.enabled=true \
>>> --conf spark.dynamicAllocation.maxExecutors=500 \
>>> --conf spark.core.connection.ack.wait.timeout=6000 \
>>> --conf spark.akka.heartbeat.interval=6000 \
>>> --conf spark.akka.frameSize=100 \
>>> --conf spark.akka.timeout=6000 \
>>> {code}
>>>
>>> Cheers !
>>>
>>> On Fri, Sep 23, 2016 at 7:50 PM, 
>>> wrote:
>>>
 For testing purpose can you run with fix number of executors and try.
 May be 12 executors for testing and let know the status.

 Get Outlook for Android 



 On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma"  wrote:

 Thanks Aditya, appreciate the help.
>
> I had the exact thought about the huge number of executors requested.
> I am going with the dynamic executors and not specifying the number of
> executors. Are you suggesting that I should limit the number of executors
> when the dynamic allocator requests for more number of executors.
>
> Its a 12 node EMR cluster and has more than a Tb of memory.
>
>
>
> On Fri, Sep 23, 2016 at 5:12 PM, Aditya  co.in> wrote:
>
>> Hi Yash,
>>
>> What is your total cluster memory and number of cores?
>> Problem might be with the number of executors you are allocating. The
>> logs shows it as 168510 which is on very high side. Try reducing your
>> executors.
>>
>>
>> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>>
>>> Hi All,
>>> I have a spark job which runs over a huge bulk of data with Dynamic
>>> allocation enabled.
>>> The job takes some 15 minutes to start up and fails as soon as it
>>> starts*.
>>>
>>> Is there anything I can check to debug this problem. There is not a
>>> lot of information in logs for the exact cause but here is some snapshot
>>> below.
>>>
>>> Thanks All.
>>>
>>> * - by starts I mean when it shows something on the spark web ui,
>>> before that its just blank page.
>>>
>>> Logs here -
>>>
>>> {code}
>>> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter
>>> thread with (heartbeat : 3000, initial allocation : 200) intervals
>>> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total
>>> number of 168510 executor(s).
>>> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
>>> containers, each with 2 cores and 6758 MB memory including 614 MB 
>>> overhead
>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>> for non-existent executor 22
>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>> for non-existent executor 19
>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>> for non-existent executor 18
>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>> for non-existent executor 12
>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>> for non-existent executor 11

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
Is there anywhere I can help fix this?

I can see the requests being made in the YARN allocator. What should the
upper limit on the requests be?

https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L222
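One possible shape for such an upper limit, sketched in isolation (hypothetical
Python, not the actual Scala in YarnAllocator):

```python
def clamp_executor_request(requested: int, max_executors: int) -> int:
    # Cap whatever total the driver asks for at a configured ceiling,
    # in the spirit of spark.dynamicAllocation.maxExecutors.
    return min(requested, max_executors)

print(clamp_executor_request(168510, 500))  # -> 500
```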

On Sat, Sep 24, 2016 at 10:27 AM, Yash Sharma  wrote:

> Have been playing around with configs to crack this. Adding them here
> where it would be helpful to others :)
> Number of executors and timeout seemed like the core issue.
>
> {code}
> --driver-memory 4G \
> --conf spark.dynamicAllocation.enabled=true \
> --conf spark.dynamicAllocation.maxExecutors=500 \
> --conf spark.core.connection.ack.wait.timeout=6000 \
> --conf spark.akka.heartbeat.interval=6000 \
> --conf spark.akka.frameSize=100 \
> --conf spark.akka.timeout=6000 \
> {code}
>
> Cheers !
>
> On Fri, Sep 23, 2016 at 7:50 PM, 
> wrote:
>
>> For testing purpose can you run with fix number of executors and try. May
>> be 12 executors for testing and let know the status.
>>
>> Get Outlook for Android 
>>
>>
>>
>> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" 
>> wrote:
>>
>> Thanks Aditya, appreciate the help.
>>>
>>> I had the exact thought about the huge number of executors requested.
>>> I am going with the dynamic executors and not specifying the number of
>>> executors. Are you suggesting that I should limit the number of executors
>>> when the dynamic allocator requests for more number of executors.
>>>
>>> Its a 12 node EMR cluster and has more than a Tb of memory.
>>>
>>>
>>>
>>> On Fri, Sep 23, 2016 at 5:12 PM, Aditya >> co.in> wrote:
>>>
 Hi Yash,

 What is your total cluster memory and number of cores?
 Problem might be with the number of executors you are allocating. The
 logs shows it as 168510 which is on very high side. Try reducing your
 executors.


 On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:

> Hi All,
> I have a spark job which runs over a huge bulk of data with Dynamic
> allocation enabled.
> The job takes some 15 minutes to start up and fails as soon as it
> starts*.
>
> Is there anything I can check to debug this problem. There is not a
> lot of information in logs for the exact cause but here is some snapshot
> below.
>
> Thanks All.
>
> * - by starts I mean when it shows something on the spark web ui,
> before that its just blank page.
>
> Logs here -
>
> {code}
> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter
> thread with (heartbeat : 3000, initial allocation : 200) intervals
> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number
> of 168510 executor(s).
> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
> containers, each with 2 cores and 6758 MB memory including 614 MB overhead
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 22
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 19
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 18
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 12
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 11
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 20
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 15
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 7
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 8
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 16
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 21
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 6
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 13
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 14
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 9
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 3
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 17
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 1
> 16/09/23 06:33:36 WARN 

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
Have been playing around with configs to crack this; adding them here in case
they are helpful to others :)
The number of executors and the timeouts seemed to be the core issue.

{code}
--driver-memory 4G \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=500 \
--conf spark.core.connection.ack.wait.timeout=6000 \
--conf spark.akka.heartbeat.interval=6000 \
--conf spark.akka.frameSize=100 \
--conf spark.akka.timeout=6000 \
{code}

Cheers !

On Fri, Sep 23, 2016 at 7:50 PM,  wrote:

> For testing purpose can you run with fix number of executors and try. May
> be 12 executors for testing and let know the status.
>
> Get Outlook for Android 
>
>
>
> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" 
> wrote:
>
> Thanks Aditya, appreciate the help.
>>
>> I had the exact thought about the huge number of executors requested.
>> I am going with the dynamic executors and not specifying the number of
>> executors. Are you suggesting that I should limit the number of executors
>> when the dynamic allocator requests for more number of executors.
>>
>> Its a 12 node EMR cluster and has more than a Tb of memory.
>>
>>
>>
>> On Fri, Sep 23, 2016 at 5:12 PM, Aditya > co.in> wrote:
>>
>>> Hi Yash,
>>>
>>> What is your total cluster memory and number of cores?
>>> Problem might be with the number of executors you are allocating. The
>>> logs shows it as 168510 which is on very high side. Try reducing your
>>> executors.
>>>
>>>
>>> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>>>
 Hi All,
 I have a spark job which runs over a huge bulk of data with Dynamic
 allocation enabled.
 The job takes some 15 minutes to start up and fails as soon as it
 starts*.

 Is there anything I can check to debug this problem. There is not a lot
 of information in logs for the exact cause but here is some snapshot below.

 Thanks All.

 * - by starts I mean when it shows something on the spark web ui,
 before that its just blank page.

 Logs here -

 {code}
 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter
 thread with (heartbeat : 3000, initial allocation : 200) intervals
 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number
 of 168510 executor(s).
 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
 containers, each with 2 cores and 6758 MB memory including 614 MB overhead
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 22
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 19
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 18
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 12
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 11
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 20
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 15
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 7
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 8
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 16
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 21
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 6
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 13
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 14
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 9
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 3
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 17
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 1
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 10
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 4
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 2
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 5
 16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1
 time(s) in a row.

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread aditya . calangutkar


For testing purposes, can you run with a fixed number of executors and try? Maybe
12 executors for testing, and let us know the status.


Get Outlook for Android






On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma"  wrote:










Thanks Aditya, I appreciate the help.
I had the same thought about the huge number of executors requested. I am
using dynamic executors and not specifying the number of executors explicitly.
Are you suggesting that I should limit the number of executors when the
dynamic allocator requests more of them?
It's a 12-node EMR cluster with more than a TB of memory.


On Fri, Sep 23, 2016 at 5:12 PM, Aditya  
wrote:
Hi Yash,



What is your total cluster memory and number of cores?

Problem might be with the number of executors you are allocating. The logs 
shows it as 168510 which is on very high side. Try reducing your executors.



On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:


Hi All,

I have a spark job which runs over a huge bulk of data with Dynamic allocation 
enabled.

The job takes some 15 minutes to start up and fails as soon as it starts*.



Is there anything I can check to debug this problem. There is not a lot of 
information in logs for the exact cause but here is some snapshot below.



Thanks All.



* - by starts I mean when it shows something on the spark web ui, before that 
its just blank page.



Logs here -



{code}

16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter thread with 
(heartbeat : 3000, initial allocation : 200) intervals

16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number of 168510 
executor(s).

16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor containers, 
each with 2 cores and 6758 MB memory including 614 MB overhead

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 22

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 19

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 18

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 12

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 11

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 20

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 15

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 7

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 8

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 16

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 21

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 6

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 13

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 14

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 9

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 3

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 17

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 1

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 10

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 4

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 2

16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 5

16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1 time(s) in a 
row.

java.lang.StackOverflowError

        at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)

        at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)

        at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

        at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)

        at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)

        at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

        at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)

        at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)

        at 

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
Thanks Aditya, I appreciate the help.

I had the same thought about the huge number of executors requested.
I am using dynamic executors and not specifying the number of executors
explicitly. Are you suggesting that I should limit the number of executors
when the dynamic allocator requests more of them?

It's a 12-node EMR cluster with more than a TB of memory.



On Fri, Sep 23, 2016 at 5:12 PM, Aditya 
wrote:

> Hi Yash,
>
> What is your total cluster memory and number of cores?
> Problem might be with the number of executors you are allocating. The logs
> shows it as 168510 which is on very high side. Try reducing your executors.
>
>
> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>
>> Hi All,
>> I have a spark job which runs over a huge bulk of data with Dynamic
>> allocation enabled.
>> The job takes some 15 minutes to start up and fails as soon as it starts*.
>>
>> Is there anything I can check to debug this problem. There is not a lot
>> of information in logs for the exact cause but here is some snapshot below.
>>
>> Thanks All.
>>
>> * - by starts I mean when it shows something on the spark web ui, before
>> that its just blank page.
>>
>> Logs here -
>>
>> {code}
>> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter
>> thread with (heartbeat : 3000, initial allocation : 200) intervals
>> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number of
>> 168510 executor(s).
>> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
>> containers, each with 2 cores and 6758 MB memory including 614 MB overhead
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 22
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 19
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 18
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 12
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 11
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 20
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 15
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 7
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 8
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 16
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 21
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 6
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 13
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 14
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 9
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 3
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 17
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 1
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 10
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 4
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 2
>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>> non-existent executor 5
>> 16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1 time(s)
>> in a row.
>> java.lang.StackOverflowError
>> at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.
>> apply(MapLike.scala:245)
>> at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.
>> apply(MapLike.scala:245)
>> at scala.collection.TraversableLike$WithFilter$$anonfun$
>> foreach$1.apply(TraversableLike.scala:772)
>> at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.
>> apply(MapLike.scala:245)
>> at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.
>> apply(MapLike.scala:245)
>> at scala.collection.TraversableLike$WithFilter$$anonfun$
>> foreach$1.apply(TraversableLike.scala:772)
>> at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.
>> apply(MapLike.scala:245)
>> at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.
>> apply(MapLike.scala:245)
>> at scala.collection.TraversableLike$WithFilter$$anonfun$
>> foreach$1.apply(TraversableLike.scala:772)
>> at 

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Aditya

Hi Yash,

What is your total cluster memory and number of cores?
The problem might be the number of executors you are allocating. The 
logs show it as 168510, which is on the very high side. Try reducing your 
executors.


On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:

Hi All,
I have a Spark job which runs over a huge bulk of data with dynamic 
allocation enabled.

The job takes some 15 minutes to start up, then fails as soon as it starts*.

Is there anything I can check to debug this problem? There is not a 
lot of information in the logs about the exact cause, but here is a 
snapshot below.


Thanks all.

* - by "starts" I mean when something shows on the Spark web UI; 
before that it's just a blank page.


Logs here -

{code}
16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter 
thread with (heartbeat : 3000, initial allocation : 200) intervals
16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number 
of 168510 executor(s).
16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor 
containers, each with 2 cores and 6758 MB memory including 614 MB overhead
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 22
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 19
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 18
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 12
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 11
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 20
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 15
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 7
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 8
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 16
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 21
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 6
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 13
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 14
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 9
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 3
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 17
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 1
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 10
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 4
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 2
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for 
non-existent executor 5
16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1 
time(s) in a row.

java.lang.StackOverflowError
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)

{code}

... 

{code}
16/09/23 06:33:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: 
Attempted to get executor loss reason for executor id 7 at RPC address 
, but got no response. Marking as slave lost.
org.apache.spark.SparkException: Fail to find loss reason for 
non-existent executor 7
at 
org.apache.spark.deploy.yarn.YarnAllocator.enqueueGetLossReasonRequest(YarnAllocator.scala:554)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint$$anonfun$receiveAndReply$1.applyOrElse(ApplicationMaster.scala:632)
at