[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them

2017-12-18 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295282#comment-16295282
 ] 

Nan Zhu commented on SPARK-21656:
-

NOTE: the issue was fixed by https://github.com/apache/spark/pull/18874

> spark dynamic allocation should not idle timeout executors when there are 
> enough tasks to run on them
> -
>
> Key: SPARK-21656
> URL: https://issues.apache.org/jira/browse/SPARK-21656
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Jong Yoon Lee
>Assignee: Jong Yoon Lee
> Fix For: 2.2.1, 2.3.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Right now with dynamic allocation, Spark starts by getting the number of 
> executors it needs to run all the tasks in parallel (or the configured 
> maximum) for that stage.  After it gets that number it will never reacquire 
> more unless an executor dies, is explicitly killed by YARN, or it goes 
> to the next stage.  The dynamic allocation manager has the concept of an idle 
> timeout: if a task hasn't been scheduled on an executor 
> for a configurable amount of time (60 seconds by default), then let that 
> executor go.  Note that when it lets an executor go due to the idle timeout it 
> never goes back to see if it should reacquire more.
> This is a problem for multiple reasons:
> 1. Unexpected things can happen in the system that cause delays, and Spark 
> should be resilient to them. If the driver is GC'ing, there are 
> network delays, etc., we could idle-timeout executors even though there are 
> tasks to run on them; it's just that the scheduler hasn't had time to start 
> those tasks.  Note that in the worst case this allows the number of executors 
> to go to 0 and we have a deadlock.
> 2. Internal Spark components have opposing requirements. The scheduler tries 
> to get locality; dynamic allocation doesn't know about 
> this, and if it lets the executors go it keeps the scheduler from doing what 
> it was designed to do.  For example, the scheduler first tries to schedule 
> node-local, and during this time it can skip scheduling on some executors.  
> After a while the scheduler falls back from node-local to scheduling 
> rack-local, and then eventually on any node.  So while the scheduler is 
> doing node-local scheduling, the other executors can idle timeout.  This 
> means that when the scheduler does fall back to rack or any locality where it 
> would have used those executors, we have already let them go, and it can't 
> schedule all the tasks it could, which can have a huge negative impact on job 
> run time.
>  
> In both of these cases, when the executors idle timeout we never go back to 
> check whether we need more executors (until the next stage starts).  In the 
> worst case you end up with 0 and deadlock, but generally this shows itself by 
> just going down to very few executors when you could have tens of thousands 
> of tasks to run on them, which causes the job to take far more time (in my 
> case I've seen a job that should take minutes take hours due to being left 
> only a few executors).  
> We should handle these situations in Spark.  The most straightforward 
> approach would be to not allow executors to idle timeout when there are 
> tasks that could run on them. This would allow the scheduler to do 
> its job with locality scheduling.  It also fixes number 1 above, 
> because you can never go into a deadlock: enough executors are kept to 
> run all the tasks. 
> There are other approaches to fix this, like explicitly preventing it from 
> going to 0 executors; that prevents a deadlock but can still slow the job 
> down greatly.  We could also change it at some point to just re-check whether 
> we should get more executors, but this adds extra logic (we would have 
> to decide when to check), and it's also just overhead to let them go and then 
> re-acquire them again, which would slow the job down because the 
> executors aren't immediately there for the scheduler to place things on. 
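
For context, a minimal sketch of the configuration knobs this description
refers to, assuming a plain SparkSession-based job with dynamic allocation on
YARN; the setup and app name are illustrative, not the reporter's actual job:

    import org.apache.spark.sql.SparkSession

    // Minimal sketch of the settings involved (values are the stock defaults
    // discussed above, not a recommendation; the app name is hypothetical).
    val spark = SparkSession.builder()
      .appName("dynamic-allocation-sketch")
      .config("spark.dynamicAllocation.enabled", "true")            // turn dynamic allocation on
      .config("spark.shuffle.service.enabled", "true")              // needed so executors can be released safely
      .config("spark.dynamicAllocation.executorIdleTimeout", "60s") // the idle timeout described above (default 60s)
      .config("spark.dynamicAllocation.minExecutors", "0")          // default floor of 0 is what allows the deadlock in point 1
      .getOrCreate()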






[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them

2017-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123443#comment-16123443
 ] 

Sean Owen commented on SPARK-21656:
---

In the end I still don't quite agree with how you frame it here. It's making 
some jobs use more resources to let _other_ jobs move faster when bumping up 
against timeout limits. Accepting the downside of this change just so that 
someone doesn't have to tune their job is not compelling, so I'd discard that 
argument. It is compelling to solve the "busy driver" and "0 executor" 
problems. I'd have preferred to frame it that way from the get-go. This 
discussion isn't going to get farther, and agreeing on an outcome but 
disagreeing about why is close enough.


[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them

2017-08-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123439#comment-16123439
 ] 

Thomas Graves commented on SPARK-21656:
---

Note, I've never said there is no counterpart scenario. If you read what I 
said in the PR you will see that:

| It doesn't hurt the common case; the common case is that all your executors 
have tasks on them as long as there are tasks to run. Normally the scheduler 
can fill up the executors. It will use more resources if the scheduler takes 
time to put tasks on them, but weighing that against the time wasted in jobs 
that don't have enough executors to run on is hard to quantify, because it's 
going to be so application dependent. Yes, it is a behavior change, but a 
behavior change that is fixing an issue.



[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them

2017-08-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123435#comment-16123435
 ] 

Thomas Graves commented on SPARK-21656:
---

Yes, there is a trade-off here: use some more resources, or have your job run 
time be really slow and possibly deadlock. I completely understand the 
scenario where some executors may stay up when they aren't being used; if you 
have a better solution that handles both, please state it.  As I've stated, 
changing config is to me a workaround and not a solution. This case is handled 
by many other big data frameworks (Pig, Tez, MapReduce) and I believe Spark 
should handle it as well.

I would much rather lean towards having as many jobs run as fast as possible, 
without the user having to tune things, even at the expense of possibly using 
more resources.  I've described 2 scenarios in which this problem can occur; 
there is also the extreme case where it goes to 0 that you keep mentioning. 
The fix provided addresses both of them.
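
For reference, the configuration-only workaround being dismissed here would
look roughly like the following sketch; the specific values are assumptions,
not recommendations:

    import org.apache.spark.SparkConf

    // Config-only mitigation (a sketch; the numbers are illustrative
    // assumptions): keep a floor of executors so the pool can never drain to
    // 0, and let idle executors linger longer. This avoids the deadlock but
    // not the slowdown described in the JIRA, and it holds resources even
    // when they really are idle.
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "10")            // never release below 10 executors
      .set("spark.dynamicAllocation.executorIdleTimeout", "300s")   // wait longer before releasing an idle executor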






[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them

2017-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123411#comment-16123411
 ] 

Sean Owen commented on SPARK-21656:
---

I'm referring to the same issue you cite repeatedly, including:
https://github.com/apache/spark/pull/18874#issuecomment-321313616
https://issues.apache.org/jira/browse/SPARK-21656?focusedCommentId=16117200=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16117200

Something like a driver busy in long GC pauses doesn't keep up with the fact 
that executors are non-idle, and removes them. Its conclusion is incorrect, 
and that's what we're trying to fix. All the more because going to 0 executors 
stops the stage.

Right? I thought we finally had it clear that this was the problem being fixed.

Now you're just describing a job that needs a lower locality timeout. (Or 
else, describing a different problem with a different solution, as in 
https://github.com/apache/spark/pull/18874#issuecomment-321625808 -- why do 
they take so much longer than 3s to fall back to other executors?) That 
scenario is not a reason to make this change.

[~tgraves] please read 
https://github.com/apache/spark/pull/18874#issuecomment-321683515 . You're 
saying there's no counterpart scenario that is harmed at all by this change, 
and I think there is. We need to get on the same page.
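
The tuning alternative referred to here (a lower locality timeout so the
scheduler falls back from node-local placement sooner) would look roughly like
this sketch; the value chosen is an assumption:

    import org.apache.spark.SparkConf

    // Lower the delay-scheduling wait (default 3s per locality level) so the
    // scheduler gives up on node-local placement sooner and hands tasks to
    // rack-local / any-host executors before they can hit the idle timeout.
    val conf = new SparkConf()
      .set("spark.locality.wait", "1s")   // value is an assumption; per-level overrides
                                          // such as spark.locality.wait.node also exist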




[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them

2017-08-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123393#comment-16123393
 ] 

Thomas Graves commented on SPARK-21656:
---

I don't know what you mean by busy driver.  The example of the test results 
shows this is fixing the issue.  The issue is as I've described in the 
description of the JIRA.  In this case it's due to the scheduler and the fact 
that it doesn't immediately use the executors because of the locality 
settings; as long as you keep those executors around (don't idle-timeout them) 
they do get used, and that has a huge impact on the run time.  The executors 
only eventually get tasks because of the scheduler's locality delay.  

I don't know what you mean by the flip-side of the situation and how this gets 
worse.

If you want something to compare to, go see how other frameworks do this same 
thing, Tez for instance. This fix changes it so it acts very similarly to 
those.




[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them

2017-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123373#comment-16123373
 ] 

Sean Owen commented on SPARK-21656:
---

Is this the 'busy driver' scenario that the PR contemplates? If not, then this 
may be true, but it's not the motivation of the PR, right? This is just a case 
where you need a shorter locality timeout, or something. It's also not the 
0-executor scenario that is the motivation of the PR either.

If this is the 'busy driver' scenario, then I also wonder what happens if you 
increase the locality timeout. That was one unfinished thread in the PR 
discussion: why do the other executors get tasks only so very eventually?

I want to stay clear on what we're helping here, and also what the cost is: 
see the flip-side to this situation described in the PR, which could get worse.


[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them

2017-08-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123358#comment-16123358
 ] 

Thomas Graves commented on SPARK-21656:
---

Example of test results with this:

We have a production job running 21600 tasks.  With default locality the job 
takes 3.1 hours due to this issue. With the fix proposed in the pull request 
the job takes 17 minutes.  The fix does use more resources, but every executor 
eventually has multiple tasks run on it, demonstrating that if we hold on to 
them for a while the scheduler will fall back and use them. 




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org