[ https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16117200#comment-16117200 ]

Thomas Graves commented on SPARK-21656:
---------------------------------------

Why not fix the bug in dynamic allocation? Changing configs is a workaround, 
and like everything else it leaves you asking what the best configs are for 
everyone's job.

Dynamic allocation is supposed to get you enough executors to run all your 
tasks in parallel (up to your config limits). This is not allowing that, and 
it's code within Spark that is doing it, not user code. Thus it's a bug in my 
opinion.
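
To be concrete, these are the main configs in play here (a sketch only; the 
config keys are real Spark settings, but the values are illustrative, not 
recommendations):

  // Sketch: real config keys, made-up values.
  val conf = new org.apache.spark.SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.dynamicAllocation.minExecutors", "0")
    .set("spark.dynamicAllocation.maxExecutors", "500")
    // executors are released after being idle this long (default 60s)
    .set("spark.dynamicAllocation.executorIdleTimeout", "60s")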

The documentation even hints at it. The problem is we just didn't catch this 
issue in the initial code.

From:
http://spark.apache.org/docs/2.2.0/job-scheduling.html#remove-policy

"in that an executor should not be idle if there are still pending tasks to be 
scheduled"

One other option here would be to actually let the executors go and acquire 
new ones. This may or may not help, depending on whether we can get ones with 
better locality; it might also just waste time releasing and reacquiring.

I personally would also be OK with changing the locality wait for node to 0, 
which generally works around the problem, but I think this could happen in 
other cases and we should fix this bug too. For instance, say your driver does 
a full GC and can't schedule anything within 60 seconds: you lose those 
executors and we never get them back. Or what if you have temporary network 
congestion that your network timeout is plenty big enough to tolerate? You 
could still hit the idle timeout. Yes, we could increase the idle timeout, but 
in the normal working case the idle timeout is meant for when you don't have 
any tasks left to run on an executor because your stage has completed enough 
that you can release some. This is not that case.
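
For anyone who wants that workaround in the meantime, it would look roughly 
like this (a sketch only; 0 for the node wait is the value mentioned above, 
and the idle timeout bump is just an example, not a recommendation):

  // Workaround sketch, not a fix: stop waiting for node locality entirely
  val workaround = new org.apache.spark.SparkConf()
    .set("spark.locality.wait.node", "0s")
    // and/or keep idle executors around longer than the 60s default
    .set("spark.dynamicAllocation.executorIdleTimeout", "300s")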

> spark dynamic allocation should not idle timeout executors when tasks still 
> to run
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-21656
>                 URL: https://issues.apache.org/jira/browse/SPARK-21656
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.1
>            Reporter: Jong Yoon Lee
>             Fix For: 2.1.1
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Right now spark lets go of executors when they are idle for 60s (or a 
> configurable time). I have seen spark let them go when they were idle but 
> were really needed. I have seen this issue when the scheduler was waiting to 
> get node locality, but that takes longer than the default idle timeout. In 
> these jobs the number of executors goes down to a really small number (fewer 
> than 10) but there are still something like 80,000 tasks to run.
> We should consider not allowing executors to idle timeout if they are still 
> needed according to the number of tasks to be run.
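
A rough sketch of the kind of guard being proposed (this is not the actual 
ExecutorAllocationManager code; the names and the tasks-per-executor math are 
placeholders for illustration):

  // Hypothetical check: only allow the idle timeout to remove an executor
  // when the remaining executors can still cover every pending task.
  def canRemoveIdleExecutor(pendingTasks: Int,
                            runningExecutors: Int,
                            tasksPerExecutor: Int): Boolean = {
    val executorsStillNeeded =
      math.ceil(pendingTasks.toDouble / tasksPerExecutor).toInt
    runningExecutors > executorsStillNeeded
  }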



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
