[GitHub] [spark] bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait time be the time since a TSM's available slots were fully utilized

2020-01-14 Thread GitBox
bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait 
time be the time since a TSM's available slots were fully utilized
URL: https://github.com/apache/spark/pull/26696#issuecomment-574471102
 
 
    


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait time be the time since a TSM's available slots were fully utilized

2020-01-13 Thread GitBox
bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait 
time be the time since a TSM's available slots were fully utilized
URL: https://github.com/apache/spark/pull/26696#issuecomment-574008231
 
 
   > Is this the key point? 
   
   yes
   
   > I'm not familiar with the resource offer part. 
   
   I listed all the different cases in which all resource offers and single 
resource offers are made in one of my comments above. Hope that helps.
   
   > When an offer is not "all resource", we never reset the timer?
   
   You can reset the timer on a single resource offer if there have been no 
rejects since the last all resource offer.
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait time be the time since a TSM's available slots were fully utilized

2020-01-12 Thread GitBox
bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait 
time be the time since a TSM's available slots were fully utilized
URL: https://github.com/apache/spark/pull/26696#issuecomment-573523899
 
 
   hope you had a good weekend. any new thoughts?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait time be the time since a TSM's available slots were fully utilized

2020-01-06 Thread GitBox
bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait 
time be the time since a TSM's available slots were fully utilized
URL: https://github.com/apache/spark/pull/26696#issuecomment-571476226
 
 
   Thanks for taking a look!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait time be the time since a TSM's available slots were fully utilized

2020-01-06 Thread GitBox
bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait 
time be the time since a TSM's available slots were fully utilized
URL: https://github.com/apache/spark/pull/26696#issuecomment-571476099
 
 
   > ah it's an event sequence. sorry I misread it.
   Sorry if it wasn't clear. I relabeled it as an event sequence.
   
   > 
   > I'd like to see the high-level description as well. What this PR does is 
pretty clear: if a TSM doesn't fully utilize its slots because of delay 
scheduling, fall back to the next locality level after "locality wait" time.
   > 
   > It's unclear to me how the new proposal implements it.
   
   Hopefully my description above cleared it up. But lemme know if I am missing 
something or if it needs more explanation.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait time be the time since a TSM's available slots were fully utilized

2020-01-06 Thread GitBox
bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait 
time be the time since a TSM's available slots were fully utilized
URL: https://github.com/apache/spark/pull/26696#issuecomment-571475603
 
 
   >  I think it would be useful to add in a high level description of what 
exactly the approach is trying to do
   
   I agree. We should all have a good idea of what the goal is here.
   I understand the high level goal to be to only let delay scheduling be 
**delaying** the taskset for the period determined by locality wait. Delaying 
in this case means not scheduling a task, when it would have, if were not for 
delay scheduling (indicated by the new boolean return by TSM). We know there 
has been no **delay** when there have been no rejects since and including an 
"all resource offer". 
   I am now refraining from using the term "all resources utilized, " since 
there are other reasons they may not be fully utilized, such as blacklisting, 
which would have caused problems with my previous approach.
   
   
   > Please correct me if I'm wrong, but I think essentially at a high level 
you are proposing to only reset the timer whenever all the currently free 
resources are scheduled on. So if you start out with 4 executors, you don't 
reset the timer until all of them have been scheduled. At that point you reset 
the timer. 
   
   It is not strictly required that all 4 executors be utilized. The new 
approach directly measures effects due to delay scheduling. As long as there 
are no rejects due to delay scheduling the timer is reset.
   
   > Which means that when a running task finishes and you go to start a new 
one or a new executor comes in, it has to wait the timeout.
   
   The timer is per TSM, not per task, so a single task doesn't necessarily 
have to wait when another one completes. The same applies to executors.
   As long as the TSM is not rejecting resource offers due to delay scheduling, 
the timer is reset. 
   
   > And for the fair scheduler case, the way you are passing the flags in 
handles it resetting the timer if it has used its fair share?
   
   New flags are not passed it. New flags are returned by the TSM saying 
whether it rejected an offer due to delay scheduling.
   Fair or FIFO scheduling are handled the same way in the new approach. the 
scheduling determines what resources are offered to which TSMs, and the TSMs 
return whether they were "delayed."
   
   I believe another interesting bug which exists in master code (timer reset 
on task launched), is that assume because of FAIR scheduling (or any other 
reason) a TSM cannot launch tasks for the locality wait period. currently that 
TSM would be penalized and would increase its locality level.
   With the new approach I linked: there would be an all resource offer and 
that TSM would be offered 0 resources, but it also wouldn't reject any 
resources due to delay scheduling, so the timer would be reset.
   
   > 
   > Also just to clarifying a few things:
   > 
   > > offer 1 resource, no reject - timer is reset
   > 
   > so the assumption here is that we are out of all resources because we 
either had 1 task finish or we added a new executor and if nothing was rejected 
we scheduled that fully and we were fully scheduled before this offer? This is 
why you have the last line : offer 1 resource, no reject - no timer reset 
because previous offer was rejected -> since previous offer was rejected there 
are still free resources so we don't reset.
   Whatever assumptions or circumstances caused the case, the fact is that the 
TSM is not being delayed due to delay scheduling in that case, so the timer 
should be reset.
   
   When an "all resource offer" has no rejects, we know there is no delay, so 
we can reset the timer.
   For single offers, we only know there has been no delay if there hasn't been 
any delay since the last all resource offer.
   
   
   > So I assume by this you mean the startup case, but I'm not sure that is 
true. You get an "all free resource" case when you first submitTasks.
   I think there are 2 cases - static allocation and dynamic allocation. 
Generally with static you will get your executors before you start any 
application code, so it won't matter if it makes offers before that. With 
dynamic allocation generally you won't have any executors so this perhaps is 
the case on submitTasks you offer all but there are no offers because no 
executors yet. Which case are you referring to?
   
   No, I believe the approach doesn't matter whether executors are statically 
or dynamically allocated.
   The case I am referring to is: imagine you have 2 resources and an "all 
resource offer" is scheduled every second. when TSM1 is submitted, it'll also 
get an "all resource offer", and assume it rejects both, causing a prexisting 
TSM2 to utilize them. Assume those 2 tasks finish, and the freed resources are 
offered one by one to TSM1, which accepts both, all within 1 second (before 

[GitHub] [spark] bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait time be the time since a TSM's available slots were fully utilized

2020-01-06 Thread GitBox
bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait 
time be the time since a TSM's available slots were fully utilized
URL: https://github.com/apache/spark/pull/26696#issuecomment-571454805
 
 
   > > offer 1 resource that was rejected - no timer reset
   > > offer all resources with no rejects - timer is reset
   > > offer 1 resource, no reject - timer is reset
   > > offer 1 resource that was rejected - no timer reset
   > > offer 1 resource, no reject - no timer reset because previous offer was 
rejected
   > 
   > There are 4 "offer 1 resource", typo?
   
   no. Did I make a mistake?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait time be the time since a TSM's available slots were fully utilized

2020-01-03 Thread GitBox
bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait 
time be the time since a TSM's available slots were fully utilized
URL: https://github.com/apache/spark/pull/26696#issuecomment-570759691
 
 
   happy new year folks   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait time be the time since a TSM's available slots were fully utilized

2019-12-23 Thread GitBox
bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait 
time be the time since a TSM's available slots were fully utilized
URL: https://github.com/apache/spark/pull/26696#issuecomment-568617329
 
 
   [Here is an 
approach](https://github.com/apache/spark/compare/master...bmarcott:nmarcott-fulfill-slots-2?expand=1)
 which avoids trying to simulate any type of scheduling logic.
   This change resets locality timers only when no resources have been rejected 
due to delay scheduling.
   
   The complication in this change is that resetting the locality timer is 
dependent on when `TaskSchedulerImpl.resourceOffers` is called with all 
resources, rather than a single resource.
   That is guaranteed to happen only as frequent as 
`spark.scheduler.revive.interval` which defaults to 1s. This means there would 
be problems if `locality.wait.time` is smaller than the revive interval.
   Ways around this would be to either make the revive interval at least as 
small as locality wait or to always offer all free resources on calls to 
`TaskSchedulerImpl.resourceOffers`. We could also restrict how low you could 
make locality wait timeout...
   Thoughts or know of any other issues with this approach?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait time be the time since a TSM's available slots were fully utilized

2019-12-16 Thread GitBox
bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait 
time be the time since a TSM's available slots were fully utilized
URL: https://github.com/apache/spark/pull/26696#issuecomment-566398156
 
 
   back from vacation and updated the PR trying to address all the comments so 
far.
   let me know if I left one of your comments out.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait time be the time since a TSM's available slots were fully utilized

2019-12-10 Thread GitBox
bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait 
time be the time since a TSM's available slots were fully utilized
URL: https://github.com/apache/spark/pull/26696#issuecomment-563972952
 
 
   @squito 
   thanks a lot doing a thorough review of the tests. 
   I'll add more test cases and comments as suggested


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait time be the time since a TSM's available slots were fully utilized

2019-12-06 Thread GitBox
bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait 
time be the time since a TSM's available slots were fully utilized
URL: https://github.com/apache/spark/pull/26696#issuecomment-562798874
 
 
   
   >  arguably you should never reset back to a lower level. 
   
   Yea I was thinking the same thing, but I decided not to touch that part yet, 
since I have mostly thought about **when** to reset.
   
   
   
   > Note on the fairscheduler side - I think you do have the issue Kay brought 
up in v2 of comment: 
https://issues.apache.org/jira/browse/SPARK-18886?focusedCommentId=15931009=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15931009
   
   I do not believe this PR has that issue, as once the non-greedy taskset has 
waited its locality wait time it will increase locality level since it wasn't 
able to take advantage of its slots. Her proposed v1 solution had that problem 
since it only increased locality level if there were slots not being used by 
**any job**.
   
   > I assume it fixes the performance on #26633? thanks for adding more unit 
tests, did run any manual tests as well?
   
   I haven't looked in depth at their problem, but it does seem it would solve 
most of the problem.
   I have not yet had the chance to run manual tests. 
   
   > so looking at the fairscheduler side of this it brings me back to the real 
question of what is the definition of a task wait. If you look at Kay's example:
   
   I did not catch your point from the comment that started with above quote. 
This PR handles the case she mentioned since locality level would not increase 
because it is utilizing all its available slots (timers will keep getting 
reset). That means when the taskset is offered the new non-local resource, it 
wouldn't immediately take advantage of it
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait time be the time since a TSM's available slots were fully utilized

2019-12-06 Thread GitBox
bmarcott commented on issue #26696: [WIP][SPARK-18886][CORE] Make locality wait 
time be the time since a TSM's available slots were fully utilized
URL: https://github.com/apache/spark/pull/26696#issuecomment-562780021
 
 
   hi folks, thanks for leaving more comments.
   I am currently on vacation until the 15th of Dec, so will likely not be able 
to respond.
   Once I get back I'll address comments.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org