[GitHub] [spark] Ngone51 commented on pull request #30650: [SPARK-24818][CORE] Support delay scheduling for barrier execution

2021-02-19 Thread GitBox


Ngone51 commented on pull request #30650:
URL: https://github.com/apache/spark/pull/30650#issuecomment-782547345


   thanks for the help @tgravescs @mridulm 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on pull request #30650: [SPARK-24818][CORE] Support delay scheduling for barrier execution

2021-02-18 Thread GitBox


Ngone51 commented on pull request #30650:
URL: https://github.com/apache/spark/pull/30650#issuecomment-781180966


   Addressed the last comment from @tgravescs. Please take another look, thanks!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on pull request #30650: [SPARK-24818][CORE] Support delay scheduling for barrier execution

2021-02-08 Thread GitBox


Ngone51 commented on pull request #30650:
URL: https://github.com/apache/spark/pull/30650#issuecomment-775643884


   Oh I mean, for the behavior of disabling legacy delay scheduling for the 
barrier stage.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on pull request #30650: [SPARK-24818][CORE] Support delay scheduling for barrier execution

2021-02-07 Thread GitBox


Ngone51 commented on pull request #30650:
URL: https://github.com/apache/spark/pull/30650#issuecomment-774876158


   cc @mridulm @tgravescs Please take another look when you're available:)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on pull request #30650: [SPARK-24818][CORE] Support delay scheduling for barrier execution

2021-02-05 Thread GitBox


Ngone51 commented on pull request #30650:
URL: https://github.com/apache/spark/pull/30650#issuecomment-773860013


   That sounds great to me. I've addressed it in 555b3a8 with a unit test. 
Please take another look.
   
   BTW, do you think we need a conf to be able to restore the original behavior?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on pull request #30650: [SPARK-24818][CORE] Support delay scheduling for barrier execution

2021-02-05 Thread GitBox


Ngone51 commented on pull request #30650:
URL: https://github.com/apache/spark/pull/30650#issuecomment-773860013


   That sounds great to me. I've addressed it in 555b3a8 with a unit test. 
Please take another look.
   
   BTW, do you think we need a conf to be able to restore the original behavior?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on pull request #30650: [SPARK-24818][CORE] Support delay scheduling for barrier execution

2021-02-02 Thread GitBox


Ngone51 commented on pull request #30650:
URL: https://github.com/apache/spark/pull/30650#issuecomment-771326460


   > If there is insufficient resources to run the entire taskset at (some) 
highest locality level, and there is atleast one task that can be launched at 
that level - we will end up always failing the barrier taksset (unless delay is 
set to 0) ?
   
   Yes
   
   > If yes, as a follow up, do we want to make delay as 0 for barrier tasksets 
?
   
   Before this PR, we always recommend users disable delay scheduling by 
setting delay to 0 as a workaround in the error message. And after this PR, we 
recommend users use the non-legacy delay scheduling. Unless there's a strong 
reason that users must use legacy delay scheduling, I personally don't have a 
strong feeling to have the followup since the non-legacy delay scheduling is 
enabled by default too. But yes, set delay to 0 would let legacy way work 
anyway. Although I think this PR can't benefit it in that case since we'll 
never have a chance to revert. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on pull request #30650: [SPARK-24818][CORE] Support delay scheduling for barrier execution

2021-02-01 Thread GitBox


Ngone51 commented on pull request #30650:
URL: https://github.com/apache/spark/pull/30650#issuecomment-771326460


   > If there is insufficient resources to run the entire taskset at (some) 
highest locality level, and there is atleast one task that can be launched at 
that level - we will end up always failing the barrier taksset (unless delay is 
set to 0) ?
   
   Yes
   
   > If yes, as a follow up, do we want to make delay as 0 for barrier tasksets 
?
   
   Before this PR, we always recommend users disable delay scheduling by 
setting delay to 0 as a workaround in the error message. And after this PR, we 
recommend users use the non-legacy delay scheduling. Unless there's a strong 
reason that users must use legacy delay scheduling, I personally don't have a 
strong feeling to have the followup since the non-legacy delay scheduling is 
enabled by default too. But yes, set delay to 0 would let legacy way work 
anyway. Although I think this PR can't benefit it in that case since we'll 
never have a chance to revert. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on pull request #30650: [SPARK-24818][CORE] Support delay scheduling for barrier execution

2021-01-28 Thread GitBox


Ngone51 commented on pull request #30650:
URL: https://github.com/apache/spark/pull/30650#issuecomment-769587926


   retest this please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on pull request #30650: [SPARK-24818][CORE] Support delay scheduling for barrier execution

2021-01-28 Thread GitBox


Ngone51 commented on pull request #30650:
URL: https://github.com/apache/spark/pull/30650#issuecomment-769587860


   kindly ping @mridulm @tgravescs



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on pull request #30650: [SPARK-24818][CORE] Support delay scheduling for barrier execution

2021-01-26 Thread GitBox


Ngone51 commented on pull request #30650:
URL: https://github.com/apache/spark/pull/30650#issuecomment-767433788


   Hi @mridulm  @tgravescs sorry for the delay.
   
   After times thinking, I think we should just keep the original behavior for 
the barrier taskset with the legacy delay scheduling. That means we should 
still abort the taskset and throw an exception when tasks are partially 
launched in that case. 
   
   Think about a case under the legacy delay scheduling, saying we have 2 tasks 
for barrier taskset and one task prefers executor-0 and another task has not 
preferred locations. On the other hand, we only have the resources (executor-0, 
host-0), (executor-1, host-1) for each resourceOffers. Then, within each 
resourceOffers, one task can always get scheduled at executor-0 first and 
**reset** the timer and current locality to PROCESS_LOCAL. And then, of course, 
another task can get scheduled at PROCESS_LOCAL. And if we try the next 
resourceOffer, we still can not launch the whole taskset since we'd start from 
the locality PROCESS_LOCAL again. Therefore, we'd never have a chance to get 
the barrier taskset launched.
   
   Non-legacy delay scheduling doesn't have this issue because it only resets 
when all tasks get launched in a single resourceOffer round. That means the 
locality level will go up (from local to ANY) as time goes by until we launched 
the taskset successfully. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on pull request #30650: [SPARK-24818][CORE] Support delay scheduling for barrier execution

2021-01-11 Thread GitBox


Ngone51 commented on pull request #30650:
URL: https://github.com/apache/spark/pull/30650#issuecomment-757960531


   Wait, I just realize that the test "SPARK-24818: test resource revert of 
barrier TaskSetManager" can not pass with legacy delay scheduling. It does have 
the issue with legacy delay scheduling. I'm working on it.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on pull request #30650: [SPARK-24818][CORE] Support delay scheduling for barrier execution

2021-01-09 Thread GitBox


Ngone51 commented on pull request #30650:
URL: https://github.com/apache/spark/pull/30650#issuecomment-757337359







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on pull request #30650: [SPARK-24818][CORE] Support delay scheduling for barrier execution

2021-01-09 Thread GitBox


Ngone51 commented on pull request #30650:
URL: https://github.com/apache/spark/pull/30650#issuecomment-757337593


   @tgravescs @mridulm  I have added the log.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on pull request #30650: [SPARK-24818][CORE] Support delay scheduling for barrier execution

2021-01-09 Thread GitBox


Ngone51 commented on pull request #30650:
URL: https://github.com/apache/spark/pull/30650#issuecomment-757337359


   > any subtle dynamic resource allocation interactions though ?
   
   Barrier execution currently doesn't support dynamic resource allocation. So 
it's not a problem.
   
   > Are there any challenges/issues with doing the locality reset ?
   
   Hmm..I assume that you mean reset the locality during the time we reverting 
the resources? If so, I had reset the locality index to 0 during reverting. But 
I don't reset the time. So, it eventually gets launched since locality level 
would go up (from PROCESS_LOCAL to ANY) as time goes by.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on pull request #30650: [SPARK-24818][CORE] Support delay scheduling for barrier execution

2021-01-06 Thread GitBox


Ngone51 commented on pull request #30650:
URL: https://github.com/apache/spark/pull/30650#issuecomment-755867630


   Thanks for the review @tgravescs 
   
   > The only thing is now we could hang infinitely if we never fill the slots, 
correct? I guess we could hang before if we didn't ever get enough slots as it 
only info logs at that point.
   
   Theoretically, the barrier tsm should get scheduled at the end since we have 
ensured enough slots (via calculateAvailableSlots) before scheduling it. But 
I'm not sure now is that, maybe in legacy mode, the barrier tsm would hang 
because of level resetting? (like you mentioned above). I need to think more 
about this.
   
   > Perhaps we should at least add a debug log (if not info log) when we reset 
things to let the user know we have enough slots but something else must be 
using them.
   
   Good idea!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on pull request #30650: [SPARK-24818][CORE] Support delay scheduling for barrier execution

2020-12-17 Thread GitBox


Ngone51 commented on pull request #30650:
URL: https://github.com/apache/spark/pull/30650#issuecomment-747884574


   Also cc @tgravescs for more inputs.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on pull request #30650: [SPARK-24818][CORE] Support delay scheduling for barrier execution

2020-12-17 Thread GitBox


Ngone51 commented on pull request #30650:
URL: https://github.com/apache/spark/pull/30650#issuecomment-747884488


   Thank you very much @mridulm 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on pull request #30650: [SPARK-24818][CORE] Support delay scheduling for barrier execution

2020-12-11 Thread GitBox


Ngone51 commented on pull request #30650:
URL: https://github.com/apache/spark/pull/30650#issuecomment-743248529


   @mridulm Sorry for the confusion. I was confused at the beginning too and 
thought about creating a JIRA ticket. But I gave up since there's already a 
TODO ticket commented.
   
   I could probably update the JIRA description a little.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on pull request #30650: [SPARK-24818][CORE] Support delay scheduling for barrier execution

2020-12-07 Thread GitBox


Ngone51 commented on pull request #30650:
URL: https://github.com/apache/spark/pull/30650#issuecomment-739976799


   @jiangxb1987 Please help review, thanks!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org