[ https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780101#comment-16780101 ]

Wilfred Spiegelenburg edited comment on YARN-9278 at 3/12/19 12:25 PM:
-----------------------------------------------------------------------

Two things:
* I still think limiting the number of nodes is something we need to approach with care.
* Randomising a 10,000-entry list each time we pre-empt will also become expensive.
 
I was thinking more of something like this:
{code:java}
  int preEmptionBatchSize = conf.getPreEmptionBatchSize();
  List<FSSchedulerNode> potentialNodes =
      scheduler.getNodeTracker().getNodesByResourceName(rr.getResourceName());
  int size = potentialNodes.size();
  int stop = 0;
  int current = 0;
  // cut in at a random batch-aligned start point if the list is long
  if (size > preEmptionBatchSize) {
    Random rand = new Random();
    current = rand.nextInt(size / preEmptionBatchSize) * preEmptionBatchSize;
    stop = (current + preEmptionBatchSize) % size;
  }
  do {
    FSSchedulerNode mine = potentialNodes.get(current);
    // Identify the containers
    ....
    current++;
    // wrap at the end of the list
    if (current >= size) {
      current = 0;
    }
  } while (current != stop);
{code}

Pre-emption runs in a loop and we could be considering different applications one after the other. Shuffling that node list continually is not good from a performance perspective. A simple cut-in like the above gives the same kind of behaviour.
We could then still limit the number of "batches" we process. With some more smarts the stop condition could be based on having processed, for example, 10 * the batch size in nodes (a batch of nodes could be deemed equivalent to the number of nodes in a rack):
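For illustration, the cut-in loop can be exercised outside the scheduler. This is only a sketch: it assumes plain strings in place of FSSchedulerNode and a caller-supplied Random, and `visitBatch` is a hypothetical helper, not a scheduler method:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class CutInDemo {
  // Collect the nodes visited when cutting in at a random batch-aligned
  // start point and wrapping around, instead of shuffling the whole list.
  static List<String> visitBatch(List<String> nodes, int batchSize, Random rand) {
    List<String> visited = new ArrayList<>();
    int size = nodes.size();
    if (size == 0) {
      return visited;
    }
    int current = 0;
    int stop = 0;
    if (size > batchSize) {
      // pick a batch-aligned start somewhere in the list
      current = rand.nextInt(size / batchSize) * batchSize;
      stop = (current + batchSize) % size;
    }
    do {
      visited.add(nodes.get(current));
      current++;
      // wrap at the end of the list
      if (current >= size) {
        current = 0;
      }
    } while (current != stop);
    return visited;
  }

  public static void main(String[] args) {
    List<String> nodes = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      nodes.add("node-" + i);
    }
    // With 10 nodes and a batch size of 4, exactly one batch of 4 is scanned.
    System.out.println(visitBatch(nodes, 4, new Random()).size()); // 4
    // With fewer nodes than the batch size, the whole list is scanned once.
    System.out.println(visitBatch(nodes.subList(0, 3), 4, new Random()).size()); // 3
  }
}
```

Because the start point is batch-aligned, every batch is equally likely to be scanned first, which spreads the preemption pressure across nodes without paying for a shuffle.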
{code:java}
  stop = ((10 * preEmptionBatchSize) > size) ? current
      : ((current + 10 * preEmptionBatchSize) % size);
{code}

That gives a lot of flexibility and still a decent performance in a large 
cluster.
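The effect of that stop condition can be checked in isolation. This is a sketch under assumptions: `nodesVisited` is a hypothetical helper that only counts loop iterations, and the batch multiplier is a parameter rather than the hard-coded 10:

```java
import java.util.Random;

public class StopDemo {
  // Count how many nodes a wrap-around scan visits when the stop point is
  // placed (batches * batchSize) entries after a batch-aligned start.
  static int nodesVisited(int size, int batchSize, int batches, Random rand) {
    if (size == 0) {
      return 0;
    }
    int current = 0;
    int stop = 0;
    if (size > batchSize) {
      current = rand.nextInt(size / batchSize) * batchSize;
      int span = batches * batchSize;
      // if the span covers the whole list, stop where we started
      stop = (span >= size) ? current : ((current + span) % size);
    }
    int visited = 0;
    do {
      visited++;
      current = (current + 1) % size;
    } while (current != stop);
    return visited;
  }

  public static void main(String[] args) {
    // 10,000 nodes, batch of 100, 10 batches: only 1,000 nodes are scanned.
    System.out.println(nodesVisited(10_000, 100, 10, new Random())); // 1000
    // If 10 batches cover the whole list, every node is scanned exactly once.
    System.out.println(nodesVisited(500, 100, 10, new Random())); // 500
  }
}
```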



> Shuffle nodes when selecting to be preempted nodes
> --------------------------------------------------
>
>                 Key: YARN-9278
>                 URL: https://issues.apache.org/jira/browse/YARN-9278
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: fairscheduler
>            Reporter: Zhaohui Xin
>            Assignee: Zhaohui Xin
>            Priority: Major
>         Attachments: YARN-9278.001.patch
>
>
> We should *shuffle* the nodes to avoid some nodes being preempted frequently. 
> Also, we should *limit* the number of nodes to make preemption more efficient.
> Just like this,
> {code:java}
> // we should not iterate all nodes, that will be very slow
> long maxTryNodeNum = context.getPreemptionConfig().getToBePreemptedNodeMaxNumOnce();
> if (potentialNodes.size() > maxTryNodeNum) {
>   Collections.shuffle(potentialNodes);
>   List<FSSchedulerNode> newPotentialNodes = new ArrayList<FSSchedulerNode>();
>   for (int i = 0; i < maxTryNodeNum; i++) {
>     newPotentialNodes.add(potentialNodes.get(i));
>   }
>   potentialNodes = newPotentialNodes;
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
