[ 
https://issues.apache.org/jira/browse/HADOOP-5003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665797#action_12665797
 ] 

Vivek Ratan commented on HADOOP-5003:
-------------------------------------

Let me summarize the various points I'm making: 

# If you accept that reclaim SLA times can be smaller than 10 mins (which is 
the time it takes to detect that a TaskTracker (TT) is dead), starting timers 
when no queue is running over capacity is a waste. When the timer goes off, 
there is nothing to kill. A user will still see that their queue is not 
getting enough slots, whether or not you start a timer. What they can also see 
is that no other queue is running over capacity, and this should be a good 
enough indicator that some TT is down (of course, we can document this 
explanation to help them make sense of it). Starting a timer will not change 
what the user sees. 
# The SLA requirement currently states "...the system will guarantee that 
excess resources taken from an Org will be restored to it within N minutes of 
its need for them". I read that as starting a timer only when a queue has 
enough pending tasks AND some other queue has taken capacity from it. If you 
start a timer too early, you kill too early, and you lose the benefit of 
letting queues borrow capacity from others for N minutes. 
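
To make the second point concrete, here is a minimal sketch of the condition I have in mind for starting the reclaim timer. The class, field, and method names (ReclaimTimerSketch, QueueState, shouldStartReclaimTimer) are hypothetical and are not taken from the Capacity Scheduler code, and "enough pending tasks" is simplified to "at least one pending task while running below GC".

```java
import java.util.List;

// Hypothetical sketch; none of these names come from the Capacity Scheduler source.
class ReclaimTimerSketch {

    // Minimal, assumed view of a queue's state, counted in slots.
    static final class QueueState {
        final String name;
        final int guaranteedCapacity;
        final int runningTasks;
        final int pendingTasks;

        QueueState(String name, int guaranteedCapacity, int runningTasks, int pendingTasks) {
            this.name = name;
            this.guaranteedCapacity = guaranteedCapacity;
            this.runningTasks = runningTasks;
            this.pendingTasks = pendingTasks;
        }

        // The queue is using more slots than its guaranteed capacity.
        boolean isOverCapacity() {
            return runningTasks > guaranteedCapacity;
        }

        // Simplifying assumption: the queue "needs" capacity if it runs below
        // its GC and has at least one task waiting.
        boolean needsMoreCapacity() {
            return runningTasks < guaranteedCapacity && pendingTasks > 0;
        }
    }

    // Start the timer only if the queue needs capacity AND some other queue is
    // running over capacity, i.e. there would be something to kill.
    static boolean shouldStartReclaimTimer(QueueState queue, List<QueueState> allQueues) {
        if (!queue.needsMoreCapacity()) {
            return false;
        }
        for (QueueState other : allQueues) {
            if (other != queue && other.isOverCapacity()) {
                return true;
            }
        }
        return false;
    }
}
```

With this check, a starved queue in a cluster where nobody is over capacity (for example, because a TT is down) never starts a timer, which matches the first point above.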



> When computing absolute guaranteed capacity (GC) from a percent value, 
> Capacity Scheduler should round up floats, rather than truncate them.
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5003
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5003
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vivek Ratan
>            Priority: Minor
>
> The Capacity Scheduler calculates a queue's absolute GC value by getting its 
> percent of the total cluster capacity (which is a float, since the configured 
> GC% is a float) and casting it to an int. Casting a float to an int drops 
> the fractional part, so positive values are always rounded down. For very 
> small clusters, this can result in the GC of a queue being one lower than 
> what it should be. For example, if Q1 has a GC of 50%, Q2 has a GC of 40%, 
> and Q3 has a GC of 10%, and if the cluster capacity is 4 (as it is in our 
> test cases), Q1's GC works out to 2, Q2's to 1, and Q3's to 0 with today's 
> code. Q2's capacity should really be 2, since 40% of 4 is 1.6, which rounds 
> to 2. 
> A simple fix is to use Math.round() rather than a cast to int. 
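
The arithmetic in the description can be checked with a short sketch. The class and method names below (GcRounding, truncatedGc, roundedGc) are hypothetical and only illustrate the cast versus Math.round() difference; this is not the scheduler's actual code.

```java
// Hypothetical helper; names are illustrative, not from the Capacity Scheduler source.
class GcRounding {

    // Behaviour described in the issue: the cast drops the fractional part.
    static int truncatedGc(float gcPercent, int clusterCapacity) {
        return (int) (gcPercent * clusterCapacity / 100);
    }

    // Proposed fix: round to the nearest integer with Math.round(float).
    static int roundedGc(float gcPercent, int clusterCapacity) {
        return Math.round(gcPercent * clusterCapacity / 100);
    }

    public static void main(String[] args) {
        float[] percents = {50f, 40f, 10f};   // Q1, Q2, Q3 from the example
        int clusterCapacity = 4;
        for (int i = 0; i < percents.length; i++) {
            System.out.println("Q" + (i + 1)
                + ": cast=" + truncatedGc(percents[i], clusterCapacity)
                + ", round=" + roundedGc(percents[i], clusterCapacity));
        }
    }
}
```

Note that Math.round() rounds to the nearest integer rather than strictly up, so Q3 (10% of 4 = 0.4) still gets 0, and the three queues then add up to the cluster capacity of 4.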

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
