[
https://issues.apache.org/jira/browse/HADOOP-5003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665797#action_12665797
]
Vivek Ratan commented on HADOOP-5003:
-------------------------------------
Let me summarize the various points I'm making:
# If you accept that reclaim SLA times can be smaller than 10 minutes (which is
the time it takes to detect that a TT (TaskTracker) is dead), starting timers
when no queue is running over capacity is a waste. When the timer goes off,
there is nothing to kill. A user will still see that their queue is not getting
enough slots, irrespective of whether you start a timer or not. What they can
also see is that no other queue is running over capacity, and this should be a
good enough indicator that some TT is down (of course, we can document this
explanation to help them make sense of it). Starting a timer will not change
what the user sees.
# The SLA requirement currently states "...the system will guarantee that
excess resources taken from an Org will be restored to it within N minutes of
its need for them". I read that as starting a timer when a queue has enough
pending tasks AND somebody else has stolen capacity from it. If you start a
timer too early, you kill too early, and you lose the benefit of letting queues
borrow capacities from others for N minutes.
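To make the condition in the two points above concrete, here is a minimal sketch of when a reclaim timer would start under this reading of the SLA. The class, field, and method names are hypothetical and are not the Capacity Scheduler's actual code. In particular, "enough pending tasks" is simplified here to "at least one pending task while running below GC", which is an assumption.

```java
import java.util.Arrays;
import java.util.List;

public class ReclaimTimerSketch {
    // Hypothetical, simplified snapshot of a queue (not a real scheduler class).
    public static final class QueueState {
        final String name;
        final int guaranteedCapacity; // absolute GC, in slots
        final int runningSlots;       // slots currently in use by this queue
        final int pendingTasks;       // tasks waiting for a slot

        public QueueState(String name, int guaranteedCapacity,
                          int runningSlots, int pendingTasks) {
            this.name = name;
            this.guaranteedCapacity = guaranteedCapacity;
            this.runningSlots = runningSlots;
            this.pendingTasks = pendingTasks;
        }

        boolean isOverCapacity() {
            return runningSlots > guaranteedCapacity;
        }

        // Assumption: a queue "needs" capacity if it is below GC and has work waiting.
        boolean needsCapacity() {
            return pendingTasks > 0 && runningSlots < guaranteedCapacity;
        }
    }

    // Start a timer only if the queue needs capacity AND some other queue
    // is running over its own capacity, so there is something to reclaim.
    public static boolean shouldStartReclaimTimer(QueueState queue,
                                                  List<QueueState> allQueues) {
        if (!queue.needsCapacity()) {
            return false;
        }
        for (QueueState other : allQueues) {
            if (other != queue && other.isOverCapacity()) {
                return true;
            }
        }
        return false; // nobody is over capacity, e.g. a TT is down
    }

    public static void main(String[] args) {
        QueueState q1 = new QueueState("Q1", 2, 1, 3);
        QueueState q2 = new QueueState("Q2", 2, 3, 0);
        System.out.println(shouldStartReclaimTimer(q1, Arrays.asList(q1, q2)));
    }
}
```

With this check, a queue that is short of slots only because a TT died never starts a timer, since no other queue is over capacity and there would be nothing to kill when the timer fired.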
> When computing absolute guaranteed capacity (GC) from a percent value,
> Capacity Scheduler should round up floats, rather than truncate them.
> --------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-5003
> URL: https://issues.apache.org/jira/browse/HADOOP-5003
> Project: Hadoop Core
> Issue Type: Bug
> Components: contrib/capacity-sched
> Reporter: Vivek Ratan
> Priority: Minor
>
> The Capacity Scheduler calculates a queue's absolute GC value by getting its
> percent of the total cluster capacity (which is a float, since the configured
> GC% is a float) and casting it to an int. Casting a float to an int truncates
> toward zero, so for these non-negative values it always rounds down. For very
> small clusters, this can result in the GC of a queue
> being one lower than what it should be. For example, if Q1 has a GC of 50%,
> Q2 has a GC of 40%, and Q3 has a GC of 10%, and if the cluster capacity is 4
> (as we have, in our test cases), Q1's GC works out to 2, Q2's to 1, and Q3's
> to 0 with today's code. Q2's capacity should really be 2, as 40% of 4 is 1.6,
> which rounds to 2.
> A simple fix is to use Math.round() rather than a cast to int.
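
The arithmetic in the description can be checked with a short sketch. The class and method names below are hypothetical and only illustrate the difference between the cast and Math.round(); they are not the scheduler's actual code. Note that Math.round() rounds to the nearest integer, so Q3 (10% of 4 = 0.4) still works out to 0.

```java
public class GcRounding {
    // Behaviour described in the issue: the cast drops the fractional part.
    public static int truncatedGc(float gcPercent, int clusterCapacity) {
        return (int) (gcPercent * clusterCapacity / 100);
    }

    // Proposed fix: round to the nearest integer instead of truncating.
    public static int roundedGc(float gcPercent, int clusterCapacity) {
        return Math.round(gcPercent * clusterCapacity / 100);
    }

    public static void main(String[] args) {
        float[] percents = {50f, 40f, 10f}; // Q1, Q2, Q3 from the example
        int clusterCapacity = 4;
        for (float p : percents) {
            System.out.println(p + "% of " + clusterCapacity
                    + " -> cast: " + truncatedGc(p, clusterCapacity)
                    + ", Math.round: " + roundedGc(p, clusterCapacity));
        }
    }
}
```

For the three queues in the example, the cast gives 2, 1, and 0, while Math.round() gives 2, 2, and 0.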