[ 
https://issues.apache.org/jira/browse/HADOOP-5003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666237#action_12666237
 ] 

Hemanth Yamijala commented on HADOOP-5003:
------------------------------------------

Vivek, the connection between the reclaim time limit and the time to detect TT 
failures lies in the corner cases that you so rightly identified. We know there 
are scenarios where we cannot honor the SLA. So not committing to it during that 
time seems like a conservative approach to me, that's all.

bq. Like I said earlier, setting reclaim time limits to larger than 10 mins is 
not going to fly well with users, IMO.

I had a brief chat with Milind on this, and he actually thought that timeouts on 
the order of 15-20 minutes would be OK, because users typically wait much 
longer than that currently.

An in-between approach could be as follows:
- Start a timer irrespective of whether there's spare capacity.
- When you need to reclaim, and find that there's no capacity to reclaim, log a 
warning that can be viewed by users (so maybe in some format in the Web UI for 
the queue) that the SLA may not be met because the system state is not 
synchronized.
- Then reset the timer to check again after a lost tracker timeout interval.
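
As a rough illustration of the three steps above, here is a minimal Java sketch. 
It is not taken from the Capacity Scheduler code: the class ReclaimTimerSketch, 
the Action enum, the check() method and the millisecond parameters are all 
hypothetical names chosen for this example, and it assumes the scheduler calls 
check() periodically with the number of slots it could currently reclaim.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed reclaim timer; not actual scheduler code.
class ReclaimTimerSketch {
    enum Action { WAIT, RECLAIM, WARN_AND_RETRY }

    private final long lostTrackerTimeoutMs;
    private long deadlineMs;

    // Stand-in for a warning users could see, e.g. on the queue's Web UI page.
    final List<String> warnings = new ArrayList<>();

    // Step 1: the timer starts right away, whether or not spare capacity exists.
    ReclaimTimerSketch(long nowMs, long reclaimTimeLimitMs, long lostTrackerTimeoutMs) {
        this.lostTrackerTimeoutMs = lostTrackerTimeoutMs;
        this.deadlineMs = nowMs + reclaimTimeLimitMs;
    }

    Action check(long nowMs, int reclaimableSlots) {
        if (nowMs < deadlineMs) {
            return Action.WAIT;
        }
        if (reclaimableSlots > 0) {
            return Action.RECLAIM;
        }
        // Step 2: nothing to reclaim, so warn that the SLA may not be met.
        warnings.add("SLA may not be met: system state is not synchronized");
        // Step 3: check again after one lost tracker timeout interval.
        deadlineMs = nowMs + lostTrackerTimeoutMs;
        return Action.WARN_AND_RETRY;
    }
}
```

The intervals passed to the constructor are left to the caller, since the 
actual reclaim time limit and lost tracker timeout are configuration values.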

Would that work? Thoughts?


> When computing absolute guaranteed capacity (GC) from a percent value, 
> Capacity Scheduler should round up floats, rather than truncate them.
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5003
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5003
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vivek Ratan
>            Priority: Minor
>
> The Capacity Scheduler calculates a queue's absolute GC value by getting its 
> percent of the total cluster capacity (which is a float, since the configured 
> GC% is a float) and casting it to an int. Casting a float to an int always 
> rounds down. For very small clusters, this can result in the GC of a queue 
> being one lower than what it should be. For example, if Q1 has a GC of 50%, 
> Q2 has a GC of 40%, and Q3 has a GC of 10%, and if the cluster capacity is 4 
> (as we have, in our test cases), Q1's GC works out to 2, Q2's to 1, and Q3's 
> to 0 with today's code. Q2's capacity should really be 2, since 40% of 4 is 
> 1.6, which rounds up to 2. 
> A simple fix is to use Math.round() rather than casting to an int. 
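
The arithmetic in the issue description can be checked with a small Java 
sketch. The class and method names (GuaranteedCapacityDemo, truncatedCapacity, 
roundedCapacity) are made up for this example and do not come from the 
scheduler source; only the cast-versus-Math.round() difference is taken from 
the description.

```java
// Hypothetical demo of the rounding problem; not the scheduler's own code.
class GuaranteedCapacityDemo {
    // Behaviour described in the issue: the cast drops the fractional part.
    static int truncatedCapacity(float gcPercent, int clusterCapacity) {
        return (int) (gcPercent * clusterCapacity / 100);
    }

    // Proposed fix: Math.round(float) rounds to the nearest int.
    static int roundedCapacity(float gcPercent, int clusterCapacity) {
        return Math.round(gcPercent * clusterCapacity / 100);
    }

    public static void main(String[] args) {
        float[] gcPercents = {50f, 40f, 10f}; // Q1, Q2, Q3 from the example
        int clusterCapacity = 4;
        for (int i = 0; i < gcPercents.length; i++) {
            System.out.println("Q" + (i + 1)
                + " truncated=" + truncatedCapacity(gcPercents[i], clusterCapacity)
                + " rounded=" + roundedCapacity(gcPercents[i], clusterCapacity));
        }
    }
}
```

With a cluster capacity of 4, only Q2 differs between the two methods: 1.6 
truncates to 1 but rounds to 2, while Q1 stays at 2 and Q3 stays at 0.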

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
