[
https://issues.apache.org/jira/browse/HADOOP-5003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665722#action_12665722
]
Vivek Ratan commented on HADOOP-5003:
-------------------------------------
People are not going to want SLAs that are 10 mins or higher. I think they'll
be OK waiting a few mins, maybe 5. It should be OK if we let them set whatever
time they want for the queue, but print a suitable warning message indicating
that if the capacity is not reclaimed, it's likely because of failed TTs. Small
values for the SLA seem perfectly reasonable, especially when all TTs are
running.
bq. From Q2's point of view, he doesn't see that TT1 is down. He sees that he
is allocated 50% and that he isn't getting the 5 slots he should. He gets mad
that no timer is running to get him his slot back.
Well, even if you start a timer the moment the job is submitted, there is no
task to kill because nobody is running over capacity. So the timer is wasted.
This is, of course, with the assumption that the reclaim time is less than the
TT failure detection time (which, as I wrote earlier, should be allowed).
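To make this concrete, here's a minimal sketch of the reclaim check I have in mind (hypothetical names and structure, not the actual Capacity Scheduler code):
{code}
// Hypothetical sketch, not the actual Capacity Scheduler code.
// A reclaim timer only helps if some other queue is running over its
// guaranteed capacity (GC): those excess tasks are what we would kill.
// If slots are missing because TTs failed, nobody is over capacity,
// so the timer expires with nothing to kill.
static boolean anythingToReclaim(int[] runningSlots, int[] guaranteedSlots) {
  for (int i = 0; i < runningSlots.length; i++) {
    if (runningSlots[i] > guaranteedSlots[i]) {
      return true;  // queue i is over its GC; its tasks can be preempted
    }
  }
  return false;     // capacity is missing (e.g. failed TTs), not borrowed
}
{code}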
I still don't agree with #2 and #3 of your 'Take home messages'. As per
existing requirements, the SLA is in force if there are resources to be
claimed. We can change the requirements, but it's not clear to me that we
should. I'd rather modify the documentation to let users know that if they
don't see a timer being started, it's because some TTs are down, rather than
asking them to set SLAs higher than 10 mins.
I understand that you want it to be clear to the user why they're not getting
all their slots. But your only choices seem to be to force SLA times to be very
high, or to provide an explanation somewhere (in documentation, UI, or
whatever). We should do the latter, but realize that if we're accepting smaller
SLA times, setting timers early will not help - users will still not get their
slots back if TTs are down.
> When computing absolute guaranteed capacity (GC) from a percent value,
> Capacity Scheduler should round up floats, rather than truncate them.
> --------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-5003
> URL: https://issues.apache.org/jira/browse/HADOOP-5003
> Project: Hadoop Core
> Issue Type: Bug
> Components: contrib/capacity-sched
> Reporter: Vivek Ratan
> Priority: Minor
>
> The Capacity Scheduler calculates a queue's absolute GC value by getting its
> percent of the total cluster capacity (which is a float, since the configured
> GC% is a float) and casting it to an int. Casting a positive float to an int
> truncates the fractional part, so the value always rounds down. For very
> small clusters, this can result in the GC of a queue
> being one lower than what it should be. For example, if Q1 has a GC of 50%,
> Q2 has a GC of 40%, and Q3 has a GC of 10%, and if the cluster capacity is 4
> (as we have, in our test cases), Q1's GC works out to 2, Q2's to 1, and Q3's
> to 0 with today's code. Q2's GC should really be 2: 40% of 4 is 1.6, which
> rounds to 2.
> The simple fix is to use Math.round() rather than a cast to an int.
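> For illustration, a minimal sketch of the difference (a one-off computation,
> not the actual scheduler code):
> {code}
> // 40% of a 4-slot cluster is 1.6f.
> float gc = 0.4f * 4;               // 1.6
> int byCast = (int) gc;             // cast truncates: 1
> int byRound = Math.round(gc);      // Math.round: 2
> {code}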
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.