Re: [PATCH V6 00/18] blk-throttle: add .low limit

2017-09-22 Thread Paolo Valente

> On 5 Sep 2017, at 23:02, Shaohua Li wrote:
> 
> On Thu, Aug 31, 2017 at 09:24:23AM +0200, Paolo VALENTE wrote:
>> 
>>> On 15 Jan 2017, at 04:42, Shaohua Li wrote:
>>> 
>>> Hi,
>>> 
>>> cgroup still lacks a good iocontroller. CFQ works well for hard disks, but not
>>> so well for SSDs. This patch set tries to add a conservative limit to
>>> blk-throttle. It isn't proportional scheduling, but it can help prioritize
>>> cgroups. There are several advantages to choosing blk-throttle:
>>> - blk-throttle resides early in the block stack. It works for both bio- and
>>>   request-based queues.
>>> - blk-throttle is lightweight in general. It still takes the queue lock, but
>>>   it's not hard to implement a per-cpu cache and remove the lock contention.
>>> - blk-throttle doesn't use the 'idle disk' mechanism that CFQ/BFQ use. That
>>>   mechanism is proven to harm performance on fast SSDs.
>>> 
>>> The patch set adds a new io.low limit for blk-throttle. It's only for cgroup2.
>>> The existing io.max is hard-limit throttling: a cgroup with a max limit never
>>> dispatches more IO than its max limit. io.low, in contrast, is best-effort
>>> throttling: cgroups with a 'low' limit can run above their 'low' limit at
>>> appropriate times. Specifically, if all cgroups reach their 'low' limit, all
>>> cgroups can run above their 'low' limit. If any cgroup runs under its 'low'
>>> limit, all other cgroups will run according to their 'low' limit. So the 'low'
>>> limit plays two roles: it lets cgroups use free bandwidth, and it protects
>>> each cgroup's share up to its 'low' limit.
>>> 
>>> An example usage: we have a high prio cgroup with a high 'low' limit and a low
>>> prio cgroup with a low 'low' limit. If the high prio cgroup isn't running, the
>>> low prio cgroup can run above its 'low' limit, so we don't waste bandwidth.
>>> When the high prio cgroup runs and is below its 'low' limit, the low prio
>>> cgroup will be held under its 'low' limit. This protects the high prio cgroup
>>> so it can get more resources.
>>> 
>> 
>> Hi Shaohua,
> 
> Hi,
> 
> Sorry for the late response.
>> I would like to ask you some questions, to make sure I fully
>> understand how the 'low' limit and the idle-group detection work in
>> your above scenario.  Suppose that: the drive has a random-I/O peak
>> rate of 100MB/s, the high prio group has a 'low' limit of 90 MB/s, and
>> the low prio group has a 'low' limit of 10 MB/s.  If
>> - the high prio process happens to do, say, only 5 MB/s for a given
>>  long time
>> - the low prio process constantly does greedy I/O
>> - the idle-group detection is not being used
>> then the low prio process is limited to 10 MB/s during all this time
>> interval.  And only 10% of the device bandwidth is utilized.
>> 
>> To recover lost bandwidth through idle-group detection, we need to set
>> a target IO latency for the high-prio group.  The high prio group
>> should happen to be below the threshold, and thus to be detected as
>> idle, leaving the low prio group free to use all the bandwidth.
>> 
>> Here are my questions:
>> 1) Is all I wrote above correct?
> 
> Yes
>> 2) In particular, are there perhaps other, better mechanisms to saturate
>> the bandwidth in the above scenario?
> 
> I assume it's covered by 4) below.
>> If what I wrote above is correct:
>> 3) Doesn't fluctuation occur?  I mean: when the low prio group gets
>> full bandwidth, the latency threshold of the high prio group may be
>> overcome, causing the high prio group to not be considered idle any
>> longer, and thus the low prio group to be limited again; this in turn
>> will cause the threshold to not be overcome any longer, and so on.
> 
> That's true. We try to mitigate the fluctuation by increasing the low prio
> cgroup bandwidth gradually though.
> 
>> 4) Is there a way to compute an appropriate target latency of the high
>> prio group, if it is a generic group, for which the latency
>> requirements of the processes it contains are only partially known or
>> completely unknown?  By appropriate target latency, I mean a target
>> latency that enables the framework to fully utilize the device
>> bandwidth while the high prio group is doing less I/O than its limit.
> 
> Not sure how we can do this. The device max bandwidth varies based on request
> size and read/write ratio. We don't know when the max bandwidth is reached.
> Also I think we must consider the case where the workloads never use the full
> bandwidth of a disk, which is pretty common for SSD (at least in our
> environment).
> 

Hi Shaohua,
sorry for adding this bit so late (and of course thanks for your
previous explanations).  By 'fully utilizing the device bandwidth' I
(imprecisely) didn't mean reaching the peak rate, but getting close to,
and thus utilizing, the maximum throughput achievable with the workload
being served.  But the only way to know what such maximum throughput
would be, one 

Re: [PATCH V6 00/18] blk-throttle: add .low limit

2017-09-06 Thread Shaohua Li
On Wed, Sep 06, 2017 at 09:12:20AM +0800, Joseph Qi wrote:
> Hi Shaohua,
> 
> On 17/9/6 05:02, Shaohua Li wrote:
> > On Thu, Aug 31, 2017 at 09:24:23AM +0200, Paolo VALENTE wrote:
> >>
> >>> On 15 Jan 2017, at 04:42, Shaohua Li wrote:
> >>>
> >>> Hi,
> >>>
> >>> cgroup still lacks a good iocontroller. CFQ works well for hard disks, but not
> >>> so well for SSDs. This patch set tries to add a conservative limit to
> >>> blk-throttle. It isn't proportional scheduling, but it can help prioritize
> >>> cgroups. There are several advantages to choosing blk-throttle:
> >>> - blk-throttle resides early in the block stack. It works for both bio- and
> >>>   request-based queues.
> >>> - blk-throttle is lightweight in general. It still takes the queue lock, but
> >>>   it's not hard to implement a per-cpu cache and remove the lock contention.
> >>> - blk-throttle doesn't use the 'idle disk' mechanism that CFQ/BFQ use. That
> >>>   mechanism is proven to harm performance on fast SSDs.
> >>>
> >>> The patch set adds a new io.low limit for blk-throttle. It's only for cgroup2.
> >>> The existing io.max is hard-limit throttling: a cgroup with a max limit never
> >>> dispatches more IO than its max limit. io.low, in contrast, is best-effort
> >>> throttling: cgroups with a 'low' limit can run above their 'low' limit at
> >>> appropriate times. Specifically, if all cgroups reach their 'low' limit, all
> >>> cgroups can run above their 'low' limit. If any cgroup runs under its 'low'
> >>> limit, all other cgroups will run according to their 'low' limit. So the 'low'
> >>> limit plays two roles: it lets cgroups use free bandwidth, and it protects
> >>> each cgroup's share up to its 'low' limit.
> >>>
> >>> An example usage: we have a high prio cgroup with a high 'low' limit and a low
> >>> prio cgroup with a low 'low' limit. If the high prio cgroup isn't running, the
> >>> low prio cgroup can run above its 'low' limit, so we don't waste bandwidth.
> >>> When the high prio cgroup runs and is below its 'low' limit, the low prio
> >>> cgroup will be held under its 'low' limit. This protects the high prio cgroup
> >>> so it can get more resources.
> >>>
> >>
> >> Hi Shaohua,
> > 
> > Hi,
> > 
> > Sorry for the late response.
> >> I would like to ask you some questions, to make sure I fully
> >> understand how the 'low' limit and the idle-group detection work in
> >> your above scenario.  Suppose that: the drive has a random-I/O peak
> >> rate of 100MB/s, the high prio group has a 'low' limit of 90 MB/s, and
> >> the low prio group has a 'low' limit of 10 MB/s.  If
> >> - the high prio process happens to do, say, only 5 MB/s for a given
> >>   long time
> >> - the low prio process constantly does greedy I/O
> >> - the idle-group detection is not being used
> >> then the low prio process is limited to 10 MB/s during all this time
> >> interval.  And only 10% of the device bandwidth is utilized.
> >>
> >> To recover lost bandwidth through idle-group detection, we need to set
> >> a target IO latency for the high-prio group.  The high prio group
> >> should happen to be below the threshold, and thus to be detected as
> >> idle, leaving the low prio group free to use all the bandwidth.
> >>
> >> Here are my questions:
> >> 1) Is all I wrote above correct?
> > 
> > Yes
> >> 2) In particular, are there perhaps other, better mechanisms to saturate
> >> the bandwidth in the above scenario?
> > 
> > I assume it's covered by 4) below.
> >> If what I wrote above is correct:
> >> 3) Doesn't fluctuation occur?  I mean: when the low prio group gets
> >> full bandwidth, the latency threshold of the high prio group may be
> >> overcome, causing the high prio group to not be considered idle any
> >> longer, and thus the low prio group to be limited again; this in turn
> >> will cause the threshold to not be overcome any longer, and so on.
> > 
> > That's true. We try to mitigate the fluctuation by increasing the low prio
> > cgroup bandwidth gradually though.
> > 
> >> 4) Is there a way to compute an appropriate target latency of the high
> >> prio group, if it is a generic group, for which the latency
> >> requirements of the processes it contains are only partially known or
> >> completely unknown?  By appropriate target latency, I mean a target
> >> latency that enables the framework to fully utilize the device
> >> bandwidth while the high prio group is doing less I/O than its limit.
> > 
> > Not sure how we can do this. The device max bandwidth varies based on 
> > request
> > size and read/write ratio. We don't know when the max bandwidth is reached.
> > Also I think we must consider the case where the workloads never use the full
> > bandwidth of a disk, which is pretty common for SSD (at least in our
> > environment).
> > 
> I have a question on the base latency tracking.
> From my test on SSD, write latency is much lower than read latency when doing
> 

Re: [PATCH V6 00/18] blk-throttle: add .low limit

2017-09-05 Thread Joseph Qi
Hi Shaohua,

On 17/9/6 05:02, Shaohua Li wrote:
> On Thu, Aug 31, 2017 at 09:24:23AM +0200, Paolo VALENTE wrote:
>>
>>> On 15 Jan 2017, at 04:42, Shaohua Li wrote:
>>>
>>> Hi,
>>>
>>> cgroup still lacks a good iocontroller. CFQ works well for hard disks, but not
>>> so well for SSDs. This patch set tries to add a conservative limit to
>>> blk-throttle. It isn't proportional scheduling, but it can help prioritize
>>> cgroups. There are several advantages to choosing blk-throttle:
>>> - blk-throttle resides early in the block stack. It works for both bio- and
>>>   request-based queues.
>>> - blk-throttle is lightweight in general. It still takes the queue lock, but
>>>   it's not hard to implement a per-cpu cache and remove the lock contention.
>>> - blk-throttle doesn't use the 'idle disk' mechanism that CFQ/BFQ use. That
>>>   mechanism is proven to harm performance on fast SSDs.
>>>
>>> The patch set adds a new io.low limit for blk-throttle. It's only for cgroup2.
>>> The existing io.max is hard-limit throttling: a cgroup with a max limit never
>>> dispatches more IO than its max limit. io.low, in contrast, is best-effort
>>> throttling: cgroups with a 'low' limit can run above their 'low' limit at
>>> appropriate times. Specifically, if all cgroups reach their 'low' limit, all
>>> cgroups can run above their 'low' limit. If any cgroup runs under its 'low'
>>> limit, all other cgroups will run according to their 'low' limit. So the 'low'
>>> limit plays two roles: it lets cgroups use free bandwidth, and it protects
>>> each cgroup's share up to its 'low' limit.
>>>
>>> An example usage: we have a high prio cgroup with a high 'low' limit and a low
>>> prio cgroup with a low 'low' limit. If the high prio cgroup isn't running, the
>>> low prio cgroup can run above its 'low' limit, so we don't waste bandwidth.
>>> When the high prio cgroup runs and is below its 'low' limit, the low prio
>>> cgroup will be held under its 'low' limit. This protects the high prio cgroup
>>> so it can get more resources.
>>>
>>
>> Hi Shaohua,
> 
> Hi,
> 
> Sorry for the late response.
>> I would like to ask you some questions, to make sure I fully
>> understand how the 'low' limit and the idle-group detection work in
>> your above scenario.  Suppose that: the drive has a random-I/O peak
>> rate of 100MB/s, the high prio group has a 'low' limit of 90 MB/s, and
>> the low prio group has a 'low' limit of 10 MB/s.  If
>> - the high prio process happens to do, say, only 5 MB/s for a given
>>   long time
>> - the low prio process constantly does greedy I/O
>> - the idle-group detection is not being used
>> then the low prio process is limited to 10 MB/s during all this time
>> interval.  And only 10% of the device bandwidth is utilized.
>>
>> To recover lost bandwidth through idle-group detection, we need to set
>> a target IO latency for the high-prio group.  The high prio group
>> should happen to be below the threshold, and thus to be detected as
>> idle, leaving the low prio group free to use all the bandwidth.
>>
>> Here are my questions:
>> 1) Is all I wrote above correct?
> 
> Yes
>> 2) In particular, are there perhaps other, better mechanisms to saturate
>> the bandwidth in the above scenario?
> 
> I assume it's covered by 4) below.
>> If what I wrote above is correct:
>> 3) Doesn't fluctuation occur?  I mean: when the low prio group gets
>> full bandwidth, the latency threshold of the high prio group may be
>> overcome, causing the high prio group to not be considered idle any
>> longer, and thus the low prio group to be limited again; this in turn
>> will cause the threshold to not be overcome any longer, and so on.
> 
> That's true. We try to mitigate the fluctuation by increasing the low prio
> cgroup bandwidth gradually though.
> 
>> 4) Is there a way to compute an appropriate target latency of the high
>> prio group, if it is a generic group, for which the latency
>> requirements of the processes it contains are only partially known or
>> completely unknown?  By appropriate target latency, I mean a target
>> latency that enables the framework to fully utilize the device
>> bandwidth while the high prio group is doing less I/O than its limit.
> 
> Not sure how we can do this. The device max bandwidth varies based on request
> size and read/write ratio. We don't know when the max bandwidth is reached.
> Also I think we must consider the case where the workloads never use the full
> bandwidth of a disk, which is pretty common for SSD (at least in our
> environment).
> 
I have a question on the base latency tracking.
From my test on SSD, write latency is much lower than read latency when doing
mixed read/write, but currently we only track read requests and then use
their average as the base latency. In other words, we don't distinguish read
and write now. As a result, every write request's latency will always be
considered good. So I think we have to track read and write latency
separately. Or 
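
For illustration only, here is a minimal sketch of the separate read/write
baseline tracking suggested above; the names and the plain running average
are assumptions for this example, not the patch set's actual code:

#include <stdint.h>

enum { LAT_READ, LAT_WRITE, LAT_TYPES };

struct lat_baseline {
	uint64_t nr_samples;
	uint64_t total_us;
};

static struct lat_baseline baseline[LAT_TYPES];

/* record one completed request's latency in the bucket of its type */
static void track_latency(int is_write, uint64_t lat_us)
{
	struct lat_baseline *b = &baseline[is_write ? LAT_WRITE : LAT_READ];

	b->nr_samples++;
	b->total_us += lat_us;
}

static uint64_t base_latency(int is_write)
{
	struct lat_baseline *b = &baseline[is_write ? LAT_WRITE : LAT_READ];

	return b->nr_samples ? b->total_us / b->nr_samples : 0;
}

/* compare a request against the baseline of its own type, so that cheap
 * writes are no longer judged against a read-derived baseline */
static int latency_is_good(int is_write, uint64_t lat_us, uint64_t target_us)
{
	return lat_us <= base_latency(is_write) + target_us;
}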

Re: [PATCH V6 00/18] blk-throttle: add .low limit

2017-09-05 Thread Shaohua Li
On Thu, Aug 31, 2017 at 09:24:23AM +0200, Paolo VALENTE wrote:
> 
> > On 15 Jan 2017, at 04:42, Shaohua Li wrote:
> > 
> > Hi,
> > 
> > cgroup still lacks a good iocontroller. CFQ works well for hard disks, but not
> > so well for SSDs. This patch set tries to add a conservative limit to
> > blk-throttle. It isn't proportional scheduling, but it can help prioritize
> > cgroups. There are several advantages to choosing blk-throttle:
> > - blk-throttle resides early in the block stack. It works for both bio- and
> >   request-based queues.
> > - blk-throttle is lightweight in general. It still takes the queue lock, but
> >   it's not hard to implement a per-cpu cache and remove the lock contention.
> > - blk-throttle doesn't use the 'idle disk' mechanism that CFQ/BFQ use. That
> >   mechanism is proven to harm performance on fast SSDs.
> > 
> > The patch set adds a new io.low limit for blk-throttle. It's only for cgroup2.
> > The existing io.max is hard-limit throttling: a cgroup with a max limit never
> > dispatches more IO than its max limit. io.low, in contrast, is best-effort
> > throttling: cgroups with a 'low' limit can run above their 'low' limit at
> > appropriate times. Specifically, if all cgroups reach their 'low' limit, all
> > cgroups can run above their 'low' limit. If any cgroup runs under its 'low'
> > limit, all other cgroups will run according to their 'low' limit. So the 'low'
> > limit plays two roles: it lets cgroups use free bandwidth, and it protects
> > each cgroup's share up to its 'low' limit.
> > 
> > An example usage: we have a high prio cgroup with a high 'low' limit and a low
> > prio cgroup with a low 'low' limit. If the high prio cgroup isn't running, the
> > low prio cgroup can run above its 'low' limit, so we don't waste bandwidth.
> > When the high prio cgroup runs and is below its 'low' limit, the low prio
> > cgroup will be held under its 'low' limit. This protects the high prio cgroup
> > so it can get more resources.
> > 
> 
> Hi Shaohua,

Hi,

Sorry for the late response.
> I would like to ask you some questions, to make sure I fully
> understand how the 'low' limit and the idle-group detection work in
> your above scenario.  Suppose that: the drive has a random-I/O peak
> rate of 100MB/s, the high prio group has a 'low' limit of 90 MB/s, and
> the low prio group has a 'low' limit of 10 MB/s.  If
> - the high prio process happens to do, say, only 5 MB/s for a given
>   long time
> - the low prio process constantly does greedy I/O
> - the idle-group detection is not being used
> then the low prio process is limited to 10 MB/s during all this time
> interval.  And only 10% of the device bandwidth is utilized.
> 
> To recover lost bandwidth through idle-group detection, we need to set
> a target IO latency for the high-prio group.  The high prio group
> should happen to be below the threshold, and thus to be detected as
> idle, leaving the low prio group free to use all the bandwidth.
> 
> Here are my questions:
> 1) Is all I wrote above correct?

Yes
> 2) In particular, are there perhaps other, better mechanisms to saturate
> the bandwidth in the above scenario?

I assume it's covered by 4) below.
> If what I wrote above is correct:
> 3) Doesn't fluctuation occur?  I mean: when the low prio group gets
> full bandwidth, the latency threshold of the high prio group may be
> overcome, causing the high prio group to not be considered idle any
> longer, and thus the low prio group to be limited again; this in turn
> will cause the threshold to not be overcome any longer, and so on.

That's true. We try to mitigate the fluctuation by increasing the low prio
cgroup bandwidth gradually though.

> 4) Is there a way to compute an appropriate target latency of the high
> prio group, if it is a generic group, for which the latency
> requirements of the processes it contains are only partially known or
> completely unknown?  By appropriate target latency, I mean a target
> latency that enables the framework to fully utilize the device
> bandwidth while the high prio group is doing less I/O than its limit.

Not sure how we can do this. The device max bandwidth varies based on request
size and read/write ratio. We don't know when the max bandwidth is reached.
Also I think we must consider the case where the workloads never use the full
bandwidth of a disk, which is pretty common for SSD (at least in our
environment).

Thanks,
Shaohua


Re: [PATCH V6 00/18] blk-throttle: add .low limit

2017-08-31 Thread Paolo VALENTE

> On 15 Jan 2017, at 04:42, Shaohua Li wrote:
> 
> Hi,
> 
> cgroup still lacks a good iocontroller. CFQ works well for hard disks, but not
> so well for SSDs. This patch set tries to add a conservative limit to
> blk-throttle. It isn't proportional scheduling, but it can help prioritize
> cgroups. There are several advantages to choosing blk-throttle:
> - blk-throttle resides early in the block stack. It works for both bio- and
>   request-based queues.
> - blk-throttle is lightweight in general. It still takes the queue lock, but
>   it's not hard to implement a per-cpu cache and remove the lock contention.
> - blk-throttle doesn't use the 'idle disk' mechanism that CFQ/BFQ use. That
>   mechanism is proven to harm performance on fast SSDs.
> 
> The patch set adds a new io.low limit for blk-throttle. It's only for cgroup2.
> The existing io.max is hard-limit throttling: a cgroup with a max limit never
> dispatches more IO than its max limit. io.low, in contrast, is best-effort
> throttling: cgroups with a 'low' limit can run above their 'low' limit at
> appropriate times. Specifically, if all cgroups reach their 'low' limit, all
> cgroups can run above their 'low' limit. If any cgroup runs under its 'low'
> limit, all other cgroups will run according to their 'low' limit. So the 'low'
> limit plays two roles: it lets cgroups use free bandwidth, and it protects
> each cgroup's share up to its 'low' limit.
> 
> An example usage: we have a high prio cgroup with a high 'low' limit and a low
> prio cgroup with a low 'low' limit. If the high prio cgroup isn't running, the
> low prio cgroup can run above its 'low' limit, so we don't waste bandwidth.
> When the high prio cgroup runs and is below its 'low' limit, the low prio
> cgroup will be held under its 'low' limit. This protects the high prio cgroup
> so it can get more resources.
> 

Hi Shaohua,
I would like to ask you some questions, to make sure I fully
understand how the 'low' limit and the idle-group detection work in
your above scenario.  Suppose that: the drive has a random-I/O peak
rate of 100MB/s, the high prio group has a 'low' limit of 90 MB/s, and
the low prio group has a 'low' limit of 10 MB/s.  If
- the high prio process happens to do, say, only 5 MB/s for a given
  long time
- the low prio process constantly does greedy I/O
- the idle-group detection is not being used
then the low prio process is limited to 10 MB/s during all this time
interval.  And only 10% of the device bandwidth is utilized.

To recover lost bandwidth through idle-group detection, we need to set
a target IO latency for the high-prio group.  The high prio group
should happen to be below the threshold, and thus to be detected as
idle, leaving the low prio group free to use all the bandwidth.

Here are my questions:
1) Is all I wrote above correct?
2) In particular, are there perhaps other, better mechanisms to saturate
the bandwidth in the above scenario?

If what I wrote above is correct:
3) Doesn't fluctuation occur?  I mean: when the low prio group gets
full bandwidth, the latency threshold of the high prio group may be
overcome, causing the high prio group to not be considered idle any
longer, and thus the low prio group to be limited again; this in turn
will cause the threshold to not be overcome any longer, and so on.
4) Is there a way to compute an appropriate target latency of the high
prio group, if it is a generic group, for which the latency
requirements of the processes it contains are only partially known or
completely unknown?  By appropriate target latency, I mean a target
latency that enables the framework to fully utilize the device
bandwidth while the high prio group is doing less I/O than its limit.

Thanks,
Paolo

> The implementation is simple. The disk queue has a state machine with two
> states, LIMIT_LOW and LIMIT_MAX. In each disk state, we throttle cgroups
> according to the limit of that state: the io.low limit in the LIMIT_LOW state,
> the io.max limit in LIMIT_MAX. The disk state can be upgraded/downgraded
> between LIMIT_LOW and LIMIT_MAX according to the rule above. Initially the
> disk state is LIMIT_MAX, and if no cgroup sets io.low, the disk state will
> remain in LIMIT_MAX. Systems with only io.max set will find nothing changed
> with the patches.
> 
> The first 10 patches implement the basic framework: they add the interface and
> handle the upgrade and downgrade logic. Patch 10 detects a special case where a
> cgroup is completely idle; in this case, we ignore the cgroup's limit. Patches
> 11-18 add more heuristics.
> 
> The basic framework has 2 major issues.
> 
> 1. Fluctuation. When the state is upgraded from LIMIT_LOW to LIMIT_MAX, a
> cgroup's bandwidth can change dramatically, sometimes in a way we do not
> expect. For example, one cgroup's bandwidth will drop below its io.low limit
> very soon after an upgrade. Patch 10 has more details about the issue.
> 
> 2. Idle cgroup. A cgroup with an io.low limit doesn't always dispatch enough IO.
> In above 

[PATCH V6 00/18] blk-throttle: add .low limit

2017-01-14 Thread Shaohua Li
Hi,

cgroup still lacks a good iocontroller. CFQ works well for hard disks, but not
so well for SSDs. This patch set tries to add a conservative limit to
blk-throttle. It isn't proportional scheduling, but it can help prioritize
cgroups. There are several advantages to choosing blk-throttle:
- blk-throttle resides early in the block stack. It works for both bio- and
  request-based queues.
- blk-throttle is lightweight in general. It still takes the queue lock, but
  it's not hard to implement a per-cpu cache and remove the lock contention.
- blk-throttle doesn't use the 'idle disk' mechanism that CFQ/BFQ use. That
  mechanism is proven to harm performance on fast SSDs.

The patch set adds a new io.low limit for blk-throttle. It's only for cgroup2.
The existing io.max is hard-limit throttling: a cgroup with a max limit never
dispatches more IO than its max limit. io.low, in contrast, is best-effort
throttling: cgroups with a 'low' limit can run above their 'low' limit at
appropriate times. Specifically, if all cgroups reach their 'low' limit, all
cgroups can run above their 'low' limit. If any cgroup runs under its 'low'
limit, all other cgroups will run according to their 'low' limit. So the 'low'
limit plays two roles: it lets cgroups use free bandwidth, and it protects
each cgroup's share up to its 'low' limit.

An example usage: we have a high prio cgroup with a high 'low' limit and a low
prio cgroup with a low 'low' limit. If the high prio cgroup isn't running, the
low prio cgroup can run above its 'low' limit, so we don't waste bandwidth.
When the high prio cgroup runs and is below its 'low' limit, the low prio
cgroup will be held under its 'low' limit. This protects the high prio cgroup
so it can get more resources.
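
To make the example concrete, here is a small sketch that sets a 'low' limit
for each cgroup (illustrative 90 MB/s and 10 MB/s). The cgroup paths, the
device number (8:0) and the io.max-style key=value syntax used here are
assumptions for illustration, not an exact description of the interface:

#include <stdio.h>
#include <stdlib.h>

static void write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fprintf(f, "%s\n", val);
	fclose(f);
}

int main(void)
{
	/* high prio cgroup: 'low' limit of ~90 MB/s on device 8:0 */
	write_file("/sys/fs/cgroup/high/io.low",
		   "8:0 rbps=94371840 wbps=94371840");
	/* low prio cgroup: 'low' limit of ~10 MB/s */
	write_file("/sys/fs/cgroup/low/io.low",
		   "8:0 rbps=10485760 wbps=10485760");
	return 0;
}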

The implementation is simple. The disk queue has a state machine with two
states, LIMIT_LOW and LIMIT_MAX. In each disk state, we throttle cgroups
according to the limit of that state: the io.low limit in the LIMIT_LOW state,
the io.max limit in LIMIT_MAX. The disk state can be upgraded/downgraded between
LIMIT_LOW and LIMIT_MAX according to the rule above. Initially the disk state is
LIMIT_MAX, and if no cgroup sets io.low, the disk state will remain in
LIMIT_MAX. Systems with only io.max set will find nothing changed with the
patches.
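
A simplified sketch of that state machine follows; the types, helper names and
boolean inputs are illustrative, and the kernel's actual upgrade/downgrade
checks are more involved:

enum throtl_limit_state {
	LIMIT_LOW,	/* throttle every cgroup to its io.low limit */
	LIMIT_MAX,	/* throttle every cgroup to its io.max limit */
};

struct throtl_queue_sketch {
	enum throtl_limit_state state;	/* starts out as LIMIT_MAX */
};

/* Upgrade to LIMIT_MAX when every cgroup with a 'low' limit has reached it
 * (or is considered idle), so spare bandwidth can be handed out. */
static void maybe_upgrade(struct throtl_queue_sketch *q, int all_low_reached)
{
	if (q->state == LIMIT_LOW && all_low_reached)
		q->state = LIMIT_MAX;
}

/* Downgrade to LIMIT_LOW when some cgroup falls back below its 'low' limit
 * and needs protection again. */
static void maybe_downgrade(struct throtl_queue_sketch *q, int some_below_low)
{
	if (q->state == LIMIT_MAX && some_below_low)
		q->state = LIMIT_LOW;
}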

The first 10 patches implement the basic framework: they add the interface and
handle the upgrade and downgrade logic. Patch 10 detects a special case where a
cgroup is completely idle; in this case, we ignore the cgroup's limit. Patches
11-18 add more heuristics.

The basic framework has 2 major issues.

1. Fluctuation. When the state is upgraded from LIMIT_LOW to LIMIT_MAX, a
cgroup's bandwidth can change dramatically, sometimes in a way we do not
expect. For example, one cgroup's bandwidth will drop below its io.low limit
very soon after an upgrade. Patch 10 has more details about the issue.

2. Idle cgroup. A cgroup with an io.low limit doesn't always dispatch enough IO.
Under the above upgrade rule, the disk will remain in the LIMIT_LOW state and all
other cgroups can't dispatch IO above their 'low' limit. Hence there is waste.
Patch 11 has more details about the issue.

For issue 1, we make cgroup bandwidth increase/decrease smoothly after an
upgrade/downgrade. This reduces the chance that a cgroup's bandwidth drops under
its 'low' limit rapidly. The smoothness means we could waste some bandwidth in
the transition, though. But we must pay something for sharing.
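
A minimal sketch of the smoothing idea, under assumed names and an assumed
2-second adjustment window; it only illustrates "change gradually" and is not
the patch set's exact algorithm:

#include <stdint.h>

#define ADJUST_WINDOW_MS	2000	/* assumed ramp duration */

/* Scale the enforced bps limit linearly from the old to the new value over
 * the adjustment window, instead of jumping to the new limit at once. */
static uint64_t effective_bps(uint64_t old_limit, uint64_t new_limit,
			      uint64_t ms_since_switch)
{
	if (ms_since_switch >= ADJUST_WINDOW_MS)
		return new_limit;
	if (new_limit > old_limit)	/* ramping up after an upgrade */
		return old_limit + (new_limit - old_limit) *
				   ms_since_switch / ADJUST_WINDOW_MS;
	/* ramping down after a downgrade */
	return old_limit - (old_limit - new_limit) *
			   ms_since_switch / ADJUST_WINDOW_MS;
}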

Issue 2 is very hard. We introduce two mechanisms for it. One is 'idle time' or
'think time', borrowed from CFQ. If a cgroup's average idle time is high, we
treat it as idle and its 'low' limit isn't respected. Please see patches 12-14
for details. The other is a 'latency target'. If a cgroup's IO latency is low,
we treat it as idle and its 'low' limit isn't respected. Please see patches
15-18 for details. Both mechanisms only apply when a cgroup runs below its
'low' limit.
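
For concreteness, a rough sketch of the two heuristics with made-up structure
names and a simple EWMA; the real patches track think time and latency per
cgroup per disk and with considerably more care:

#include <stdint.h>
#include <stdbool.h>

struct cgroup_io_stats {
	uint64_t avg_think_time_us;	/* EWMA of gaps between the cgroup's IOs */
	uint64_t avg_latency_us;	/* recent average IO latency */
	uint64_t idle_threshold_us;	/* per-cgroup 'idle time' knob */
	uint64_t latency_target_us;	/* per-cgroup 'latency target' knob */
};

/* Fold the gap since the cgroup's previous IO into an EWMA, similar in
 * spirit to CFQ's think-time tracking (the 1/8 weight is an assumption). */
static void update_think_time(struct cgroup_io_stats *s, uint64_t gap_us)
{
	s->avg_think_time_us = (s->avg_think_time_us * 7 + gap_us) / 8;
}

/* A cgroup judged idle does not cause others to be clamped to their 'low'
 * limits on its behalf. */
static bool cgroup_is_idle(const struct cgroup_io_stats *s)
{
	/* long think time: the cgroup isn't really using its 'low' share */
	if (s->idle_threshold_us && s->avg_think_time_us > s->idle_threshold_us)
		return true;
	/* latency already within target: it isn't suffering, so don't
	 * throttle everyone else on its account */
	if (s->latency_target_us && s->avg_latency_us <= s->latency_target_us)
		return true;
	return false;
}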

The disadvantage of blk-throttle is that it exports rather low-level knobs.
Configuration would not be easy for normal users, but it would be powerful for
experienced users.

More tuning is required of course, but otherwise this works well. Please
review, test and consider merging.

Thanks,
Shaohua

V5->V6:
- Change default setting for io.low limit. It's 0 now, which makes more sense
- The default setting for latency is still 0, the default setting for idle time
  becomes bigger. So with the default settings, cgroups have small latency but
  disk sharing could be harmed
- Addressed other issues pointed out by Tejun

V4->V5, basically address Tejun's comments:
- Change interface from 'io.high' to 'io.low' so consistent with memcg
- Change interface for 'idle time' and 'latency target'
- Make 'idle time' per-cgroup-disk instead of per-cgroup
- Change interface name for 'throttle slice'. It's not a real slice
- Make downgrade smooth too
- Make latency sampling work for both bio and request based queue
- Change latency estimation method from 'line fitting' to 'bucket based