Hi,

On Thu, Jan 21, 2016 at 05:41:57PM -0500, Tejun Heo wrote:
> Hello, Shaohua.
>
> On Thu, Jan 21, 2016 at 02:24:51PM -0800, Shaohua Li wrote:
> > > Have you tried with some level, say 5, of nesting? IIRC, how it
> > > implements hierarchical control is rather braindead (and yeah I'm
> > > responsible for the damage).
> >
> > Not yet. Agreed, nesting increases the locking time. But my test is
> > already an extreme case: I had 32 threads on 2 nodes running IO and
> > the IOPS was 1M/s. I don't think a real workload will act like this.
> > The locking issue definitely should be revisited in the future though.
>
> The thing is that most of the possible contentions can be removed by
> implementing per-cpu cache which shouldn't be too difficult. 10%
> extra cost on current gen hardware is already pretty high.
I did think about this. A per-cpu cache does sound straightforward, but
it could severely impact fairness. For example, say we give each cpu a
1MB budget; as long as a cgroup stays within that budget, we don't take
the lock. But with 128 CPUs, the cgroup can consume up to 128 * 1MB of
extra budget, which badly breaks fairness. I have no idea how this can
be fixed.

> > Disagree that io time is a better choice. Actually I think IO time
> > will be
>
> If IO time isn't the right term, let's call it IO cost. Whatever the
> term, the actual fraction of cost that each IO is incurring.
>
> > the least we should consider for SSD. Ideally, if we knew each IO's
> > cost and the disk's total capability, things would be easy.
> > Unfortunately there is no way to know IO cost. Bandwidth isn't
> > perfect, but it might be the best we have.
> >
> > I don't know why you think devices are predictable. SSDs are never
> > predictable. I'm not sure how you would measure IO time. Modern SSDs
> > have large queue depths (blk-mq supports a queue depth of 10k). That
> > means we can send 10k IOs in a few ns. Measuring IO start/finish time
> > doesn't help either: a 4k IO at queue depth 1 might take 10us, while
> > a 4k IO at queue depth 100 might take more than 100us. The measured
> > IO time grows with queue depth. The fundamental problem is that a
> > disk with a large queue depth can buffer a practically unbounded
> > number of requests. I think IO time only works for a queue-depth-1
> > disk.
>
> They're way more predictable than rotational devices when measured
> over a period. I don't think we'll be able to measure anything
> meaningful at individual command level but aggregate numbers should be
> fairly stable. A simple approximation of IO cost such as fixed cost
> per IO + cost proportional to IO size would do a far better job than
> just depending on bandwidth or iops and that requires approximating
> two variables over time. I'm not sure how easy / feasible that
> actually would be tho.

It still sounds like IO time; otherwise I can't imagine how we could
measure the cost.
If we use some sort of aggregate number, it's just a variation of
bandwidth, e.g. cost = bandwidth / ios. I understand you probably want
something like: determine the disk's total resources, predict the
resource cost of each IO, and then use that info to arbitrate between
cgroups. I don't see how that's possible today. A disk that is using
all of its resources can still accept newly queued IO. Maybe someday a
fancy device can export that info.

Thanks,
Shaohua