On Wed, Jan 20, 2016 at 11:34:48AM -0800, Shaohua Li wrote:
> On Wed, Jan 20, 2016 at 02:05:35PM -0500, Vivek Goyal wrote:
> > On Wed, Jan 20, 2016 at 09:49:16AM -0800, Shaohua Li wrote:
> > > Hi,
> > > 
> > > Currently we have two iocontrollers: blk-throttling is bandwidth based,
> > > CFQ is weight based. It would be great if there were a unified
> > > iocontroller for the two. Also, blk-mq doesn't support an ioscheduler,
> > > leaving blk-throttling the only option for blk-mq. It's time to have a
> > > scalable iocontroller that supports both bandwidth and weight based
> > > control and works with blk-mq.
> > > 
> > > blk-throttling is a good candidate; it works for both blk-mq and the
> > > legacy queue. It has a global lock, which is scary for scalability, but
> > > it's not terrible in practice. In my test, NVMe IOPS can reach 1M/s with
> > > all CPUs running IO, and enabling blk-throttle costs around 2~3% IOPS
> > > and 10% CPU utilization. I'd expect this isn't a big problem for today's
> > > workloads. This patchset then tries to make a unified iocontroller,
> > > leveraging blk-throttling.
> > > 
> > > The idea is pretty simple. If we know the disk's total bandwidth, we can
> > > calculate each cgroup's bandwidth according to its weight, and
> > > blk-throttling can use the calculated bandwidth to throttle the cgroup.
> > > Disk total bandwidth changes dramatically with the IO pattern, so a long
> > > history is meaningless. The simple algorithm in patch 1 works pretty
> > > well when the IO pattern changes.
> > > 
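If I am reading this right, the per-cgroup limit is basically the estimated
disk bandwidth split proportionally by weight. A minimal sketch of that
calculation, with made-up names (these are not the identifiers used in the
patches):

/* Hypothetical illustration, not code from the patchset. */
static unsigned long long cgroup_bps(unsigned long long disk_bps,
				     unsigned int weight,
				     unsigned int total_active_weight)
{
	/* each cgroup gets a share of the disk proportional to its weight */
	return disk_bps * weight / total_active_weight;
}

For example, with an estimated disk bandwidth of 400MB/s and two active
cgroups with weights 100 and 300, the cgroups would be throttled to roughly
100MB/s and 300MB/s respectively (before the 1/8 headroom described below).
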
> > > This is a feedback system. If we underestimate the disk's total
> > > bandwidth, we assign less bandwidth to the cgroups; the cgroups then
> > > dispatch less IO, and an even lower total bandwidth is estimated. To
> > > break the loop, the cgroup bandwidth calculation always uses
> > > (1 + 1/8) * disk_bandwidth. Another issue is that a cgroup could be
> > > inactive. If inactive cgroups are accounted in, the other cgroups are
> > > assigned less bandwidth and so dispatch less IO, and the estimated disk
> > > bandwidth drops further. To avoid this, we periodically check the
> > > cgroups and exclude inactive ones.
> > > 
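So per period the budget assignment would look roughly like the sketch
below (again, just my reading of the description, with hypothetical names
and structures, not the patchset's code):

/* Hypothetical illustration, not code from the patchset. */
struct tg_sketch {
	unsigned int weight;
	int active;			/* dispatched IO in the last check interval? */
	unsigned long long bps_limit;	/* computed throttling limit */
};

static void assign_budgets(struct tg_sketch *tgs, int nr,
			   unsigned long long est_disk_bps)
{
	/* (1 + 1/8) * disk_bandwidth, so an underestimate can correct itself */
	unsigned long long budget = est_disk_bps + (est_disk_bps >> 3);
	unsigned int total_weight = 0;
	int i;

	for (i = 0; i < nr; i++)	/* exclude inactive cgroups */
		if (tgs[i].active)
			total_weight += tgs[i].weight;

	if (!total_weight)
		return;

	for (i = 0; i < nr; i++)
		if (tgs[i].active)
			tgs[i].bps_limit = budget * tgs[i].weight / total_weight;
}
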
> > > To test this, create two fio jobs and assign them different weights.
> > > You will see that the jobs get bandwidth roughly proportional to their
> > > weights.
> > 
> > Patches look pretty small. Nice to see an implementation which will work
> > with faster devices and get away from the dependency on CFQ.
> > 
> > How does one switch between weight based and bandwidth based throttling?
> > What's the default?
> > 
> > So this has been implemented at the throttling layer. Is weight based
> > throttling enabled by default, or does one need to enable it explicitly?
> 
> So in the current implementation, only one of weight/bandwidth can be
> enabled. After one is enabled, switching to the other is forbidden. It
> should not be hard to allow switching, but mixing the two in one
> hierarchy doesn't sound trivial.
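
Ok. So internally I imagine something like a mode flag that gets latched by
whichever interface is configured first, along these lines (purely
hypothetical sketch, just to check my understanding):

/* Hypothetical illustration, not the patchset's code. */
enum limit_mode { LIMIT_NONE, LIMIT_WEIGHT, LIMIT_BW };

static int set_limit_mode(enum limit_mode *cur, enum limit_mode want)
{
	/* the first configuration picks the mode; switching later is refused */
	if (*cur != LIMIT_NONE && *cur != want)
		return -1;	/* presumably -EBUSY or similar in the kernel */
	*cur = want;
	return 0;
}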

So is this selection per device? It would be good if you also provided
steps to test it. I am going through the code now and will figure it out
eventually, but if you give the steps, it makes it a little easier.

Is this one-way selection system wide or per device?

Thanks
Vivek
