Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
On Tue, Jan 29, 2019 at 07:39:38PM +0100, Andrea Righi wrote: > On Mon, Jan 28, 2019 at 02:26:20PM -0500, Vivek Goyal wrote: > > On Mon, Jan 28, 2019 at 06:41:29PM +0100, Andrea Righi wrote: > > > Hi Vivek, > > > > > > sorry for the late reply. > > > > > > On Mon, Jan 21, 2019 at 04:47:15PM -0500, Vivek Goyal wrote: > > > > On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote: > > > > > > > > [..] > > > > > Alright, let's skip the root cgroup for now. I think the point here is > > > > > if we want to provide sync() isolation among cgroups or not. > > > > > > > > > > According to the manpage: > > > > > > > > > >sync() causes all pending modifications to filesystem > > > > > metadata and cached file data to be > > > > >written to the underlying filesystems. > > > > > > > > > > And: > > > > >According to the standard specification (e.g., POSIX.1-2001), > > > > > sync() schedules the writes, but > > > > >may return before the actual writing is done. However > > > > > Linux waits for I/O completions, and > > > > >thus sync() or syncfs() provide the same guarantees as fsync > > > > > called on every file in the sys‐ > > > > >tem or filesystem respectively. > > > > > > > > > > Excluding the root cgroup, do you think a sync() issued inside a > > > > > specific cgroup should wait for I/O completions only for the writes > > > > > that > > > > > have been generated by that cgroup? > > > > > > > > Can we account I/O towards the cgroup which issued "sync" only if write > > > > rate of sync cgroup is higher than cgroup to which page belongs to. Will > > > > that solve problem, assuming its doable? > > > > > > Maybe this would mitigate the problem, in part, but it doesn't solve it. > > > > > > The thing is, if a dirty page belongs to a slow cgroup and a fast cgroup > > > issues "sync", the fast cgroup needs to wait a lot, because writeback is > > > happening at the speed of the slow cgroup. 
> > > > Hi Andrea, > > > > But that's true only for I/O which has already been submitted to block > > layer, right? Any new I/O yet to be submitted could still be attributed > > to faster cgroup requesting sync. > > Right. If we could bump up the new I/O yet to be submitted I think we > could effectively prevent the priority inversion problem (the ongoing > writeback I/O should be negligible). > > > > > Until and unless cgroups limits are absurdly low, it should not take very > > long for already submitted I/O to finish. If yes, then in practice, it > > might not be a big problem? > > I was actually doing my tests with a very low limit (1MB/s both for rbps > and wbps), but this shows the problem very well I think. > > Here's what I'm doing: > > [ slow cgroup (1Mbps read/write) ] > >$ cat /sys/fs/cgroup/unified/cg1/io.max >259:0 rbps=1048576 wbps=1048576 riops=max wiops=max >$ cat /proc/self/cgroup >0::/cg1 > >$ fio --rw=write --bs=1M --size=32M --numjobs=16 --name=writer > --time_based --runtime=30 > > [ fast cgroup (root cgroup, no limitation) ] > ># cat /proc/self/cgroup >0::/ > ># time sync >real 9m32,618s >user 0m0,000s >sys0m0,018s > > With this simple test I can easily trigger hung task timeout warnings > and make the whole system totally sluggish (even the processes running > in the root cgroup). > > When fio ends, writeback is still taking forever to complete, as you can > see by the insane amount that sync takes to complete. > Yeah sync() needs to be treated differently, but its kind of special too. We don't want slow to run sync() and backup fast doing sync() because we make all of the io go based on the submitting cgroup. The problem here is we don't know who's more important until we get to the blk cgroup layer, and even then sometimes we can't tell (different hierarchies would make this tricky with io.weight or io.latency). We could treat it like REQ_META and just let everything go through and back charge. 
This feels like a way for the slow group to cheat though, unless we just throttle the shit out of it before returning to user space. I'll have to think about this some more. Thanks,

Josef
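Josef's back-charging idea (let the I/O through REQ_META-style, record its cost against the owning group, and settle up before the task returns to user space) can be sketched as a toy model. This is illustrative Python only; names like `backcharge` and `exit_to_user_delay` are invented for the example and are not kernel interfaces:

```python
class Cgroup:
    """Toy back-charging model: I/O issued on behalf of a group goes
    through immediately and its cost is recorded as debt; the group's
    tasks pay the debt as a delay on return to user space."""

    def __init__(self, bps_limit):
        self.bps_limit = bps_limit  # bytes per second the group is allowed
        self.debt_bytes = 0

    def backcharge(self, nbytes):
        # The I/O already completed (e.g. sync() pushed it out); just
        # remember who owes for it.
        self.debt_bytes += nbytes

    def exit_to_user_delay(self):
        # Seconds of throttling owed at the group's configured rate.
        delay = self.debt_bytes / self.bps_limit
        self.debt_bytes = 0
        return delay


slow = Cgroup(bps_limit=1 << 20)   # 1 MiB/s limit
slow.backcharge(4 << 20)           # a root-cgroup sync() wrote 4 MiB for it
print(slow.exit_to_user_delay())   # 4.0 -> four seconds of throttling owed
```

The point of the sketch is the cheating concern Josef raises: without the settle-up step, the slow group gets its dirty pages written out at someone else's (unthrottled) rate for free.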
Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
On Mon, Jan 28, 2019 at 02:26:20PM -0500, Vivek Goyal wrote: > On Mon, Jan 28, 2019 at 06:41:29PM +0100, Andrea Righi wrote: > > Hi Vivek, > > > > sorry for the late reply. > > > > On Mon, Jan 21, 2019 at 04:47:15PM -0500, Vivek Goyal wrote: > > > On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote: > > > > > > [..] > > > > Alright, let's skip the root cgroup for now. I think the point here is > > > > if we want to provide sync() isolation among cgroups or not. > > > > > > > > According to the manpage: > > > > > > > >sync() causes all pending modifications to filesystem > > > > metadata and cached file data to be > > > >written to the underlying filesystems. > > > > > > > > And: > > > >According to the standard specification (e.g., POSIX.1-2001), > > > > sync() schedules the writes, but > > > >may return before the actual writing is done. However Linux > > > > waits for I/O completions, and > > > >thus sync() or syncfs() provide the same guarantees as fsync > > > > called on every file in the sys‐ > > > >tem or filesystem respectively. > > > > > > > > Excluding the root cgroup, do you think a sync() issued inside a > > > > specific cgroup should wait for I/O completions only for the writes that > > > > have been generated by that cgroup? > > > > > > Can we account I/O towards the cgroup which issued "sync" only if write > > > rate of sync cgroup is higher than cgroup to which page belongs to. Will > > > that solve problem, assuming its doable? > > > > Maybe this would mitigate the problem, in part, but it doesn't solve it. > > > > The thing is, if a dirty page belongs to a slow cgroup and a fast cgroup > > issues "sync", the fast cgroup needs to wait a lot, because writeback is > > happening at the speed of the slow cgroup. > > Hi Andrea, > > But that's true only for I/O which has already been submitted to block > layer, right? Any new I/O yet to be submitted could still be attributed > to faster cgroup requesting sync. Right. 
If we could bump up the new I/O yet to be submitted I think we could
effectively prevent the priority inversion problem (the ongoing
writeback I/O should be negligible).

> Until and unless cgroups limits are absurdly low, it should not take very
> long for already submitted I/O to finish. If yes, then in practice, it
> might not be a big problem?

I was actually doing my tests with a very low limit (1 MB/s for both
rbps and wbps), but I think this shows the problem very well.

Here's what I'm doing:

[ slow cgroup (1 MB/s read/write) ]

  $ cat /sys/fs/cgroup/unified/cg1/io.max
  259:0 rbps=1048576 wbps=1048576 riops=max wiops=max

  $ cat /proc/self/cgroup
  0::/cg1

  $ fio --rw=write --bs=1M --size=32M --numjobs=16 --name=writer \
        --time_based --runtime=30

[ fast cgroup (root cgroup, no limitation) ]

  # cat /proc/self/cgroup
  0::/

  # time sync
  real    9m32,618s
  user    0m0,000s
  sys     0m0,018s

With this simple test I can easily trigger hung task timeout warnings
and make the whole system totally sluggish (even the processes running
in the root cgroup).

When fio ends, writeback still takes forever to complete, as you can
see from the huge amount of time that sync needs to return.

-Andrea
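The measured delay is roughly what simple arithmetic predicts: with `--time_based` fio keeps rewriting the same 16 files of 32 MiB each, so the dirty set is bounded by about 512 MiB, which must drain at the 1 MiB/s wbps limit before sync() can return. A minimal sketch of that estimate (hypothetical helper, not part of the patch set):

```python
# Toy model of the priority inversion above: sync() issued from the
# unthrottled root cgroup must wait for writeback of the slow cgroup's
# dirty pages, which proceeds at the slow cgroup's wbps limit.

def sync_wait_seconds(jobs, file_size_mb, wbps_limit_mb):
    """Worst-case drain time of the dirty set.

    With --time_based fio rewrites the same files in a loop, so the
    dirty set is bounded by jobs * file_size regardless of runtime."""
    dirty_mb = jobs * file_size_mb
    return dirty_mb / wbps_limit_mb


wait = sync_wait_seconds(jobs=16, file_size_mb=32, wbps_limit_mb=1)
print(f"~{wait:.0f}s (~{wait / 60:.1f} min)")  # ~512s (~8.5 min)
```

That lines up well with the 9m32s Andrea observes; the extra minute is plausibly metadata writeback and the read side of the 1 MB/s rbps limit.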
Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
On Mon, Jan 28, 2019 at 06:41:29PM +0100, Andrea Righi wrote: > Hi Vivek, > > sorry for the late reply. > > On Mon, Jan 21, 2019 at 04:47:15PM -0500, Vivek Goyal wrote: > > On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote: > > > > [..] > > > Alright, let's skip the root cgroup for now. I think the point here is > > > if we want to provide sync() isolation among cgroups or not. > > > > > > According to the manpage: > > > > > >sync() causes all pending modifications to filesystem > > > metadata and cached file data to be > > >written to the underlying filesystems. > > > > > > And: > > >According to the standard specification (e.g., POSIX.1-2001), > > > sync() schedules the writes, but > > >may return before the actual writing is done. However Linux > > > waits for I/O completions, and > > >thus sync() or syncfs() provide the same guarantees as fsync > > > called on every file in the sys‐ > > >tem or filesystem respectively. > > > > > > Excluding the root cgroup, do you think a sync() issued inside a > > > specific cgroup should wait for I/O completions only for the writes that > > > have been generated by that cgroup? > > > > Can we account I/O towards the cgroup which issued "sync" only if write > > rate of sync cgroup is higher than cgroup to which page belongs to. Will > > that solve problem, assuming its doable? > > Maybe this would mitigate the problem, in part, but it doesn't solve it. > > The thing is, if a dirty page belongs to a slow cgroup and a fast cgroup > issues "sync", the fast cgroup needs to wait a lot, because writeback is > happening at the speed of the slow cgroup. Hi Andrea, But that's true only for I/O which has already been submitted to block layer, right? Any new I/O yet to be submitted could still be attributed to faster cgroup requesting sync. Until and unless cgroups limits are absurdly low, it should not take very long for already submitted I/O to finish. If yes, then in practice, it might not be a big problem? 
Vivek > > Ideally in this case we should bump up the writeback speed, maybe even > temporarily inherit the write rate of the sync cgroup, similarly to a > priority-inversion locking scenario, but I think it's not doable at the > moment without applying big changes. > > Or we could isolate the sync domain, meaning that a cgroup issuing a > sync will only wait for the syncing of the pages that belong to that > sync cgroup. But probably also this method requires big changes... > > -Andrea
Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
Hi Vivek, sorry for the late reply. On Mon, Jan 21, 2019 at 04:47:15PM -0500, Vivek Goyal wrote: > On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote: > > [..] > > Alright, let's skip the root cgroup for now. I think the point here is > > if we want to provide sync() isolation among cgroups or not. > > > > According to the manpage: > > > >sync() causes all pending modifications to filesystem metadata > > and cached file data to be > >written to the underlying filesystems. > > > > And: > >According to the standard specification (e.g., POSIX.1-2001), sync() > > schedules the writes, but > >may return before the actual writing is done. However Linux > > waits for I/O completions, and > >thus sync() or syncfs() provide the same guarantees as fsync called > > on every file in the sys‐ > >tem or filesystem respectively. > > > > Excluding the root cgroup, do you think a sync() issued inside a > > specific cgroup should wait for I/O completions only for the writes that > > have been generated by that cgroup? > > Can we account I/O towards the cgroup which issued "sync" only if write > rate of sync cgroup is higher than cgroup to which page belongs to. Will > that solve problem, assuming its doable? Maybe this would mitigate the problem, in part, but it doesn't solve it. The thing is, if a dirty page belongs to a slow cgroup and a fast cgroup issues "sync", the fast cgroup needs to wait a lot, because writeback is happening at the speed of the slow cgroup. Ideally in this case we should bump up the writeback speed, maybe even temporarily inherit the write rate of the sync cgroup, similarly to a priority-inversion locking scenario, but I think it's not doable at the moment without applying big changes. Or we could isolate the sync domain, meaning that a cgroup issuing a sync will only wait for the syncing of the pages that belong to that sync cgroup. But probably also this method requires big changes... -Andrea
Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote:

[..]

> Alright, let's skip the root cgroup for now. I think the point here is
> if we want to provide sync() isolation among cgroups or not.
>
> According to the manpage:
>
>     sync() causes all pending modifications to filesystem metadata
>     and cached file data to be written to the underlying filesystems.
>
> And:
>
>     According to the standard specification (e.g., POSIX.1-2001),
>     sync() schedules the writes, but may return before the actual
>     writing is done. However Linux waits for I/O completions, and
>     thus sync() or syncfs() provide the same guarantees as fsync
>     called on every file in the system or filesystem respectively.
>
> Excluding the root cgroup, do you think a sync() issued inside a
> specific cgroup should wait for I/O completions only for the writes
> that have been generated by that cgroup?

Can we account I/O towards the cgroup which issued "sync" only if the
write rate of the sync cgroup is higher than that of the cgroup the
page belongs to? Would that solve the problem, assuming it's doable?

Vivek
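Vivek's accounting rule can be sketched as follows. This is purely illustrative Python; the `Cg` and `charge_target` names are invented for the example, and the comparison stands in for whatever rate estimate the kernel would actually use:

```python
from collections import namedtuple

# Sketch of the proposal: bill writeback of a page to the sync()-ing
# cgroup only when that cgroup is the faster of the two, so a fast
# sync() is never accounted at a slow group's rate (and a slow group
# can't get a free ride by sync()-ing someone else's pages).
Cg = namedtuple("Cg", ["name", "rate_bps"])


def charge_target(sync_cg, page_cg):
    """Pick which cgroup the writeback I/O is accounted to."""
    return sync_cg if sync_cg.rate_bps > page_cg.rate_bps else page_cg


slow = Cg("cg1", 1 << 20)         # 1 MiB/s limited group that dirtied the page
fast = Cg("root", float("inf"))   # unlimited group running sync()

print(charge_target(fast, slow).name)  # root: the faster sync-ing group pays
print(charge_target(slow, fast).name)  # root: a slow sync() can't slow it down
```

As the thread goes on to discuss, this helps only with I/O not yet submitted; anything already queued at the block layer still drains at the slow group's rate.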
Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
On Fri, Jan 18, 2019 at 02:46:53PM -0500, Josef Bacik wrote: > On Fri, Jan 18, 2019 at 07:44:03PM +0100, Andrea Righi wrote: > > On Fri, Jan 18, 2019 at 11:35:31AM -0500, Josef Bacik wrote: > > > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote: > > > > This is a redesign of my old cgroup-io-throttle controller: > > > > https://lwn.net/Articles/330531/ > > > > > > > > I'm resuming this old patch to point out a problem that I think is still > > > > not solved completely. > > > > > > > > = Problem = > > > > > > > > The io.max controller works really well at limiting synchronous I/O > > > > (READs), but a lot of I/O requests are initiated outside the context of > > > > the process that is ultimately responsible for its creation (e.g., > > > > WRITEs). > > > > > > > > Throttling at the block layer in some cases is too late and we may end > > > > up slowing down processes that are not responsible for the I/O that > > > > is being processed at that level. > > > > > > How so? The writeback threads are per-cgroup and have the cgroup stuff > > > set > > > properly. So if you dirty a bunch of pages, they are associated with your > > > cgroup, and then writeback happens and it's done in the writeback thread > > > associated with your cgroup and then that is throttled. Then you are > > > throttled > > > at balance_dirty_pages() because the writeout is taking longer. > > > > Right, writeback is per-cgroup and slowing down writeback affects only > > that specific cgroup, but, there are cases where other processes from > > other cgroups may require to wait on that writeback to complete before > > doing I/O (for example an fsync() to a file shared among different > > cgroups). In this case we may end up blocking cgroups that shouldn't be > > blocked, that looks like a priority-inversion problem. This is the > > problem that I'm trying to address. > > Well this case is a misconfiguration, you shouldn't be sharing files between > cgroups. 
But even if you are, fsync() is synchronous, we should be getting > the > context from the process itself and thus should have its own rules applied. > There's nothing we can do for outstanding IO, but that shouldn't be that much. > That would need to be dealt with on a per-contoller basis. OK, fair point. We shouldn't be sharing files between cgroups. I'm still not sure if we can have similar issues with metadata I/O (that may introduce latencies like the sync() scenario), I have to investigate more and do more tests. > > > > > > > > > I introduced the blk_cgroup_congested() stuff for paths that it's not > > > easy to > > > clearly tie IO to the thing generating the IO, such as readahead and > > > such. If > > > you are running into this case that may be something worth using. Course > > > it > > > only works for io.latency now but there's no reason you can't add support > > > to it > > > for io.max or whatever. > > > > IIUC blk_cgroup_congested() is used in readahead I/O (and swap with > > memcg), something like this: if the cgroup is already congested don't > > generate extra I/O due to readahead. Am I right? > > Yeah, but that's just how it's currently used, it can be used any which way we > feel like. I think it'd be very interesting to have the possibility to either throttle I/O before writeback or during writeback. Right now we can only throttle writeback. Maybe we can try to introduce some kind of dirty page throttling controller using blk_cgroup_congested()... Opinions? > > > > > > > > > > > > > > = Proposed solution = > > > > > > > > The main idea of this controller is to split I/O measurement and I/O > > > > throttling: I/O is measured at the block layer for READS, at page cache > > > > (dirty pages) for WRITEs, and processes are limited while they're > > > > generating I/O at the VFS level, based on the measured I/O. 
> > > > > > > > > > This is what blk_cgroup_congested() is meant to accomplish, I would > > > suggest > > > looking into that route and simply changing the existing io controller > > > you are > > > using to take advantage of that so it will actually throttle things. > > > Then just > > > sprinkle it around the areas where we indirectly generate IO. Thanks, > > > > Absolutely, I can probably use blk_cgroup_congested() as a method to > > determine when a cgroup should be throttled (instead of doing my own > > I/O measuring), but to prevent the "slow writeback slowing down other > > cgroups" issue I still need to apply throttling when pages are dirtied > > in page cache. > > Again this is just a fuckup from a configuration stand point. The argument > could be made that sync() is probably broken here, but I think the right > solution here is to just pass the cgroup context along with the writeback > information and use that if it's set instead. Thanks, Alright, let's skip the root cgroup for now. I think the point here is if we want to provide sync() isolation among cgroups or not. According to the manpage:
Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
On Fri, Jan 18, 2019 at 07:44:03PM +0100, Andrea Righi wrote: > On Fri, Jan 18, 2019 at 11:35:31AM -0500, Josef Bacik wrote: > > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote: > > > This is a redesign of my old cgroup-io-throttle controller: > > > https://lwn.net/Articles/330531/ > > > > > > I'm resuming this old patch to point out a problem that I think is still > > > not solved completely. > > > > > > = Problem = > > > > > > The io.max controller works really well at limiting synchronous I/O > > > (READs), but a lot of I/O requests are initiated outside the context of > > > the process that is ultimately responsible for its creation (e.g., > > > WRITEs). > > > > > > Throttling at the block layer in some cases is too late and we may end > > > up slowing down processes that are not responsible for the I/O that > > > is being processed at that level. > > > > How so? The writeback threads are per-cgroup and have the cgroup stuff set > > properly. So if you dirty a bunch of pages, they are associated with your > > cgroup, and then writeback happens and it's done in the writeback thread > > associated with your cgroup and then that is throttled. Then you are > > throttled > > at balance_dirty_pages() because the writeout is taking longer. > > Right, writeback is per-cgroup and slowing down writeback affects only > that specific cgroup, but, there are cases where other processes from > other cgroups may require to wait on that writeback to complete before > doing I/O (for example an fsync() to a file shared among different > cgroups). In this case we may end up blocking cgroups that shouldn't be > blocked, that looks like a priority-inversion problem. This is the > problem that I'm trying to address. Well this case is a misconfiguration, you shouldn't be sharing files between cgroups. But even if you are, fsync() is synchronous, we should be getting the context from the process itself and thus should have its own rules applied. 
There's nothing we can do for outstanding IO, but that shouldn't be that
much. That would need to be dealt with on a per-controller basis.

> > I introduced the blk_cgroup_congested() stuff for paths that it's not
> > easy to clearly tie IO to the thing generating the IO, such as
> > readahead and such. If you are running into this case that may be
> > something worth using. Course it only works for io.latency now but
> > there's no reason you can't add support to it for io.max or whatever.
>
> IIUC blk_cgroup_congested() is used in readahead I/O (and swap with
> memcg), something like this: if the cgroup is already congested don't
> generate extra I/O due to readahead. Am I right?

Yeah, but that's just how it's currently used, it can be used any which
way we feel like.

> > > = Proposed solution =
> > >
> > > The main idea of this controller is to split I/O measurement and I/O
> > > throttling: I/O is measured at the block layer for READS, at page cache
> > > (dirty pages) for WRITEs, and processes are limited while they're
> > > generating I/O at the VFS level, based on the measured I/O.
> >
> > This is what blk_cgroup_congested() is meant to accomplish, I would
> > suggest looking into that route and simply changing the existing io
> > controller you are using to take advantage of that so it will actually
> > throttle things. Then just sprinkle it around the areas where we
> > indirectly generate IO. Thanks,
>
> Absolutely, I can probably use blk_cgroup_congested() as a method to
> determine when a cgroup should be throttled (instead of doing my own
> I/O measuring), but to prevent the "slow writeback slowing down other
> cgroups" issue I still need to apply throttling when pages are dirtied
> in page cache.

Again this is just a fuckup from a configuration standpoint.
The argument could be made that sync() is probably broken here, but I think the right solution here is to just pass the cgroup context along with the writeback information and use that if it's set instead. Thanks, Josef
Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
On Fri, Jan 18, 2019 at 06:07:45PM +0100, Paolo Valente wrote:
> > On 18 Jan 2019, at 17:35, Josef Bacik wrote:
> >
> > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> >> This is a redesign of my old cgroup-io-throttle controller:
> >> https://lwn.net/Articles/330531/
> >>
> >> I'm resuming this old patch to point out a problem that I think is
> >> still not solved completely.
> >>
> >> = Problem =
> >>
> >> The io.max controller works really well at limiting synchronous I/O
> >> (READs), but a lot of I/O requests are initiated outside the context
> >> of the process that is ultimately responsible for its creation (e.g.,
> >> WRITEs).
> >>
> >> Throttling at the block layer in some cases is too late and we may
> >> end up slowing down processes that are not responsible for the I/O
> >> that is being processed at that level.
> >
> > How so? The writeback threads are per-cgroup and have the cgroup stuff
> > set properly. So if you dirty a bunch of pages, they are associated
> > with your cgroup, and then writeback happens and it's done in the
> > writeback thread associated with your cgroup and then that is
> > throttled. Then you are throttled at balance_dirty_pages() because the
> > writeout is taking longer.
>
> IIUC, Andrea described this problem: certain processes in a certain
> group dirty a lot of pages, causing writeback to start. Then some other
> blameless process in the same group experiences very high latency, in
> spite of the fact that it has to do little I/O.
>
> Does your blk_cgroup_congested() stuff solve this issue?
>
> Or simply I didn't get what Andrea meant at all :)
>
> Thanks,
> Paolo

Yes, there is also this problem: providing fairness among processes
running inside the same cgroup. This is a tough one, because once the
I/O limit is reached, whichever process comes next gets punished, even
if it hasn't done any I/O before. Well, my proposal doesn't solve this
problem.
To solve this one in the "throttling" scenario, we should probably save some information about the previously generated I/O activity and apply a delay proportional to that (like a dynamic weight for each process inside each cgroup). AFAICS the io.max controller doesn't solve this problem either. -Andrea
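Andrea's "dynamic weight" idea might look something like the following toy model (hypothetical names, no kernel API implied): each task keeps a decaying account of the I/O it previously generated, and a cgroup-level throttle delay is split across tasks in proportion to that account, so a task that has done little I/O is barely punished when the group hits its limit:

```python
class TaskIoAccount:
    """Per-task decayed I/O accounting: tasks that generated more I/O
    recently absorb a larger share of the cgroup's throttling delay."""

    def __init__(self, decay=0.5):
        self.decay = decay
        self.recent_io = 0.0

    def charge(self, nbytes):
        self.recent_io += nbytes

    def tick(self):
        # Called periodically so old activity stops counting against us.
        self.recent_io *= self.decay


def share_delay(total_delay, tasks):
    """Split a cgroup-level throttle delay proportionally to recent I/O."""
    total = sum(t.recent_io for t in tasks) or 1.0
    return [total_delay * t.recent_io / total for t in tasks]


heavy, light = TaskIoAccount(), TaskIoAccount()
heavy.charge(900)   # dirtied a lot of pages
light.charge(100)   # nearly idle
print(share_delay(10.0, [heavy, light]))  # [9.0, 1.0]: heavy sleeps ~9s, light ~1s
```

Under this scheme the "blameless" process Paolo describes would pay almost nothing, at the cost of tracking one decayed counter per task.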
Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
On Fri, Jan 18, 2019 at 11:35:31AM -0500, Josef Bacik wrote: > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote: > > This is a redesign of my old cgroup-io-throttle controller: > > https://lwn.net/Articles/330531/ > > > > I'm resuming this old patch to point out a problem that I think is still > > not solved completely. > > > > = Problem = > > > > The io.max controller works really well at limiting synchronous I/O > > (READs), but a lot of I/O requests are initiated outside the context of > > the process that is ultimately responsible for its creation (e.g., > > WRITEs). > > > > Throttling at the block layer in some cases is too late and we may end > > up slowing down processes that are not responsible for the I/O that > > is being processed at that level. > > How so? The writeback threads are per-cgroup and have the cgroup stuff set > properly. So if you dirty a bunch of pages, they are associated with your > cgroup, and then writeback happens and it's done in the writeback thread > associated with your cgroup and then that is throttled. Then you are > throttled > at balance_dirty_pages() because the writeout is taking longer. Right, writeback is per-cgroup and slowing down writeback affects only that specific cgroup, but, there are cases where other processes from other cgroups may require to wait on that writeback to complete before doing I/O (for example an fsync() to a file shared among different cgroups). In this case we may end up blocking cgroups that shouldn't be blocked, that looks like a priority-inversion problem. This is the problem that I'm trying to address. > > I introduced the blk_cgroup_congested() stuff for paths that it's not easy to > clearly tie IO to the thing generating the IO, such as readahead and such. If > you are running into this case that may be something worth using. Course it > only works for io.latency now but there's no reason you can't add support to > it > for io.max or whatever. 
IIUC blk_cgroup_congested() is used in readahead I/O (and swap with memcg), something like this: if the cgroup is already congested don't generate extra I/O due to readahead. Am I right? > > > > > = Proposed solution = > > > > The main idea of this controller is to split I/O measurement and I/O > > throttling: I/O is measured at the block layer for READS, at page cache > > (dirty pages) for WRITEs, and processes are limited while they're > > generating I/O at the VFS level, based on the measured I/O. > > > > This is what blk_cgroup_congested() is meant to accomplish, I would suggest > looking into that route and simply changing the existing io controller you are > using to take advantage of that so it will actually throttle things. Then > just > sprinkle it around the areas where we indirectly generate IO. Thanks, Absolutely, I can probably use blk_cgroup_congested() as a method to determine when a cgroup should be throttled (instead of doing my own I/O measuring), but to prevent the "slow writeback slowing down other cgroups" issue I still need to apply throttling when pages are dirtied in page cache. Thanks, -Andrea
Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
On Fri, Jan 18, 2019 at 06:07:45PM +0100, Paolo Valente wrote:
> > On 18 Jan 2019, at 17:35, Josef Bacik wrote:
> >
> > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> >> This is a redesign of my old cgroup-io-throttle controller:
> >> https://lwn.net/Articles/330531/
> >>
> >> I'm resuming this old patch to point out a problem that I think is
> >> still not solved completely.
> >>
> >> = Problem =
> >>
> >> The io.max controller works really well at limiting synchronous I/O
> >> (READs), but a lot of I/O requests are initiated outside the context
> >> of the process that is ultimately responsible for its creation (e.g.,
> >> WRITEs).
> >>
> >> Throttling at the block layer in some cases is too late and we may
> >> end up slowing down processes that are not responsible for the I/O
> >> that is being processed at that level.
> >
> > How so? The writeback threads are per-cgroup and have the cgroup stuff
> > set properly. So if you dirty a bunch of pages, they are associated
> > with your cgroup, and then writeback happens and it's done in the
> > writeback thread associated with your cgroup and then that is
> > throttled. Then you are throttled at balance_dirty_pages() because the
> > writeout is taking longer.
>
> IIUC, Andrea described this problem: certain processes in a certain
> group dirty a lot of pages, causing writeback to start. Then some other
> blameless process in the same group experiences very high latency, in
> spite of the fact that it has to do little I/O.

In that case the io controller isn't doing its job and needs to be fixed
(or reconfigured). io.latency guards against this, I assume io.max would
keep this from happening if it were configured properly.

> Does your blk_cgroup_congested() stuff solve this issue?
>
> Or simply I didn't get what Andrea meant at all :)

I _think_ Andrea is talking about the fact that we can generate IO
indirectly and never get throttled for it, which is what
blk_cgroup_congested() is meant to address. I added it specifically
because some low prio task was just allocating all of the memory on the
system and causing a lot of pressure because of swapping, but there was
no direct feedback loop there. blk_cgroup_congested() provides that
feedback loop. Course I could be wrong too and we're all just talking
past each other ;). Thanks,

Josef
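The feedback loop Josef describes can be modeled in miniature. This is a userspace toy, not the actual `blk_cgroup_congested()` implementation: the idea is simply that optional, indirectly generated I/O (readahead, speculative swap-in) is skipped while the group is over budget, which starves the task that created the pressure instead of queueing ever more work:

```python
class CgroupBudget:
    """Toy model of congestion feedback: track in-flight I/O per group
    and report 'congested' once a configured limit is reached."""

    def __init__(self, limit):
        self.limit = limit
        self.inflight = 0

    def congested(self):
        return self.inflight >= self.limit

    def submit(self, n=1):
        self.inflight += n

    def complete(self, n=1):
        self.inflight = max(0, self.inflight - n)


def maybe_readahead(group, pages):
    # Optional I/O: drop it under congestion rather than piling on.
    if group.congested():
        return 0
    group.submit(pages)
    return pages


g = CgroupBudget(limit=8)
print(maybe_readahead(g, 8))  # 8: issued, group is now at its limit
print(maybe_readahead(g, 8))  # 0: congested, readahead silently skipped
```

Mandatory I/O still goes through in this model; only the speculative work is shed, which is why the mechanism provides back-pressure without ever blocking correctness-critical writes.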
Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
> On 18 Jan 2019, at 17:35, Josef Bacik wrote:
>
> On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
>> This is a redesign of my old cgroup-io-throttle controller:
>> https://lwn.net/Articles/330531/
>>
>> I'm resuming this old patch to point out a problem that I think is
>> still not solved completely.
>>
>> = Problem =
>>
>> The io.max controller works really well at limiting synchronous I/O
>> (READs), but a lot of I/O requests are initiated outside the context
>> of the process that is ultimately responsible for its creation (e.g.,
>> WRITEs).
>>
>> Throttling at the block layer in some cases is too late and we may
>> end up slowing down processes that are not responsible for the I/O
>> that is being processed at that level.
>
> How so? The writeback threads are per-cgroup and have the cgroup stuff
> set properly. So if you dirty a bunch of pages, they are associated
> with your cgroup, and then writeback happens and it's done in the
> writeback thread associated with your cgroup and then that is
> throttled. Then you are throttled at balance_dirty_pages() because the
> writeout is taking longer.

IIUC, Andrea described this problem: certain processes in a certain
group dirty a lot of pages, causing writeback to start. Then some other
blameless process in the same group experiences very high latency, in
spite of the fact that it has to do little I/O.

Does your blk_cgroup_congested() stuff solve this issue?

Or simply I didn't get what Andrea meant at all :)

Thanks,
Paolo

> I introduced the blk_cgroup_congested() stuff for paths that it's not
> easy to clearly tie IO to the thing generating the IO, such as
> readahead and such. If you are running into this case that may be
> something worth using. Course it only works for io.latency now but
> there's no reason you can't add support to it for io.max or whatever.
>
>> = Proposed solution =
>>
>> The main idea of this controller is to split I/O measurement and I/O
>> throttling: I/O is measured at the block layer for READS, at page
>> cache (dirty pages) for WRITEs, and processes are limited while
>> they're generating I/O at the VFS level, based on the measured I/O.
>
> This is what blk_cgroup_congested() is meant to accomplish, I would
> suggest looking into that route and simply changing the existing io
> controller you are using to take advantage of that so it will actually
> throttle things. Then just sprinkle it around the areas where we
> indirectly generate IO. Thanks,
>
> Josef
Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> This is a redesign of my old cgroup-io-throttle controller:
> https://lwn.net/Articles/330531/
>
> I'm resuming this old patch to point out a problem that I think is still
> not solved completely.
>
> = Problem =
>
> The io.max controller works really well at limiting synchronous I/O
> (READs), but a lot of I/O requests are initiated outside the context of
> the process that is ultimately responsible for their creation (e.g.,
> WRITEs).
>
> Throttling at the block layer in some cases is too late and we may end
> up slowing down processes that are not responsible for the I/O that
> is being processed at that level.

How so? The writeback threads are per-cgroup and have the cgroup stuff set
properly. So if you dirty a bunch of pages, they are associated with your
cgroup, and then writeback happens and it's done in the writeback thread
associated with your cgroup and then that is throttled. Then you are
throttled at balance_dirty_pages() because the writeout is taking longer.

I introduced the blk_cgroup_congested() stuff for paths where it's not easy
to clearly tie IO to the thing generating the IO, such as readahead and
such. If you are running into this case that may be something worth using.
Of course it only works for io.latency now but there's no reason you can't
add support to it for io.max or whatever.

> = Proposed solution =
>
> The main idea of this controller is to split I/O measurement and I/O
> throttling: I/O is measured at the block layer for READs, at page cache
> (dirty pages) for WRITEs, and processes are limited while they're
> generating I/O at the VFS level, based on the measured I/O.

This is what blk_cgroup_congested() is meant to accomplish, I would suggest
looking into that route and simply changing the existing io controller you
are using to take advantage of that so it will actually throttle things.
Then just sprinkle it around the areas where we indirectly generate IO.

Thanks,

Josef
Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
> On 18 Jan 2019, at 12:10, Andrea Righi wrote:
>
> On Fri, Jan 18, 2019 at 12:04:17PM +0100, Paolo Valente wrote:
>>
>>> On 18 Jan 2019, at 11:31, Andrea Righi wrote:
>>>
>>> This is a redesign of my old cgroup-io-throttle controller:
>>> https://lwn.net/Articles/330531/
>>>
>>> I'm resuming this old patch to point out a problem that I think is still
>>> not solved completely.
>>>
>>> = Problem =
>>>
>>> The io.max controller works really well at limiting synchronous I/O
>>> (READs), but a lot of I/O requests are initiated outside the context of
>>> the process that is ultimately responsible for their creation (e.g.,
>>> WRITEs).
>>>
>>> Throttling at the block layer in some cases is too late and we may end
>>> up slowing down processes that are not responsible for the I/O that
>>> is being processed at that level.
>>>
>>> = Proposed solution =
>>>
>>> The main idea of this controller is to split I/O measurement and I/O
>>> throttling: I/O is measured at the block layer for READs, at page cache
>>> (dirty pages) for WRITEs, and processes are limited while they're
>>> generating I/O at the VFS level, based on the measured I/O.
>>>
>>
>> Hi Andrea,
>> what about the case where two processes are dirtying the same
>> pages? Which will be charged?
>>
>> Thanks,
>> Paolo
>
> Hi Paolo,
>
> in this case only the first one will be charged for the I/O activity
> (the one that changes a page from clean to dirty). This is probably not
> totally fair in some cases, but I think it's a good compromise,

Absolutely, I just wanted to better understand this point.

> in the end, rewriting the same page over and over while it's already
> dirty doesn't actually generate I/O activity, until the page is flushed
> back to disk.

Right.

Thanks,
Paolo

> Obviously I'm open to other better ideas and suggestions.
>
> Thanks!
> -Andrea
Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
On Fri, Jan 18, 2019 at 12:04:17PM +0100, Paolo Valente wrote:
>
>> On 18 Jan 2019, at 11:31, Andrea Righi wrote:
>>
>> This is a redesign of my old cgroup-io-throttle controller:
>> https://lwn.net/Articles/330531/
>>
>> I'm resuming this old patch to point out a problem that I think is still
>> not solved completely.
>>
>> = Problem =
>>
>> The io.max controller works really well at limiting synchronous I/O
>> (READs), but a lot of I/O requests are initiated outside the context of
>> the process that is ultimately responsible for their creation (e.g.,
>> WRITEs).
>>
>> Throttling at the block layer in some cases is too late and we may end
>> up slowing down processes that are not responsible for the I/O that
>> is being processed at that level.
>>
>> = Proposed solution =
>>
>> The main idea of this controller is to split I/O measurement and I/O
>> throttling: I/O is measured at the block layer for READs, at page cache
>> (dirty pages) for WRITEs, and processes are limited while they're
>> generating I/O at the VFS level, based on the measured I/O.
>>
>
> Hi Andrea,
> what about the case where two processes are dirtying the same
> pages? Which will be charged?
>
> Thanks,
> Paolo

Hi Paolo,

in this case only the first one will be charged for the I/O activity
(the one that changes a page from clean to dirty). This is probably not
totally fair in some cases, but I think it's a good compromise; in the
end, rewriting the same page over and over while it's already dirty
doesn't actually generate I/O activity, until the page is flushed back
to disk.

Obviously I'm open to other better ideas and suggestions.

Thanks!
-Andrea
Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
> On 18 Jan 2019, at 11:31, Andrea Righi wrote:
>
> This is a redesign of my old cgroup-io-throttle controller:
> https://lwn.net/Articles/330531/
>
> I'm resuming this old patch to point out a problem that I think is still
> not solved completely.
>
> = Problem =
>
> The io.max controller works really well at limiting synchronous I/O
> (READs), but a lot of I/O requests are initiated outside the context of
> the process that is ultimately responsible for their creation (e.g.,
> WRITEs).
>
> Throttling at the block layer in some cases is too late and we may end
> up slowing down processes that are not responsible for the I/O that
> is being processed at that level.
>
> = Proposed solution =
>
> The main idea of this controller is to split I/O measurement and I/O
> throttling: I/O is measured at the block layer for READs, at page cache
> (dirty pages) for WRITEs, and processes are limited while they're
> generating I/O at the VFS level, based on the measured I/O.
>

Hi Andrea,
what about the case where two processes are dirtying the same pages?
Which will be charged?

Thanks,
Paolo

> = Example =
>
> Here's a trivial example: create 2 cgroups, set an io.max limit of
> 10MB/s, run a write-intensive workload on both and, after a while, from
> the root cgroup, run "sync".
>
>  # cat /proc/self/cgroup
>  0::/cg1
>  # fio --rw=write --bs=1M --size=32M --numjobs=16 --name=seeker --time_based --runtime=30
>
>  # cat /proc/self/cgroup
>  0::/cg2
>  # fio --rw=write --bs=1M --size=32M --numjobs=16 --name=seeker --time_based --runtime=30
>
> - io.max controller:
>
>  # echo "259:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/unified/cg1/io.max
>  # echo "259:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/unified/cg2/io.max
>
>  # cat /proc/self/cgroup
>  0::/
>  # time sync
>
>  real 0m51,241s
>  user 0m0,000s
>  sys  0m0,113s
>
> Ideally "sync" should complete almost immediately, because the root
> cgroup is unlimited and it's not doing any I/O at all, but instead it's
> blocked for more than 50 sec with io.max, because the writeback is
> throttled to satisfy the io.max limits.
>
> - fsio controller:
>
>  # echo "259:0 10 10" > /sys/fs/cgroup/unified/cg1/fsio.max_mbs
>  # echo "259:0 10 10" > /sys/fs/cgroup/unified/cg2/fsio.max_mbs
>
>  [you can find details about the syntax in the documentation patch]
>
>  # cat /proc/self/cgroup
>  0::/
>  # time sync
>
>  real 0m0,146s
>  user 0m0,003s
>  sys  0m0,001s
>
> = Questions =
>
> Q: Do we need another controller?
> A: Probably not; I think it would be better to integrate this policy (or
>    something similar) in the current blkio controller. This is just to
>    highlight the problem and get some ideas on how to address it.
>
> Q: What about proportional limits / latency?
> A: It should be trivial to add latency-based limits if we integrate this
>    in the current I/O controller. About proportional limits (weights),
>    they're strictly related to I/O scheduling and, since this controller
>    doesn't touch I/O dispatching policies, it's not trivial to implement
>    proportional limits (bandwidth limiting is definitely more
>    straightforward).
>
> Q: Applying delays at the VFS layer doesn't prevent I/O spikes during
>    writeback, right?
> A: Correct. The tradeoff here is to tolerate I/O bursts during writeback
>    to avoid priority-inversion problems in the system.
>
> Andrea Righi (3):
>   fsio-throttle: documentation
>   fsio-throttle: controller infrastructure
>   fsio-throttle: instrumentation
>
>  Documentation/cgroup-v1/fsio-throttle.txt | 142 +
>  block/blk-core.c                          |  10 +
>  include/linux/cgroup_subsys.h             |   4 +
>  include/linux/fsio-throttle.h             |  43 +++
>  include/linux/writeback.h                 |   7 +-
>  init/Kconfig                              |  11 +
>  kernel/cgroup/Makefile                    |   1 +
>  kernel/cgroup/fsio-throttle.c             | 501 ++
>  mm/filemap.c                              |  20 +-
>  mm/page-writeback.c                       |  14 +-
>  10 files changed, 749 insertions(+), 4 deletions(-)
[RFC PATCH 0/3] cgroup: fsio throttle controller
This is a redesign of my old cgroup-io-throttle controller:
https://lwn.net/Articles/330531/

I'm resuming this old patch to point out a problem that I think is still
not solved completely.

= Problem =

The io.max controller works really well at limiting synchronous I/O
(READs), but a lot of I/O requests are initiated outside the context of
the process that is ultimately responsible for their creation (e.g.,
WRITEs).

Throttling at the block layer in some cases is too late and we may end
up slowing down processes that are not responsible for the I/O that
is being processed at that level.

= Proposed solution =

The main idea of this controller is to split I/O measurement and I/O
throttling: I/O is measured at the block layer for READs, at page cache
(dirty pages) for WRITEs, and processes are limited while they're
generating I/O at the VFS level, based on the measured I/O.

= Example =

Here's a trivial example: create 2 cgroups, set an io.max limit of
10MB/s, run a write-intensive workload on both and, after a while, from
the root cgroup, run "sync".

 # cat /proc/self/cgroup
 0::/cg1
 # fio --rw=write --bs=1M --size=32M --numjobs=16 --name=seeker --time_based --runtime=30

 # cat /proc/self/cgroup
 0::/cg2
 # fio --rw=write --bs=1M --size=32M --numjobs=16 --name=seeker --time_based --runtime=30

- io.max controller:

 # echo "259:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/unified/cg1/io.max
 # echo "259:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/unified/cg2/io.max

 # cat /proc/self/cgroup
 0::/
 # time sync

 real 0m51,241s
 user 0m0,000s
 sys  0m0,113s

Ideally "sync" should complete almost immediately, because the root
cgroup is unlimited and it's not doing any I/O at all, but instead it's
blocked for more than 50 sec with io.max, because the writeback is
throttled to satisfy the io.max limits.

- fsio controller:

 # echo "259:0 10 10" > /sys/fs/cgroup/unified/cg1/fsio.max_mbs
 # echo "259:0 10 10" > /sys/fs/cgroup/unified/cg2/fsio.max_mbs

 [you can find details about the syntax in the documentation patch]

 # cat /proc/self/cgroup
 0::/
 # time sync

 real 0m0,146s
 user 0m0,003s
 sys  0m0,001s

= Questions =

Q: Do we need another controller?
A: Probably not; I think it would be better to integrate this policy (or
   something similar) in the current blkio controller. This is just to
   highlight the problem and get some ideas on how to address it.

Q: What about proportional limits / latency?
A: It should be trivial to add latency-based limits if we integrate this
   in the current I/O controller. About proportional limits (weights),
   they're strictly related to I/O scheduling and, since this controller
   doesn't touch I/O dispatching policies, it's not trivial to implement
   proportional limits (bandwidth limiting is definitely more
   straightforward).

Q: Applying delays at the VFS layer doesn't prevent I/O spikes during
   writeback, right?
A: Correct. The tradeoff here is to tolerate I/O bursts during writeback
   to avoid priority-inversion problems in the system.

Andrea Righi (3):
  fsio-throttle: documentation
  fsio-throttle: controller infrastructure
  fsio-throttle: instrumentation

 Documentation/cgroup-v1/fsio-throttle.txt | 142 +
 block/blk-core.c                          |  10 +
 include/linux/cgroup_subsys.h             |   4 +
 include/linux/fsio-throttle.h             |  43 +++
 include/linux/writeback.h                 |   7 +-
 init/Kconfig                              |  11 +
 kernel/cgroup/Makefile                    |   1 +
 kernel/cgroup/fsio-throttle.c             | 501 ++
 mm/filemap.c                              |  20 +-
 mm/page-writeback.c                       |  14 +-
 10 files changed, 749 insertions(+), 4 deletions(-)