Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir
On 18/3/5 04:23, Tejun Heo wrote:
> Hello, Joseph.
>
> Sorry about late reply.
>
> On Wed, Feb 28, 2018 at 02:52:10PM +0800, Joseph Qi wrote:
>> In current code, I'm afraid pd_offline_fn() as well as the rest of the
>> destruction have to be called together under the same blkcg->lock and
>> q->queue_lock.
>> For example, if we split pd_offline_fn() and radix_tree_delete()
>> into 2 phases, it may introduce a race between blkcg_deactivate_policy()
>> on queue exit and blkcg_css_free(), which will result in
>> pd_offline_fn() being called twice.
>
> So, yeah, the sync scheme around blkg is pretty brittle and we'd need
> some restructuring to separate out blkg offlining and release, but it
> looks like that'd be the right thing to do, no?
>
Agree. Apart from the restriction above, I don't see any other issue for now.
I'll try to fix it the way you suggested and post v3.

Thanks,
Joseph
Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir
Hello, Joseph.

Sorry about late reply.

On Wed, Feb 28, 2018 at 02:52:10PM +0800, Joseph Qi wrote:
> In current code, I'm afraid pd_offline_fn() as well as the rest of the
> destruction have to be called together under the same blkcg->lock and
> q->queue_lock.
> For example, if we split pd_offline_fn() and radix_tree_delete()
> into 2 phases, it may introduce a race between blkcg_deactivate_policy()
> on queue exit and blkcg_css_free(), which will result in
> pd_offline_fn() being called twice.

So, yeah, the sync scheme around blkg is pretty brittle and we'd need
some restructuring to separate out blkg offlining and release, but it
looks like that'd be the right thing to do, no?

Thanks.

--
tejun
Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir
Hi Tejun,

On 18/2/28 02:33, Tejun Heo wrote:
> Hello, Joseph.
>
> On Sat, Feb 24, 2018 at 09:45:49AM +0800, Joseph Qi wrote:
>>> IIRC, as long as the blkcg and the device are there, the blkgs aren't
>>> gonna be destroyed. So, if you have a ref to the blkcg through
>>> tryget, the blkg shouldn't go away.
>>>
>>
>> Maybe we have a misunderstanding here.
>>
>> In this case, blkg doesn't go away as we have rcu protection, but
>> blkg_destroy() can be called, in which blkg_put() will put the last
>> refcnt and then schedule __blkg_release_rcu().
>>
>> css refcnt can't prevent blkcg css from offlining; instead it is css
>> online_cnt that does.
>>
>> css_tryget() will only get a refcnt of blkcg css, but can't be
>> guaranteed to fail when css is confirmed to kill.
>
> Ah, you're right. I was thinking we only destroy blkgs from blkcg
> release path. Given that we primarily use blkcg refcnting to pin
> them, I believe that's what we should do - ie. only call
> pd_offline_fn() from blkcg_css_offline() path and do the rest of
> destruction from blkcg_css_free(). What do you think?
>
In current code, I'm afraid pd_offline_fn() as well as the rest of the
destruction have to be called together under the same blkcg->lock and
q->queue_lock.
For example, if we split pd_offline_fn() and radix_tree_delete()
into 2 phases, it may introduce a race between blkcg_deactivate_policy()
on queue exit and blkcg_css_free(), which will result in
pd_offline_fn() being called twice.

Thanks,
Joseph
Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir
Hello, Joseph.

On Sat, Feb 24, 2018 at 09:45:49AM +0800, Joseph Qi wrote:
>> IIRC, as long as the blkcg and the device are there, the blkgs aren't
>> gonna be destroyed. So, if you have a ref to the blkcg through
>> tryget, the blkg shouldn't go away.
>>
>
> Maybe we have misunderstanding here.
>
> In this case, blkg doesn't go away as we have rcu protect, but
> blkg_destroy() can be called, in which blkg_put() will put the last
> refcnt and then schedule __blkg_release_rcu().
>
> css refcnt can't prevent blkcg css from offlining, instead it is css
> online_cnt.
>
> css_tryget() will only get a refcnt of blkcg css, but can't be
> guaranteed to fail when css is confirmed to kill.

Ah, you're right. I was thinking we only destroy blkgs from blkcg
release path. Given that we primarily use blkcg refcnting to pin
them, I believe that's what we should do - ie. only call
pd_offline_fn() from blkcg_css_offline() path and do the rest of
destruction from blkcg_css_free(). What do you think?

Thanks.

--
tejun
Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir
Hi Tejun,

Could you please help take a look at this and give a confirmation?

Thanks,
Joseph

On 18/2/24 09:45, Joseph Qi wrote:
> Hi Tejun,
>
> On 18/2/23 22:23, Tejun Heo wrote:
>> Hello,
>>
>> On Fri, Feb 23, 2018 at 09:56:54AM +0800, xuejiufei wrote:
>>>> On Thu, Feb 22, 2018 at 02:14:34PM +0800, Joseph Qi wrote:
>>>>> I still don't get how css_tryget can work here.
>>>>>
>>>>> The race happens when:
>>>>> 1) writeback kworker has found the blkg with rcu;
>>>>> 2) blkcg is during offlining and blkg_destroy() has already been called.
>>>>> Then, writeback kworker will take queue lock and access the blkg with
>>>>> refcount 0.
>>>>
>>>> Yeah, then tryget would fail and it should go through the root.
>>>>
>>> In this race, the refcount of blkg becomes zero and is destroyed.
>>> However css may still have refcount, and css_tryget can return success
>>> before other callers put the refcount.
>>> So I don't get how css_tryget can fix this race? Or I wonder if we can
>>> add another function blkg_tryget?
>>
>> IIRC, as long as the blkcg and the device are there, the blkgs aren't
>> gonna be destroyed. So, if you have a ref to the blkcg through
>> tryget, the blkg shouldn't go away.
>>
>
> Maybe we have misunderstanding here.
>
> In this case, blkg doesn't go away as we have rcu protect, but
> blkg_destroy() can be called, in which blkg_put() will put the last
> refcnt and then schedule __blkg_release_rcu().
>
> css refcnt can't prevent blkcg css from offlining, instead it is css
> online_cnt.
>
> css_tryget() will only get a refcnt of blkcg css, but can't be
> guaranteed to fail when css is confirmed to kill.
>
> The race sequence:
> writeback kworker               cgroup_rmdir
>                                   cgroup_destroy_locked
>                                     kill_css
>                                       css_killed_ref_fn
>                                         css_killed_work_fn
>                                           offline_css
>                                             blkcg_css_offline
> blkcg_bio_issue_check
>   rcu_read_lock
>   blkg_lookup
>                                             spin_trylock(q->queue_lock)
>                                             blkg_destroy
>                                             spin_unlock(q->queue_lock)
>   blk_throtl_bio
>     spin_lock_irq(q->queue_lock)
>     spin_unlock_irq(q->queue_lock)
>   rcu_read_unlock
>
> Thanks,
> Joseph
Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir
Hi Tejun,

On 18/2/23 22:23, Tejun Heo wrote:
> Hello,
>
> On Fri, Feb 23, 2018 at 09:56:54AM +0800, xuejiufei wrote:
>>> On Thu, Feb 22, 2018 at 02:14:34PM +0800, Joseph Qi wrote:
>>>> I still don't get how css_tryget can work here.
>>>>
>>>> The race happens when:
>>>> 1) writeback kworker has found the blkg with rcu;
>>>> 2) blkcg is during offlining and blkg_destroy() has already been called.
>>>> Then, writeback kworker will take queue lock and access the blkg with
>>>> refcount 0.
>>>
>>> Yeah, then tryget would fail and it should go through the root.
>>>
>> In this race, the refcount of blkg becomes zero and is destroyed.
>> However css may still have refcount, and css_tryget can return success
>> before other callers put the refcount.
>> So I don't get how css_tryget can fix this race? Or I wonder if we can
>> add another function blkg_tryget?
>
> IIRC, as long as the blkcg and the device are there, the blkgs aren't
> gonna be destroyed. So, if you have a ref to the blkcg through
> tryget, the blkg shouldn't go away.
>
Maybe we have misunderstanding here.

In this case, blkg doesn't go away as we have rcu protect, but
blkg_destroy() can be called, in which blkg_put() will put the last
refcnt and then schedule __blkg_release_rcu().

css refcnt can't prevent blkcg css from offlining, instead it is css
online_cnt.

css_tryget() will only get a refcnt of blkcg css, but can't be
guaranteed to fail when css is confirmed to kill.

The race sequence:
writeback kworker               cgroup_rmdir
                                  cgroup_destroy_locked
                                    kill_css
                                      css_killed_ref_fn
                                        css_killed_work_fn
                                          offline_css
                                            blkcg_css_offline
blkcg_bio_issue_check
  rcu_read_lock
  blkg_lookup
                                            spin_trylock(q->queue_lock)
                                            blkg_destroy
                                            spin_unlock(q->queue_lock)
blk_throtl_bio
  spin_lock_irq(q->queue_lock)
  spin_unlock_irq(q->queue_lock)
rcu_read_unlock

Thanks,
Joseph
Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir
Hello,

On Fri, Feb 23, 2018 at 09:56:54AM +0800, xuejiufei wrote:
>> On Thu, Feb 22, 2018 at 02:14:34PM +0800, Joseph Qi wrote:
>>> I still don't get how css_tryget can work here.
>>>
>>> The race happens when:
>>> 1) writeback kworker has found the blkg with rcu;
>>> 2) blkcg is during offlining and blkg_destroy() has already been called.
>>> Then, writeback kworker will take queue lock and access the blkg with
>>> refcount 0.
>>
>> Yeah, then tryget would fail and it should go through the root.
>>
> In this race, the refcount of blkg becomes zero and is destroyed.
> However css may still have refcount, and css_tryget can return success
> before other callers put the refcount.
> So I don't get how css_tryget can fix this race? Or I wonder if we can
> add another function blkg_tryget?

IIRC, as long as the blkcg and the device are there, the blkgs aren't
gonna be destroyed. So, if you have a ref to the blkcg through
tryget, the blkg shouldn't go away.

Thanks.

--
tejun
Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir
Hi Tejun,

On 2018/2/22 11:18 PM, Tejun Heo wrote:
> Hello,
>
> On Thu, Feb 22, 2018 at 02:14:34PM +0800, Joseph Qi wrote:
>> I still don't get how css_tryget can work here.
>>
>> The race happens when:
>> 1) writeback kworker has found the blkg with rcu;
>> 2) blkcg is during offlining and blkg_destroy() has already been called.
>> Then, writeback kworker will take queue lock and access the blkg with
>> refcount 0.
>
> Yeah, then tryget would fail and it should go through the root.
>
In this race, the refcount of blkg becomes zero and is destroyed.
However css may still have refcount, and css_tryget can return success
before other callers put the refcount.
So I don't get how css_tryget can fix this race? Or I wonder if we can
add another function blkg_tryget?

Thanks,
Jiufei

> Thanks.
>
Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir
Hi Tejun,

Sorry for the delayed reply.

On 18/2/13 01:11, Tejun Heo wrote:
> Hello, Joseph.
>
> On Fri, Feb 09, 2018 at 10:15:19AM +0800, Joseph Qi wrote:
>> IIUC, we have to identify it is in blkcg_css_offline now which will
>> blkg_put. Since percpu_ref_kill_and_confirm in kill_css will set the flag
>> __PERCPU_REF_DEAD, we can use this to avoid the race. IOW, if
>> __PERCPU_REF_DEAD is set now, we know blkcg css is offlining and
>> continuing to access blkg may risk double free. Thus we choose to skip
>> these ios.
>> I don't get how css_tryget works since it doesn't care about the flag
>> __PERCPU_REF_DEAD. Also css_tryget can't prevent blkcg_css from
>> offlining since the race happens while blkcg_css_offline is in progress.
>> Am I missing something here?
>
> Once marked dead, the ref is in atomic mode and css_tryget() would hit
> the atomic counter. Here, we don't care about the offlining and
> draining. A draining memcg can still have a lot of memory to be
> written back attached to it and we don't want to punt all of them to the
> root cgroup.

I still don't get how css_tryget can work here.

The race happens when:
1) writeback kworker has found the blkg with rcu;
2) blkcg is during offlining and blkg_destroy() has already been called.
Then, writeback kworker will take queue lock and access the blkg with
refcount 0.

So, I think we should either take the queue lock when looking up the blkg
if we want to throttle the ios during offline (the way this patch tries),
or use css_tryget_online() to skip further use of the risky blkg, which
means these ios won't be throttled either.

Thanks,
Joseph
Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir
Hello, Joseph.

On Fri, Feb 09, 2018 at 10:15:19AM +0800, Joseph Qi wrote:
> IIUC, we have to identify it is in blkcg_css_offline now which will
> blkg_put. Since percpu_ref_kill_and_confirm in kill_css will set the flag
> __PERCPU_REF_DEAD, we can use this to avoid the race. IOW, if
> __PERCPU_REF_DEAD is set now, we know blkcg css is offlining and
> continuing to access blkg may risk double free. Thus we choose to skip
> these ios.
> I don't get how css_tryget works since it doesn't care about the flag
> __PERCPU_REF_DEAD. Also css_tryget can't prevent blkcg_css from
> offlining since the race happens while blkcg_css_offline is in progress.
> Am I missing something here?

Once marked dead, the ref is in atomic mode and css_tryget() would hit
the atomic counter. Here, we don't care about the offlining and
draining. A draining memcg can still have a lot of memory to be
written back attached to it and we don't want to punt all of them to the
root cgroup.

Thanks.

--
tejun
Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir
Hi Tejun,

On 18/2/8 23:23, Tejun Heo wrote:
> Hello, Joseph.
>
> On Thu, Feb 08, 2018 at 10:29:43AM +0800, Joseph Qi wrote:
>> So you mean checking css->refcnt to prevent the further use of
>> blkg_get? I think it makes sense.
>
> Yes.
>
>> IMO, we should use css_tryget_online instead, and right after taking
>
> Not really. An offline css still can have a vast amount of IO to
> drain and write out.
>
IIUC, we have to identify it is in blkcg_css_offline now which will
blkg_put. Since percpu_ref_kill_and_confirm in kill_css will set the flag
__PERCPU_REF_DEAD, we can use this to avoid the race. IOW, if
__PERCPU_REF_DEAD is set now, we know blkcg css is offlining and
continuing to access blkg may risk double free. Thus we choose to skip
these ios.

I don't get how css_tryget works since it doesn't care about the flag
__PERCPU_REF_DEAD. Also css_tryget can't prevent blkcg_css from
offlining since the race happens while blkcg_css_offline is in progress.
Am I missing something here?

Thanks,
Joseph

>> queue_lock. Because there may be more uses of blkg_get in blk_throtl_bio
>> in the future. Actually it already has two now. One is in
>> blk_throtl_assoc_bio, and the other is in throtl_qnode_add_bio.
>> What do you think of this?
>
> As long as we avoid making the bypass paths expensive, whatever makes
> the code easier to read and maintain would be better. css ref ops are
> extremely cheap anyway.
>
> Thanks.
>
Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir
Hello, Joseph.

On Thu, Feb 08, 2018 at 10:29:43AM +0800, Joseph Qi wrote:
> So you mean checking css->refcnt to prevent the further use of
> blkg_get? I think it makes sense.

Yes.

> IMO, we should use css_tryget_online instead, and right after taking

Not really. An offline css still can have a vast amount of IO to
drain and write out.

> queue_lock. Because there may be more uses of blkg_get in blk_throtl_bio
> in the future. Actually it already has two now. One is in
> blk_throtl_assoc_bio, and the other is in throtl_qnode_add_bio.
> What do you think of this?

As long as we avoid making the bypass paths expensive, whatever makes
the code easier to read and maintain would be better. css ref ops are
extremely cheap anyway.

Thanks.

--
tejun
Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir
Hi Tejun,

Thanks very much for reviewing this patch.

On 18/2/8 05:38, Tejun Heo wrote:
> Hello, Joseph.
>
> On Wed, Feb 07, 2018 at 04:40:02PM +0800, Joseph Qi wrote:
>> writeback kworker
>> blkcg_bio_issue_check
>>   rcu_read_lock
>>   blkg_lookup
>>   <<< *race window*
>>   blk_throtl_bio
>>     spin_lock_irq(q->queue_lock)
>>     spin_unlock_irq(q->queue_lock)
>>   rcu_read_unlock
>>
>> cgroup_rmdir
>>   cgroup_destroy_locked
>>     kill_css
>>       css_killed_ref_fn
>>         css_killed_work_fn
>>           offline_css
>>             blkcg_css_offline
>>               spin_trylock(q->queue_lock)
>>               blkg_destroy
>>               spin_unlock(q->queue_lock)
>
> Ah, right. Thanks for spotting the bug.
>
>> Since rcu can only prevent blkg from being released while it is in use,
>> the blkg->refcnt can be decreased to 0 during blkg_destroy, which
>> schedules the blkg release.
>> Then trying to blkg_get in blk_throtl_bio will complain with a WARNING.
>> And then the corresponding blkg_put will schedule the blkg release again,
>> which results in a double free.
>> This race is introduced by commit ae1188963611 ("blkcg: consolidate blkg
>> creation in blkcg_bio_issue_check()"). Before this commit, it would look
>> up first and then try to lookup/create again with queue_lock held. So
>> revive this logic to fix the race.
>
> The change seems a bit drastic to me. Can't we do something like the
> following instead?
>
> blk_throtl_bio()
> {
>         ... non throttled cases ...
>
>         /* out-of-limit, queue to @tg */
>
>         /*
>          * We can look up and retry but the race window is tiny here.
>          * Just letting it through should be good enough.
>          */
>         if (!css_tryget(blkcg->css))
>                 goto out;
>
>         ... actual queueing ...
>         css_put(blkcg->css);
>         ...
> }

So you mean checking css->refcnt to prevent the further use of
blkg_get? I think it makes sense.

IMO, we should use css_tryget_online instead, and right after taking
queue_lock. Because there may be more uses of blkg_get in blk_throtl_bio
in the future. Actually it already has two now. One is in
blk_throtl_assoc_bio, and the other is in throtl_qnode_add_bio.
What do you think of this?

Thanks,
Joseph
Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir
Hello, Joseph.

On Wed, Feb 07, 2018 at 04:40:02PM +0800, Joseph Qi wrote:
> writeback kworker
> blkcg_bio_issue_check
>   rcu_read_lock
>   blkg_lookup
>   <<< *race window*
>   blk_throtl_bio
>     spin_lock_irq(q->queue_lock)
>     spin_unlock_irq(q->queue_lock)
>   rcu_read_unlock
>
> cgroup_rmdir
>   cgroup_destroy_locked
>     kill_css
>       css_killed_ref_fn
>         css_killed_work_fn
>           offline_css
>             blkcg_css_offline
>               spin_trylock(q->queue_lock)
>               blkg_destroy
>               spin_unlock(q->queue_lock)

Ah, right. Thanks for spotting the bug.

> Since rcu can only prevent blkg from being released while it is in use,
> the blkg->refcnt can be decreased to 0 during blkg_destroy, which
> schedules the blkg release.
> Then trying to blkg_get in blk_throtl_bio will complain with a WARNING.
> And then the corresponding blkg_put will schedule the blkg release again,
> which results in a double free.
> This race is introduced by commit ae1188963611 ("blkcg: consolidate blkg
> creation in blkcg_bio_issue_check()"). Before this commit, it would look
> up first and then try to lookup/create again with queue_lock held. So
> revive this logic to fix the race.

The change seems a bit drastic to me. Can't we do something like the
following instead?

blk_throtl_bio()
{
        ... non throttled cases ...

        /* out-of-limit, queue to @tg */

        /*
         * We can look up and retry but the race window is tiny here.
         * Just letting it through should be good enough.
         */
        if (!css_tryget(blkcg->css))
                goto out;

        ... actual queueing ...
        css_put(blkcg->css);
        ...
}

Thanks.

--
tejun