Re: dcache shrink list corruption?
On Tue, May 06, 2014 at 10:01:21AM -0700, Linus Torvalds wrote:
> On Tue, May 6, 2014 at 9:52 AM, Al Viro wrote:
> >
> > OK...  There's one more thing I would like to put there, if you are going to
> > be away for the week.  It has sat in -next for a while, and it could stay
> > there, except that there's a _lot_ of followups touching stuff all over the
> > tree and I'd obviously prefer those to go into subsystem trees.  Which
> > means inter-tree dependencies ;-/  Would you be OK if I included that one
> > into pull request?  It just turns kvfree() into inline and takes it to
> > mm.h, next to is_vmalloc_addr();
>
> Is there any particular reason for inlining this? I'd actually rather
> not, _especially_ if it means that a lot of "is_vmalloc_addr()" usage
> goes away and we may end up with this as a real interface.

OK, I can live with that.  dcache fixes + kvfree() (uninlined, exported by
mm/util.c) + posix_acl bugfix from hch.  Please, pull from

  git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus

Shortlog:

Al Viro (9):
      fix races between __d_instantiate() and checks of dentry flags
      fold d_kill() and d_free()
      fold try_prune_one_dentry()
      new helper: dentry_free()
      expand the call of dentry_lru_del() in dentry_kill()
      dentry_kill(): don't try to remove from shrink list
      don't remove from shrink list in select_collect()
      more graceful recovery in umount_collect()
      nick kvfree() from apparmor

Christoph Hellwig (1):
      posix_acl: handle NULL ACL in posix_acl_equiv_mode

Miklos Szeredi (1):
      dcache: don't need rcu in shrink_dentry_list()

Diffstat:
 fs/dcache.c                          | 318 ---
 fs/namei.c                           |   6 +-
 fs/posix_acl.c                       |   6 ++
 include/linux/dcache.h               |   2 +
 include/linux/mm.h                   |   2 +
 mm/util.c                            |  10 +++
 security/apparmor/include/apparmor.h |   1 -
 security/apparmor/lib.c              |  14
 8 files changed, 125 insertions(+), 234 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: dcache shrink list corruption?
On Tue, May 6, 2014 at 9:52 AM, Al Viro wrote:
>
> OK...  There's one more thing I would like to put there, if you are going to
> be away for the week.  It has sat in -next for a while, and it could stay
> there, except that there's a _lot_ of followups touching stuff all over the
> tree and I'd obviously prefer those to go into subsystem trees.  Which
> means inter-tree dependencies ;-/  Would you be OK if I included that one
> into pull request?  It just turns kvfree() into inline and takes it to
> mm.h, next to is_vmalloc_addr();

Is there any particular reason for inlining this? I'd actually rather
not, _especially_ if it means that a lot of "is_vmalloc_addr()" usage
goes away and we may end up with this as a real interface.

But no, I don't hate the patch.

               Linus
Re: dcache shrink list corruption?
On Tue, May 06, 2014 at 07:53:19AM -0700, Linus Torvalds wrote:
> On Tue, May 6, 2014 at 3:17 AM, Miklos Szeredi wrote:
> >
> > Patches look okay to me.
> >
> > Reviewed-by: Miklos Szeredi
> >
> >> dentry_kill(): don't try to remove from shrink list
> >
> > Backport of this to 3.12 was tested by IBM and apparently fixes the
> > issue for them (I didn't backport the cleanup patches only the actual
> > fix)
>
> Ok, good.
>
> >> don't remove from shrink list in select_collect()
> >
> > I've also asked them to test this, although I think this is even
> > harder to trigger.  But at least the non-racy codepaths need to be
> > tested.
>
> I'll be incommunicado all next week, so I think I should merge this
> all now rather than later.
>
> Al, mind doing a real pull request?

OK...  There's one more thing I would like to put there, if you are going to
be away for the week.  It has sat in -next for a while, and it could stay
there, except that there's a _lot_ of followups touching stuff all over the
tree and I'd obviously prefer those to go into subsystem trees.  Which
means inter-tree dependencies ;-/  Would you be OK if I included that one
into pull request?  It just turns kvfree() into inline and takes it to
mm.h, next to is_vmalloc_addr(); we have *lots* of open-coded instances
of that all over the place.  As the matter of fact, more than a half of
is_vmalloc_addr() call sites are of that sort...

commit 3c91dc1ce40f973bbceded3d8ce96bda7c4d480c
Author: Al Viro
Date:   Wed Apr 23 10:13:03 2014 -0400

    nick kvfree() from apparmor/lib

    too many open-coded instances

    Signed-off-by: Al Viro

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf9811e..a784964 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -370,6 +370,17 @@ static inline int is_vmalloc_or_module_addr(const void *x)
 }
 #endif
 
+static inline void kvfree(const void *x)
+{
+	/* include order mess... */
+	extern void kfree(const void *);
+	extern void vfree(const void *);
+	if (is_vmalloc_addr(x))
+		vfree(x);
+	else
+		kfree(x);
+}
+
 static inline void compound_lock(struct page *page)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/security/apparmor/include/apparmor.h b/security/apparmor/include/apparmor.h
index 8fb1488..97130f8 100644
--- a/security/apparmor/include/apparmor.h
+++ b/security/apparmor/include/apparmor.h
@@ -66,7 +66,6 @@ extern int apparmor_initialized __initdata;
 char *aa_split_fqname(char *args, char **ns_name);
 void aa_info_message(const char *str);
 void *__aa_kvmalloc(size_t size, gfp_t flags);
-void kvfree(void *buffer);
 
 static inline void *kvmalloc(size_t size)
 {
diff --git a/security/apparmor/lib.c b/security/apparmor/lib.c
index 6968992..c1827e0 100644
--- a/security/apparmor/lib.c
+++ b/security/apparmor/lib.c
@@ -104,17 +104,3 @@ void *__aa_kvmalloc(size_t size, gfp_t flags)
 	}
 	return buffer;
 }
-
-/**
- * kvfree - free an allocation do by kvmalloc
- * @buffer: buffer to free (MAYBE_NULL)
- *
- * Free a buffer allocated by kvmalloc
- */
-void kvfree(void *buffer)
-{
-	if (is_vmalloc_addr(buffer))
-		vfree(buffer);
-	else
-		kfree(buffer);
-}
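The helper being moved is small enough to model outside the kernel.  A minimal
userspace sketch of the same dispatch follows; the `model_` names and the flag
standing in for is_vmalloc_addr() are illustrative only, not kernel API:

```c
#include <assert.h>

/* Userspace model of the kvfree() dispatch the patch centralizes: one
 * predicate picks the matching free routine.  In the kernel the predicate
 * is is_vmalloc_addr(), which tests whether the pointer lies in the
 * vmalloc range; here it is modelled by an explicit flag. */
static int model_vfree_calls, model_kfree_calls;

static void model_vfree(const void *x) { (void)x; model_vfree_calls++; }
static void model_kfree(const void *x) { (void)x; model_kfree_calls++; }

static void model_kvfree(const void *x, int is_vmalloc)
{
	if (is_vmalloc)
		model_vfree(x);		/* vmalloc'ed: hand back to vfree() */
	else
		model_kfree(x);		/* kmalloc'ed: hand back to kfree() */
}
```

The point of the patch is exactly that this four-line branch had been
open-coded at many call sites, and the thread settles on exporting it once
from mm/util.c instead of inlining it in mm.h.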
Re: dcache shrink list corruption?
On Tue, May 6, 2014 at 3:17 AM, Miklos Szeredi wrote:
>
> Patches look okay to me.
>
> Reviewed-by: Miklos Szeredi
>
>> dentry_kill(): don't try to remove from shrink list
>
> Backport of this to 3.12 was tested by IBM and apparently fixes the
> issue for them (I didn't backport the cleanup patches only the actual
> fix)

Ok, good.

>> don't remove from shrink list in select_collect()
>
> I've also asked them to test this, although I think this is even
> harder to trigger.  But at least the non-racy codepaths need to be
> tested.

I'll be incommunicado all next week, so I think I should merge this
all now rather than later.

Al, mind doing a real pull request?

               Linus
Re: dcache shrink list corruption?
On Sun, May 4, 2014 at 8:29 AM, Al Viro wrote:
> On Sat, May 03, 2014 at 07:21:11PM +0100, Al Viro wrote:
>> On Sat, May 03, 2014 at 05:26:04AM +0100, Al Viro wrote:
>>
>> > See vfs.git#dentry_kill-3; warning - this is completely untested and I would
>> > really like comments on spinning case there (i.e. the one where select_collect()
>> > finds some stuff already on some other shrink list and nothing with zero
>> > refcount that wouldn't be there).  In that case (and it's basically "somebody
>> > else is evicting stuff in our subtree and they'd already picked everything
>> > we want evicted") I just let the loop in check_submounts_and_drop() repeat
>> > (we do have cond_resched() there).  Any better suggestions would be welcome...
>>
>> Hmm...  As the matter of fact, the whole shrink_dcache_for_umount() could
>> be made a lot saner.  What we need is to reuse shrink_dcache_parent()
>> and follow it with d_walk() that would just go through whatever remains and
>> complain about the leaves of that.  For anon roots we'll obviously need to
>> wrap that into dget and d_drop/dput.
>>
>> I'm testing that right now; everything seems to be working so far and if
>> it survives, I'll push that sucker out.  Total since the beginning of
>> the whole series:
>>  fs/dcache.c            | 310 -
>>  include/linux/dcache.h |   2 +
>>  2 files changed, 112 insertions(+), 200 deletions(-)
>
> No regressions compared to mainline; force-pushed into vfs#dentry_kill-3.
> Review and testing would be very welcome...  Summary for that branch:
> (it's *not* a pull request yet; the thing really needs review)

Patches look okay to me.

Reviewed-by: Miklos Szeredi

> dentry_kill(): don't try to remove from shrink list

Backport of this to 3.12 was tested by IBM and apparently fixes the
issue for them (I didn't backport the cleanup patches only the actual
fix)

> don't remove from shrink list in select_collect()

I've also asked them to test this, although I think this is even
harder to trigger.  But at least the non-racy codepaths need to be
tested.

Thanks,
Miklos
Re: dcache shrink list corruption?
On Sat, May 03, 2014 at 07:21:11PM +0100, Al Viro wrote:
> On Sat, May 03, 2014 at 05:26:04AM +0100, Al Viro wrote:
>
> > See vfs.git#dentry_kill-3; warning - this is completely untested and I would
> > really like comments on spinning case there (i.e. the one where select_collect()
> > finds some stuff already on some other shrink list and nothing with zero
> > refcount that wouldn't be there).  In that case (and it's basically "somebody
> > else is evicting stuff in our subtree and they'd already picked everything
> > we want evicted") I just let the loop in check_submounts_and_drop() repeat
> > (we do have cond_resched() there).  Any better suggestions would be welcome...
>
> Hmm...  As the matter of fact, the whole shrink_dcache_for_umount() could
> be made a lot saner.  What we need is to reuse shrink_dcache_parent()
> and follow it with d_walk() that would just go through whatever remains and
> complain about the leaves of that.  For anon roots we'll obviously need to
> wrap that into dget and d_drop/dput.
>
> I'm testing that right now; everything seems to be working so far and if
> it survives, I'll push that sucker out.  Total since the beginning of
> the whole series:
>  fs/dcache.c            | 310 -
>  include/linux/dcache.h |   2 +
>  2 files changed, 112 insertions(+), 200 deletions(-)

No regressions compared to mainline; force-pushed into vfs#dentry_kill-3.
Review and testing would be very welcome...  Summary for that branch:
(it's *not* a pull request yet; the thing really needs review)

Shortlog:

Al Viro (8):
      fix races between __d_instantiate() and checks of dentry flags
      fold d_kill() and d_free()
      fold try_prune_one_dentry()
      new helper: dentry_free()
      expand the call of dentry_lru_del() in dentry_kill()
      dentry_kill(): don't try to remove from shrink list
      don't remove from shrink list in select_collect()
      more graceful recovery in umount_collect()

Miklos Szeredi (1):
      dcache: don't need rcu in shrink_dentry_list()

Diffstat:
 fs/dcache.c            | 318 +
 fs/namei.c             |   6 +-
 include/linux/dcache.h |   2 +
 3 files changed, 107 insertions(+), 219 deletions(-)
Re: dcache shrink list corruption?
On Sat, May 03, 2014 at 11:07:57AM -0700, Linus Torvalds wrote:
> Sure, umount itself should be serialized by the sb lock, so there
> should be only one umount dentry collector. But why wouldn't there be
> shrinkers active due to memory pressure?
>
> generic_unmount_super() is called by ->kill_sb(), which is done
> *before* the superblock shrinker is unregistered. So any memory
> pressure during that will cause dentries to be shrunk other ways.
>
> What am I missing?

This:
	if (!grab_super_passive(sb))
		return SHRINK_STOP;
before calling prune_dcache_sb().  grab_super_passive() returns with
->s_umount held shared on success (with down_read_trylock()) and
->kill_sb() is called only with ->s_umount held exclusive.
Re: dcache shrink list corruption?
On Sat, May 03, 2014 at 05:26:04AM +0100, Al Viro wrote:
> See vfs.git#dentry_kill-3; warning - this is completely untested and I would
> really like comments on spinning case there (i.e. the one where select_collect()
> finds some stuff already on some other shrink list and nothing with zero
> refcount that wouldn't be there).  In that case (and it's basically "somebody
> else is evicting stuff in our subtree and they'd already picked everything
> we want evicted") I just let the loop in check_submounts_and_drop() repeat
> (we do have cond_resched() there).  Any better suggestions would be welcome...

Hmm...  As the matter of fact, the whole shrink_dcache_for_umount() could
be made a lot saner.  What we need is to reuse shrink_dcache_parent()
and follow it with d_walk() that would just go through whatever remains and
complain about the leaves of that.  For anon roots we'll obviously need to
wrap that into dget and d_drop/dput.

I'm testing that right now; everything seems to be working so far and if
it survives, I'll push that sucker out.  Total since the beginning of
the whole series:
 fs/dcache.c            | 310 -
 include/linux/dcache.h |   2 +
 2 files changed, 112 insertions(+), 200 deletions(-)

Anyway, I'm off to finish with acct.c races, then it's time for Eric's
lazy umount on unlink series; this work is getting unpleasantly close to
the areas he's touching, so I'd better get it on top of this stuff and
into -next...
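The restructuring Al describes can be sketched as pseudocode.  The helper name
and the exact d_walk() callback wiring are assumptions; the real code lives in
fs/dcache.c:

```
shrink_dcache_for_umount_one(root):
	shrink_dcache_parent(root)	# evict the zero-refcount subtree
	d_walk(root, umount_check)	# whatever survived is a leak: complain

shrink_dcache_for_umount(sb):
	shrink_dcache_for_umount_one(sb->s_root)
	for each anon root d:		# disconnected (NFS-handle) roots
		dget(d)			# pin it while we walk
		shrink_dcache_for_umount_one(d)
		d_drop(d)
		dput(d)
```

The dget/d_drop/dput wrapping for anon roots is needed because, unlike
sb->s_root, those dentries are not otherwise pinned across the walk.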
Re: dcache shrink list corruption?
On Fri, May 2, 2014 at 9:26 PM, Al Viro wrote:
>
> See vfs.git#dentry_kill-3; warning - this is completely untested and I would
> really like comments on spinning case there (i.e. the one where select_collect()
> finds some stuff already on some other shrink list and nothing with zero
> refcount that wouldn't be there).

I'm not seeing why you say that there can be no other shrinkers active
during umount (that whole "buggered" thing).

Sure, umount itself should be serialized by the sb lock, so there
should be only one umount dentry collector. But why wouldn't there be
shrinkers active due to memory pressure?

generic_unmount_super() is called by ->kill_sb(), which is done
*before* the superblock shrinker is unregistered. So any memory
pressure during that will cause dentries to be shrunk other ways.

What am I missing?

               Linus
Re: dcache shrink list corruption?
On Fri, May 02, 2014 at 11:40:22PM +0100, Al Viro wrote:
> On Fri, May 02, 2014 at 02:18:43PM -0700, Linus Torvalds wrote:
> > On Fri, May 2, 2014 at 2:08 PM, Miklos Szeredi wrote:
> > > There's more of the "delete from shrink list not owned by us" in select parent.
> > > Proposed patch appended.
> >
> > Ahh. Clearly this needs more work before I pull.
>
> *nod*
>
> Besides, I want to put Miklos' "don't bother with RCU in shrink_dentry_list()"
> in there as soon as select_collect() has been dealt with.  I don't think
> that the currently posted patch for select_collect() is right, though -
> see my reply to parent posting.  Basically, I think we should treat "it's
> on the shrink list already" as "increment data->found and keep going".  IOW,
> 	if (on shrink list) {
> 		data->found++;
> 	} else {
> 		if (on lru list)
> 			d_lru_del
> 		if (refcount is zero) {
> 			d_shrink_add
> 			data->found++;
> 		}
> 	}
> 	if (data->found)
> 		ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;

See vfs.git#dentry_kill-3; warning - this is completely untested and I would
really like comments on spinning case there (i.e. the one where select_collect()
finds some stuff already on some other shrink list and nothing with zero
refcount that wouldn't be there).  In that case (and it's basically "somebody
else is evicting stuff in our subtree and they'd already picked everything
we want evicted") I just let the loop in check_submounts_and_drop() repeat
(we do have cond_resched() there).  Any better suggestions would be welcome...
Re: dcache shrink list corruption?
On Fri, May 02, 2014 at 11:40:22PM +0100, Al Viro wrote:
> On Fri, May 02, 2014 at 02:18:43PM -0700, Linus Torvalds wrote:
> > On Fri, May 2, 2014 at 2:08 PM, Miklos Szeredi wrote:
> > > There's more of the "delete from shrink list not owned by us" in
> > > select_parent. Proposed patch appended.
> >
> > Ahh. Clearly this needs more work before I pull.
>
> *nod*
>
> Besides, I want to put Miklos' "don't bother with RCU in
> shrink_dentry_list()" in there as soon as select_collect() has been
> dealt with. I don't think that the currently posted patch for
> select_collect() is right, though - see my reply to the parent posting.
> Basically, I think we should treat "it's on the shrink list already" as
> "increment data->found and keep going". IOW,
> 	if (on shrink list) {
> 		data->found++;
> 	} else {
> 		if (on lru list)
> 			d_lru_del
> 		if (refcount is zero) {
> 			d_shrink_add
> 			data->found++;
> 		}
> 	}
> 	if (data->found)
> 		ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;

While we are at it - BUG() in umount_collect() is probably a bad idea. At
that point we are holding ->s_umount, so it guarantees that a lot of stuff
from that point on will get stuck. Starting with sync(2). And I really doubt
that the damage from WARN() instead will be more...
Re: dcache shrink list corruption?
On Fri, May 02, 2014 at 02:18:43PM -0700, Linus Torvalds wrote:
> On Fri, May 2, 2014 at 2:08 PM, Miklos Szeredi wrote:
> > There's more of the "delete from shrink list not owned by us" in
> > select_parent. Proposed patch appended.
>
> Ahh. Clearly this needs more work before I pull.

*nod*

Besides, I want to put Miklos' "don't bother with RCU in
shrink_dentry_list()" in there as soon as select_collect() has been dealt
with. I don't think that the currently posted patch for select_collect()
is right, though - see my reply to the parent posting. Basically, I think
we should treat "it's on the shrink list already" as "increment
data->found and keep going". IOW,
	if (on shrink list) {
		data->found++;
	} else {
		if (on lru list)
			d_lru_del
		if (refcount is zero) {
			d_shrink_add
			data->found++;
		}
	}
	if (data->found)
		ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;
Re: dcache shrink list corruption?
On Fri, May 02, 2014 at 11:08:13PM +0200, Miklos Szeredi wrote:
> There's more of the "delete from shrink list not owned by us" in
> select_parent. Proposed patch appended.

While it certainly looks like dentry_lru_del() should die, I really wonder
if "let's pretend that dentry isn't there if it's on some other shrink
list" is the right approach. You've already noticed one problem
(check-and-drop giving false busy indications), but shrink_dcache_parent()
has similar issues.

How about incrementing data->found instead? That'll end up rescanning the
tree in case it's not ours; so what? If it's another process doing the same
shrinking in a subtree, we want to let it do its job anyway. Sure, somebody
doing mount -o remount in a loop might be able to stall the living hell out
of us, for as long as new non-busy dentries are being added in our subtree,
but the second part in itself is sufficient; we will keep picking those new
non-busy dentries as long as they keep coming. And if they do not,
eventually they will all be taken out by us or by those other shrinkers and
we'll be done.

> And I'm not sure what umount_collect() is supposed to do. Can other
> shrinkers still be active at that point? That would present other
> problems, no?

No other shrinkers - prune_dcache_sb() and shrink_dcache_sb() are also
serialized by ->s_umount; shrink_dcache_parent() and
check_submounts_and_drop() are called only when an active reference is
held, which obviously prevents fs shutdown.

> Also somewhat related is the question: how check_submounts_and_drop()
> could be guaranteed correctness (timely removal of all unused dentries)
> in the presence of other shrinkers?

Another interesting question is what the hell the shrink_dcache_parent()
callers are expecting. E.g. what's going on in fuse_reverse_inval_entry()?
And what is nilfs_tree_is_busy() about?

FWIW, I'm not at all sure that vfs_rmdir() and vfs_rename() have any reason
to call it these days, and the dentry_unhash() one simply ought to die -
it's used only by hpfs unlink() in case it wants to truncate in order to
work around -ENOSPC. And _that_ won't have any subdirectories to deal with
anyway, so shrink_dcache_parent() there is a no-op, even if we keep the
damn helper alive. The rest of the callers also look dubious...
Re: dcache shrink list corruption?
On Fri, May 2, 2014 at 2:08 PM, Miklos Szeredi wrote:
> There's more of the "delete from shrink list not owned by us" in
> select_parent. Proposed patch appended.

Ahh. Clearly this needs more work before I pull.

            Linus
Re: dcache shrink list corruption?
There's more of the "delete from shrink list not owned by us" in
select_parent. Proposed patch appended.

And I'm not sure what umount_collect() is supposed to do. Can other
shrinkers still be active at that point? That would present other
problems, no?

Also somewhat related is the question: how check_submounts_and_drop()
could be guaranteed correctness (timely removal of all unused dentries)
in the presence of other shrinkers?

Thanks,
Miklos

From: Miklos Szeredi
Subject: dcache: select_collect(): don't remove from shrink list

Shrink lists are not protected by any lock, so don't remove from an
unknown one.

Signed-off-by: Miklos Szeredi
---
 fs/dcache.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1241,18 +1241,19 @@ static enum d_walk_ret select_collect(vo
 	 * loop in shrink_dcache_parent() might not make any progress
 	 * and loop forever.
 	 */
-	if (dentry->d_lockref.count) {
-		dentry_lru_del(dentry);
-	} else if (!(dentry->d_flags & DCACHE_SHRINK_LIST)) {
-		/*
-		 * We can't use d_lru_shrink_move() because we
-		 * need to get the global LRU lock and do the
-		 * LRU accounting.
-		 */
+	if ((dentry->d_flags & (DCACHE_SHRINK_LIST | DCACHE_LRU_LIST)) ==
+	    DCACHE_LRU_LIST) {
 		d_lru_del(dentry);
+		if (!dentry->d_lockref.count) {
+			/*
+			 * We can't use d_lru_shrink_move() because we
+			 * need to get the global LRU lock and do the
+			 * LRU accounting.
+			 */
+			d_shrink_add(dentry, &data->dispose);
+			data->found++;
+			ret = D_WALK_NORETRY;
+		}
 	}
 	/*
 	 * We can return to the caller if we have found some (this
Re: dcache shrink list corruption?
On Thu, May 1, 2014 at 7:34 AM, Al Viro wrote:
>
> OK, fixed and pushed (both branches).

Al, can you send a real pull request (the "both branches" part in
particular makes me worry about which one you think is right), because I
suspect by now we just need to get this wider testing.

            Linus
Re: dcache shrink list corruption?
On Fri, May 02, 2014 at 11:00:04AM +0200, Szeredi Miklos wrote: > The bug is private, but I'll ask if I can repost it. The first thing > is a a warning from the D_FLAG_VERIFY() in d_shrink_del() from > shrink_dentry_list(). They added printks that show that the dentry > has DCACHE_DENTRY_KILLED. > > We could ask for a dump, but this is the only rational explanation I > could find for this (and a shrink list with two dentries, with racing > dput on both nicely explains the case where the shrink list's prev > pointer still points to the already killed dentry). Bug details: == WARNING: at /home/abuild/rpmbuild/BUILD/kernel-default-3.12.15/linux-3.12/fs/dcache.c:392 Modules linked in: iptable_filter ip_tables x_tables rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd sunrpc fscache af_packet xfs libcrc32c dm_mod autofs4 btrfs xor sr_mod cdrom usb_storage raid6_pq sd_mod crc_t10dif ibmvscsi(X) scsi_transport_srp scsi_tgt ibmveth(X) ohci_pci ohci_hcd ehci_hcd usbcore usb_common sg scsi_mod Supported: Yes CPU: 7 PID: 25685 Comm: host01 Tainted: G X 3.12.15-3-default #1 task: c003f7fb68f0 ti: c003f31f4000 task.ti: c003f31f4000 NIP: c024552c LR: c02462b8 CTR: c02aad40 REGS: c003f31f7530 TRAP: 0700 Tainted: G X (3.12.15-3-default) MSR: 800100029033 CR: 24242448 XER: 2000 SOFTE: 1 CFAR: c02454b0 GPR00: c02462b8 c003f31f77b0 c0f10fb0 c003dfa7f028 GPR04: 0001 c0e4fe00 0600 GPR08: c0e30fb0 0001 GPR12: 24242442 cfe41880 GPR16: 010013556de0 6458 0002 GPR20: c084e940 0004 c003f8afa400 GPR24: 6457 c003fde09e68 GPR28: c003dfa7f028 c003dfa7f028 c003f31f7948 c003dfa7f0a8 NIP [c024552c] .d_shrink_del+0x9c/0xc0 LR [c02462b8] .shrink_dentry_list+0xc8/0x160 PACATMSCRATCH [80019033] Call Trace: [c003f31f77b0] [c003f8f1ef80] 0xc003f8f1ef80 (unreliable) [c003f31f7820] [c02462b8] .shrink_dentry_list+0xc8/0x160 [c003f31f78d0] [c0246754] .shrink_dcache_parent+0x44/0xa0 [c003f31f7980] [c02b0a0c] .proc_flush_task+0xbc/0x1f0 [c003f31f7a70] [c008e554] .release_task+0x94/0x530 [c003f31f7b50] 
[c008efb8] .wait_task_zombie+0x5c8/0x750
[c003f31f7c10] [c008fa80] .do_wait+0x120/0x2c0
[c003f31f7cd0] [c0091040] .SyS_wait4+0x90/0x130
[c003f31f7dc0] [c00910fc] .SyS_waitpid+0x1c/0x30
[c003f31f7e30] [c0009dfc] syscall_exit+0x0/0x7c
Instruction dump:
7d09502a 3908 7d09512a 4bdcaf19 6000 38210070 e8010010 7c0803a6
4e800020 3d02fff2 8928dfd6 69290001 <0b09> 2fa9 41feff80 3921
---[ end trace 56e8481827564dc3 ]---
==

I forgot to mention that the TCP test used in the recreates is from the LTP
suite from http://ltp.sourceforge.net
It has been observed that each time we run into it so far, the host01 test
is the one running.

==

After installing the kernel with some conditional printk statements in it,
the condition hit and I caught a glimpse of it. I then dropped the system
into xmon to dump out some further info:

When we first detect that the dentry->d_flags is missing these flags
DCACHE_SHRINK_LIST | DCACHE_LRU_LIST we start to dump some information
about the dentry from within shrink_dentry_list() and got:

dentry c003d6240ce8 name 5 lockcnt -128 flags 1048780
post-shrink-del: dentry c003d6240ce8 name 5 lockcnt -128
dentry c003d6240ce8 lockcnt -128
dentry c003d6240ce8 name 5 lockcnt -128 flags 1048780
post-shrink-del: dentry c003d6240ce8 name 5 lockcnt -128
dentry c003d6240ce8 lockcnt -128
dentry c003d6240ce8 name 5 lockcnt -128 flags 1048780

and this repeats over and over...

The code (with printks) looks like this:

899         for (;;) {
900                 dentry = list_entry_rcu(list->prev, struct dentry, d_lru);
901                 if (&dentry->d_lru == list)
902                         break; /* empty */
903
904                 if (print_more)
905                         printk(KERN_WARNING "dentry %p lockcnt %d\n",
906                                dentry, dentry->d_lockref.count);
907
908                 /*
909                  * Get the dentry lock, and re-verify that the dentry is
910                  * this on the shrinking list. If it is, we know that
911                  * DCACHE_SHRINK_LIST and DCACHE_LRU_LIST are set.
912                  */
Re: dcache shrink list corruption?
On Fri, May 2, 2014 at 7:51 AM, Al Viro wrote:
> On Wed, Apr 30, 2014 at 11:15:15AM +0200, Miklos Szeredi wrote:
>> IBM is triggering this with the host01 test from the LTP suite. I
>> haven't yet tried to reproduce it.
>
> Could you repost their report? So far I hadn't managed to get actual
> list corruption on mainline kernel - it definitely is possible, but on
> all testcases so far the race window is too narrow. So if they have a
> reproducer that doesn't take an insane amount of time, I'd really like
> to see it...

The only reproducer they have is LTP/host01, and it can trigger the bug
reliably within hours.

The bug is private, but I'll ask if I can repost it. The first thing is a
warning from the D_FLAG_VERIFY() in d_shrink_del() from
shrink_dentry_list(). They added printks that show that the dentry has
DCACHE_DENTRY_KILLED.

We could ask for a dump, but this is the only rational explanation I could
find for this (and a shrink list with two dentries, with racing dput on
both, nicely explains the case where the shrink list's prev pointer still
points to the already killed dentry).

Thanks,
Miklos
Re: dcache shrink list corruption?
On Thu, May 1, 2014 at 4:34 PM, Al Viro wrote:
> On Thu, May 01, 2014 at 11:42:52AM +0200, Miklos Szeredi wrote:
>> - "bool foo = flag & FLAG" looks suspicious. Is this guaranteed not to
>> overflow?
>
> What do you mean, overflow? It's not a 1-bit unsigned int; conversion to
> _Bool is different (which is the only reason why it's more than mere
> syntax sugar). See C99 6.3.2.1 ("When any scalar value is converted
> to _Bool, the result is 0 if the value compares equal to 0; otherwise,
> the result is 1").

Ah, didn't know that.

Thanks,
Miklos
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 11:15:15AM +0200, Miklos Szeredi wrote:
> On Tue, Apr 29, 2014 at 07:56:13PM -0700, Linus Torvalds wrote:
> > On Tue, Apr 29, 2014 at 7:31 PM, Al Viro wrote:
> > >
> > > OK, aggregate diff follows, more readable splitup (3 commits)
> > > attached. It seems to survive beating here; testing, review and
> > > comments are welcome.
> >
> > Miklos, did you have some particular load that triggered this, or was
> > it just some reports? It would be really good to get this patch some
> > stress-testing.
> >
> > I like how the patch removes more lines than it adds, but apart from
> > that it's hard to read the patch (even the split-out ones) and say
> > anything more about it. I think this needs a *lot* of testing.
>
> IBM is triggering this with the host01 test from the LTP suite. I
> haven't yet tried to reproduce it.

Could you repost their report? So far I haven't managed to get actual list
corruption on a mainline kernel - it definitely is possible, but on all
testcases so far the race window is too narrow. So if they have a
reproducer that doesn't take an insane amount of time, I'd really like to
see it...
Re: dcache shrink list corruption?
On Thu, May 1, 2014 at 2:05 PM, Al Viro wrote:
>
> ... and so has this one, the bounce being

I actually got all three messages despite the bounces. But I seem to have
gotten at least the first one through lkml.

Your email is lacking SPF or any other authentication, so it may be gmail
being pissy about that, though:

    Received-SPF: none (google.com: v...@ftp.linux.org.uk does not
    designate permitted sender hosts) client-ip=195.92.253.2;

especially if your IP is in a block where other people send spam, gmail
has been known to just say "that ISP is clearly bad".

            Linus
Re: dcache shrink list corruption?
On Thu, May 01, 2014 at 10:02:33PM +0100, Al Viro wrote: > *grumble* > > Looks like gmail filtering is buggered again. The previous mail got > bounced by gmail, both for Miklos and for Linus ;-/ Hell knows if > this one gets through; the original is at http://lkml.org/lkml/2014/5/1/235, > bounce being bloody vague, as usual. ... and so has this one, the bounce being

[quote]
A message that you sent could not be delivered to one or more of its recipients. This is a permanent error. The following address(es) failed:

torva...@linux-foundation.org
SMTP error from remote mail server after end of data: host ASPMX.L.GOOGLE.COM [173.194.67.27]: 550-5.7.1 [195.92.253.2 12] Our system has detected that this message is 550-5.7.1 likely unsolicited mail. To reduce the amount of spam sent to Gmail, 550-5.7.1 this message has been blocked. Please visit 550-5.7.1 http://support.google.com/mail/bin/answer.py?hl=en&answer=188131 for 550 5.7.1 more information. cv1si45381wib.71 - gsmtp

mik...@szeredi.hu
SMTP error from remote mail server after end of data: host ASPMX.L.GOOGLE.COM [173.194.67.27]: 550-5.7.1 [195.92.253.2 12] Our system has detected that this message is 550-5.7.1 likely unsolicited mail. To reduce the amount of spam sent to Gmail, 550-5.7.1 this message has been blocked. Please visit 550-5.7.1 http://support.google.com/mail/bin/answer.py?hl=en&answer=188131 for 550 5.7.1 more information. bb7si53528wib.20 - gsmtp
[endquote]

What the hell is going on?
Re: dcache shrink list corruption?
*grumble* Looks like gmail filtering is buggered again. The previous mail got bounced by gmail, both for Miklos and for Linus ;-/ Hell knows if this one gets through; the original is at http://lkml.org/lkml/2014/5/1/235, bounce being bloody vague, as usual. This is starting to get really annoying ;-/ Does anybody have any idea what might be going on? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dcache shrink list corruption?
On Thu, May 01, 2014 at 11:42:52AM +0200, Miklos Szeredi wrote: > Two points about latest version (dentry_kill-2): > > - Doing anything with dentry->d_parent in case of DCACHE_DENTRY_KILLED looks > seriously wrong. Parent has been dealt with, at that point, by the other > caller, no? In both branches, actually - we should bugger off earlier *and* return NULL in that case. Nice catch. Hmm... I see why it failed to blow up on that. It *did* trigger, all right - udev is stepping into that right on boot. The thing is, the check should be

	if ((int)dentry->d_lockref.count > 0)

since the damn thing is unsigned int. IOW, they did go through handover and shrink_dentry_list() proceeded to lose them. And with that braino fixed, it steps into the extra dput crap just fine. OK, fixed and pushed (both branches). > - "bool foo = flag & FLAG" looks suspicious. Is this guaranteed not to > overflow? What do you mean, overflow? It's not a 1-bit unsigned int; conversion to _Bool is different (which is the only reason why it's more than mere syntax sugar). See C99 6.3.2.1 ("When any scalar value is converted to _Bool, the result is 0 if the value compares equal to 0; otherwise, the result is 1"). That, BTW, is also the reason why _Bool bitfields exist - try

	struct { _Bool a:1; unsigned b:1; } x;
	x.a = 2;
	x.b = 2;
	if (x.a)
		printf("A");
	if (x.b)
		printf("B");

and see what it does. The first test triggers, same as if (2) would. The second does not, since conversion to unsigned integer type other than _Bool gives the value in range of that type comparable to original modulo (maximal representable + 1). And 2 modulo 2 is 0...
Re: dcache shrink list corruption?
Two points about latest version (dentry_kill-2): - Doing anything with dentry->d_parent in case of DCACHE_DENTRY_KILLED looks seriously wrong. Parent has been dealt with, at that point, by the other caller, no? - "bool foo = flag & FLAG" looks suspicious. Is this guaranteed not to overflow? Thanks, Miklos -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 07:59:43PM -0700, Linus Torvalds wrote: > On Wed, Apr 30, 2014 at 7:51 PM, Al Viro wrote: > > > > Help with profiling is needed; the loads to watch are > > the ones where dentry_kill() + dentry_free() are sufficiently high in > > profiles > > for the differences to matter. Any takers? > > I really hope there are no such loads, my "lock/unlock pairing" > suggestion was mostly so that the pairing is clearer, not necessarily > for performance. > > That said, I'd assume that it migth be worth testing at least the > "lots of concurrent lookups of 'simple_dentry_operations' dentries". > So most of /proc, most uses of simple_lookup(). That at least triggers > the dentry_kill() path in dput(), so it should be fairly easy to get > profiles. > > But real loads with real filesystems? That sounds harder. Well, the simplest way to do that is a bunch of open/unlink/close. Another one is cp -rl / rm -rf of a large tree (rmdir(2) does shrink_dcache_parent()). For that matter, unmapping an anon shared mapping will trigger the same path. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 7:51 PM, Al Viro wrote: > > Help with profiling is needed; the loads to watch are > the ones where dentry_kill() + dentry_free() are sufficiently high in profiles > for the differences to matter. Any takers? I really hope there are no such loads, my "lock/unlock pairing" suggestion was mostly so that the pairing is clearer, not necessarily for performance. That said, I'd assume that it might be worth testing at least the "lots of concurrent lookups of 'simple_dentry_operations' dentries". So most of /proc, most uses of simple_lookup(). That at least triggers the dentry_kill() path in dput(), so it should be fairly easy to get profiles. But real loads with real filesystems? That sounds harder. Linus
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 05:18:23PM -0700, Linus Torvalds wrote: > and then suddenly it looks like we have a common exit sequence from > that dentry_kill() function, no? > > (The earlier "unlock_on_failure" exit case is altogether a different case). > > I dunno. Maybe not a big deal, but one reason I prefer doing that > "free" flag is because I really tend to prefer the simple case of > lock-unlock pairing cleanly at the same level. NOT the pattern where > you have one lock at one indentation level, paired with multiple > unlocks for all the different cases. Yeah, but... I have such a variant, but the best I could get still generated code that wasn't particularly nice. Part might be gcc code generation sucking for bool, part - extra register pressure... It has slightly lower i-cache footprint, though, so it might be worth doing. Hell knows; that's a fairly hot codepath, so let's do it that way - I've just pushed an alternative branch with the bool can_free variant; branches in question: vfs.git#for-linus and vfs.git#dentry_kill-2. Help with profiling is needed; the loads to watch are the ones where dentry_kill() + dentry_free() are sufficiently high in profiles for the differences to matter. Any takers?
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 4:43 PM, Al Viro wrote: > > OK, done and force-pushed. Should propagate in a few... That made it more obvious how the DCACHE_MAY_FREE case ends up working. And in particular, mind rewriting this:

	if (dentry->d_flags & DCACHE_MAY_FREE) {
		spin_unlock(&dentry->d_lock);
		dentry_free(dentry);
	} else {
		spin_unlock(&dentry->d_lock);
	}
	return parent;

as just

	bool free = dentry->d_flags & DCACHE_MAY_FREE;
	spin_unlock(&dentry->d_lock);
	if (free)
		dentry_free(dentry);
	return parent;

instead? In fact, I get the feeling that the other case later on really fits the same model:

	spin_lock(&dentry->d_lock);
	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
		dentry->d_flags |= DCACHE_MAY_FREE;
		spin_unlock(&dentry->d_lock);
	} else {
		spin_unlock(&dentry->d_lock);
		dentry_free(dentry);
	}

ends up really being better as

	spin_lock(&dentry->d_lock);
	free = 1;
	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
		dentry->d_flags |= DCACHE_MAY_FREE;
		free = 0;
	}
	spin_unlock(&dentry->d_lock);
	if (free)
		dentry_free(dentry);
	return parent;

and then suddenly it looks like we have a common exit sequence from that dentry_kill() function, no? (The earlier "unlock_on_failure" exit case is altogether a different case). I dunno. Maybe not a big deal, but one reason I prefer doing that "free" flag is because I really tend to prefer the simple case of lock-unlock pairing cleanly at the same level. NOT the pattern where you have one lock at one indentation level, paired with multiple unlocks for all the different cases. Linus
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 04:14:14PM -0700, Linus Torvalds wrote: > On Wed, Apr 30, 2014 at 4:04 PM, Linus Torvalds > wrote: > > > > Let me go talk to the paravirt people. Maybe they don't need this, and > > I don't know exactly *how* they use that lock pointer after the unlock > > in the "kick waiters" part. Maybe it's all good. > > .. looking at current users (xen and kvm) it does in fact look all > good. Yes, we "kick" possible waiters after the unlock, but the lock > itself is not touched by that, it just uses the pointer to the lock as > a way to figure out who to kick. > > In fact, I kind of knew that, but had forgotten. We very much depend > on spin_unlock being safe wrt immediate deleton already: the > "completion" code very much depends on it. It does a "spin_unlock()" > to release the completion, and it can get reused immediately (it might > be on the stack, it might be in some data structure that gets freed). > > So I'd suggest removing that whole RCU thing, because it should be > safe to unlock something that can go away immediately. We'd have huge > problems elsewhere if it wasn't safe. OK, done and force-pushed. Should propagate in a few... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 04:04:58PM -0700, Linus Torvalds wrote: > On Wed, Apr 30, 2014 at 3:12 PM, Al Viro wrote: > > > > I've just pushed the whole thing to vfs.git#for-linus; > > review and testing would be very welcome. > > I have no half-way relevant test-case for this, so I'm hoping people > who have good VFS stress-tests (preferably under memory pressure so > that we get that whole shrinking path) will test. > > But it looks fine. > > That said, I do hate that RCU read-lock around the final spin-unlock. So do I, obviously... After looking through the rest of arch_spin_unlock(), it looks like the only suspicious one except the paravirt on x86 is blackfin. And that might be misreading - I'm not familiar enough with the architecture to tell... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 4:04 PM, Linus Torvalds wrote: > > Let me go talk to the paravirt people. Maybe they don't need this, and > I don't know exactly *how* they use that lock pointer after the unlock > in the "kick waiters" part. Maybe it's all good. .. looking at current users (xen and kvm) it does in fact look all good. Yes, we "kick" possible waiters after the unlock, but the lock itself is not touched by that, it just uses the pointer to the lock as a way to figure out who to kick. In fact, I kind of knew that, but had forgotten. We very much depend on spin_unlock being safe wrt immediate deletion already: the "completion" code very much depends on it. It does a "spin_unlock()" to release the completion, and it can get reused immediately (it might be on the stack, it might be in some data structure that gets freed). So I'd suggest removing that whole RCU thing, because it should be safe to unlock something that can go away immediately. We'd have huge problems elsewhere if it wasn't safe. Linus
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 3:12 PM, Al Viro wrote: > > I've just pushed the whole thing to vfs.git#for-linus; > review and testing would be very welcome. I have no half-way relevant test-case for this, so I'm hoping people who have good VFS stress-tests (preferably under memory pressure so that we get that whole shrinking path) will test. But it looks fine. That said, I do hate that RCU read-lock around the final spin-unlock. Let me go talk to the paravirt people. Maybe they don't need this, and I don't know exactly *how* they use that lock pointer after the unlock in the "kick waiters" part. Maybe it's all good. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 10:12:06PM +0100, Al Viro wrote: > On Wed, Apr 30, 2014 at 01:57:05PM -0700, Linus Torvalds wrote: > > On Wed, Apr 30, 2014 at 1:38 PM, Al Viro wrote: > > > > > > We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can > > > trigger any amount of IO, for one thing. We can take it around the > > > couple of places where we do that spin_unlock(&dentry->d_lock) (along with > > > setting DCACHE_RCUACCESS) - that's what I'd been referring to. > > > > Just the last spin_unlock() would be the case that matters, if the > > spin_unlock() is done on something that could be freed immediately and > > the lock protects and is inside the entity that gets freed. > > *nod* > > There are two such spin_unlock (handover from shrink_dentry_list() to > dput() and the opposite one), but they are all that needs protection - > ->d_flags update is outside the rcu-critical area. I really wonder > if we *can* get there without DCACHE_RCUACCESS having been set, though; > dentry would have to be > * picked into shrink list (i.e. have had zero refcount at some point) > * never had been through __d_rehash() > shrink_dentry_list() definitely counts on that being impossible, and it > probably is, but I'm feeling seriously paranoid about the whole area. > I'll finish grepping through the tree and probably drop setting > DCACHE_RCUACCESS from the patch - either that, or set it in d_shrink_add() > if it turns out that it is possible and shrink_dentry_list() is fucked... OK, it really can't happen. The proof is more convoluted than I'd like it, but it's solid enough, so setting that flag in dentry_kill() handover cases wasn't needed. I've just pushed the whole thing to vfs.git#for-linus; review and testing would be very welcome. I can repost it one more time, but the only difference compared to the last variant in this thread is not bothering with DCACHE_RCUACCESS. It has survived LTP tests, going through xfstests now...
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 01:57:05PM -0700, Linus Torvalds wrote: > On Wed, Apr 30, 2014 at 1:38 PM, Al Viro wrote: > > > > We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can > > trigger any amount of IO, for one thing. We can take it around the > > couple of places where we do that spin_unlock(&dentry->d_lock) (along with > > setting DCACHE_RCUACCESS) - that's what I'd been referring to. > > Just the last spin_unlock() would be the case that matters, if the > spin_unlock() is done on something that could be freed immediately and > the lock protects and is inside the entity that gets freed. *nod* There are two such spin_unlock (handover from shrink_dentry_list() to dput() and the opposite one), but they are all that needs protection - ->d_flags update is outside the rcu-critical area. I really wonder if we *can* get there without DCACHE_RCUACCESS having been set, though; dentry would have to be

* picked into shrink list (i.e. have had zero refcount at some point)
* never had been through __d_rehash()

shrink_dentry_list() definitely counts on that being impossible, and it probably is, but I'm feeling seriously paranoid about the whole area. I'll finish grepping through the tree and probably drop setting DCACHE_RCUACCESS from the patch - either that, or set it in d_shrink_add() if it turns out that it is possible and shrink_dentry_list() is fucked... Tests seem to be running fine so far... > > BTW, is there any convenient > > way to tell git commit --amend to update the commit date? Something > > like --date=now would be nice, but it isn't accepted... > > --date="$(date)" works. Thanks...
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 1:38 PM, Al Viro wrote: > > We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can > trigger any amount of IO, for one thing. We can take it around the > couple of places where we do that spin_unlock(&dentry->d_lock) (along with > setting DCACHE_RCUACCESS) - that's what I'd been referring to. Just the last spin_unlock() would be the case that matters, if the spin_unlock() is done on something that could be freed immediately and the lock protects and is inside the entity that gets freed. > BTW, is there any convenient > way to tell git commit --amend to update the commit date? Something > like --date=now would be nice, but it isn't accepted... --date="$(date)" works. The "--date" thing doesn't take the nice "git approxidate" strings, maybe it should. So you have to give it a "real" date. Linus
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 01:23:26PM -0700, Linus Torvalds wrote: > On Wed, Apr 30, 2014 at 12:59 PM, Al Viro wrote: > > > > Another thing: I don't like what's going on with freeing vs. ->d_lock there. > > Had that been a mutex, we'd definitely get a repeat of "vfs: fix subtle > > use-after-free of pipe_inode_info". The question is, can spin_unlock(p) > > dereference p after another CPU gets through spin_lock(p)? Linus? > > spin_unlock() *should* be safe wrt that issue. > > But I have to say, I think paravirtualized spinlocks may break that. > They do all kinds of "kick waiters" after releasing the lock. > > Doesn't the RCU protection solve that, though? Nobody should be > releasing the dentry under us, afaik.. We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can trigger any amount of IO, for one thing. We can take it around the couple of places where we do that spin_unlock(&dentry->d_lock) (along with setting DCACHE_RCUACCESS) - that's what I'd been referring to. Then this sucker (tests still running, so far everything seems to survive) becomes the following (again, on top of 1/6..4/6). BTW, is there any convenient way to tell git commit --amend to update the commit date? Something like --date=now would be nice, but it isn't accepted...

commit 797ff22681dc969b478ed837787d24dfd2dd2132
Author: Al Viro
Date: Tue Apr 29 23:52:05 2014 -0400

    dentry_kill(): don't try to remove from shrink list

    If the victim is on the shrink list, don't remove it from there. If
    shrink_dentry_list() manages to remove it from the list before we are
    done - fine, we'll just free it as usual. If not - mark it with new
    flag (DCACHE_MAY_FREE) and leave it there. Eventually,
    shrink_dentry_list() will get to it, remove the sucker from shrink
    list and call dentry_kill(dentry, 0). Which is where we'll deal with
    freeing.
Since now dentry_kill(dentry, 0) may happen after or during dentry_kill(dentry, 1), we need to recognize that (by seeing DCACHE_DENTRY_KILLED already set), unlock everything and either free the sucker (in case DCACHE_MAY_FREE has been set) or leave it for ongoing dentry_kill(dentry, 1) to deal with.

Signed-off-by: Al Viro

diff --git a/fs/dcache.c b/fs/dcache.c
index e482775..fa40d26 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -489,6 +489,20 @@ relock:
 		goto relock;
 	}
 
+	if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+		if (parent)
+			spin_unlock(&parent->d_lock);
+		if (dentry->d_flags & DCACHE_MAY_FREE) {
+			spin_unlock(&dentry->d_lock);
+			dentry_free(dentry);
+		} else {
+			dentry->d_flags |= DCACHE_RCUACCESS;
+			rcu_read_lock();
+			spin_unlock(&dentry->d_lock);
+			rcu_read_unlock();
+		}
+		return parent;
+	}
 	/*
 	 * The dentry is now unrecoverably dead to the world.
 	 */
@@ -504,8 +518,6 @@ relock:
 	if (dentry->d_flags & DCACHE_LRU_LIST) {
 		if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
 			d_lru_del(dentry);
-		else
-			d_shrink_del(dentry);
 	}
 	/* if it was on the hash then remove it */
 	__d_drop(dentry);
@@ -527,7 +539,16 @@ relock:
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
 
-	dentry_free(dentry);
+	spin_lock(&dentry->d_lock);
+	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		dentry->d_flags |= DCACHE_MAY_FREE | DCACHE_RCUACCESS;
+		rcu_read_lock();
+		spin_unlock(&dentry->d_lock);
+		rcu_read_unlock();
+	} else {
+		spin_unlock(&dentry->d_lock);
+		dentry_free(dentry);
+	}
 	return parent;
 }
@@ -829,7 +850,7 @@ static void shrink_dentry_list(struct list_head *list)
 	 * We found an inuse dentry which was not removed from
 	 * the LRU because of laziness during lookup. Do not free it.
 	 */
-	if (dentry->d_lockref.count) {
+	if (dentry->d_lockref.count > 0) {
 		spin_unlock(&dentry->d_lock);
 		continue;
 	}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3b9bfdb..3c7ec32 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -221,6 +221,8 @@ struct dentry_operations {
 #define DCACHE_SYMLINK_TYPE	0x0030 /* Symlink */
 #define DCACHE_FILE_TYPE	0x0040 /* Other file type */
 
+#define DCACHE_MAY_FREE	0x0080
+
 extern seqlock_t rename_lock;
 
 static inline int dname_external(const struct dentry *dentry)
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 12:59 PM, Al Viro wrote: > > Another thing: I don't like what's going on with freeing vs. ->d_lock there. > Had that been a mutex, we'd definitely get a repeat of "vfs: fix subtle > use-after-free of pipe_inode_info". The question is, can spin_unlock(p) > dereference p after another CPU gets through spin_lock(p)? Linus? spin_unlock() *should* be safe wrt that issue. But I have to say, I think paravirtualized spinlocks may break that. They do all kinds of "kick waiters" after releasing the lock. Doesn't the RCU protection solve that, though? Nobody should be releasing the dentry under us, afaik.. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 08:02:27PM +0100, Al Viro wrote:
> On Wed, Apr 30, 2014 at 08:42:01PM +0200, Miklos Szeredi wrote:
>
> > Message-ID: <20140430154958.gc3...@tucsk.piliscsaba.szeredi.hu>
>
> I see... Several points:
> 	* I still think that it's better to do handling of "we see
> DCACHE_DENTRY_KILLED already set" in dentry_kill() itself.
> 	* in dentry_kill(dentry, 0) we *know* that it's not on a shrink
> list - the caller has just removed it from there and we'd kept ->d_lock
> since that point. What's more, with that scheme we don't need to bother
> with can_free at all - just grab ->d_lock after ->d_release() call and check
> DCACHE_SHRINK_LIST. No sense trying to avoid that - in case where we
> could just go ahead and free the sucker, there's neither contention nor
> anybody else interested in that cacheline, so spin_lock(&dentry->d_lock)
> is pretty much free.
>
> IOW, I'd prefer to do the following (completely untested) on top of 1/6--4/6:

Sigh... One more problem:

		/*
		 * We found an inuse dentry which was not removed from
		 * the LRU because of laziness during lookup. Do not free it.
		 */
		if (dentry->d_lockref.count) {
			spin_unlock(&dentry->d_lock);
			continue;
		}

should become

		if (dentry->d_lockref.count > 0) {

instead - lockref_mark_dead() slaps -128 into it, so we'll just leak all dentries where dput() has come first and decided to leave the suckers for us.

Another thing: I don't like what's going on with freeing vs. ->d_lock there. Had that been a mutex, we'd definitely get a repeat of "vfs: fix subtle use-after-free of pipe_inode_info". The question is, can spin_unlock(p) dereference p after another CPU gets through spin_lock(p)? Linus?

It can be dealt with by setting DCACHE_RCUACCESS in "let another guy free it" cases and playing with rcu_read_lock a bit, but I wonder whether we need to bother - quick look through alpha/sparc/ppc shows that on those we do not and the same is true for non-paravirt case on x86. I hadn't checked what paravirt one does, though, and I certainly hadn't done exhaustive search for architectures doing something weird...
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 08:42:01PM +0200, Miklos Szeredi wrote:
> Message-ID: <20140430154958.gc3...@tucsk.piliscsaba.szeredi.hu>

I see... Several points:
	* I still think that it's better to do handling of "we see
DCACHE_DENTRY_KILLED already set" in dentry_kill() itself.
	* in dentry_kill(dentry, 0) we *know* that it's not on a shrink
list - the caller has just removed it from there and we'd kept ->d_lock
since that point. What's more, with that scheme we don't need to bother
with can_free at all - just grab ->d_lock after ->d_release() call and check
DCACHE_SHRINK_LIST. No sense trying to avoid that - in case where we
could just go ahead and free the sucker, there's neither contention nor
anybody else interested in that cacheline, so spin_lock(&dentry->d_lock)
is pretty much free.

IOW, I'd prefer to do the following (completely untested) on top of 1/6--4/6:

diff --git a/fs/dcache.c b/fs/dcache.c
index e482775..82d8eb4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -489,6 +489,17 @@ relock:
 		goto relock;
 	}
 
+	if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+		if (parent)
+			spin_unlock(&parent->d_lock);
+		if (dentry->d_flags & DCACHE_MAY_FREE) {
+			spin_unlock(&dentry->d_lock);
+			dentry_free(dentry);
+		} else {
+			spin_unlock(&dentry->d_lock);
+		}
+		return parent;
+	}
 	/*
 	 * The dentry is now unrecoverably dead to the world.
 	 */
@@ -504,8 +515,6 @@ relock:
 	if (dentry->d_flags & DCACHE_LRU_LIST) {
 		if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
 			d_lru_del(dentry);
-		else
-			d_shrink_del(dentry);
 	}
 	/* if it was on the hash then remove it */
 	__d_drop(dentry);
@@ -527,7 +536,14 @@ relock:
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
 
-	dentry_free(dentry);
+	spin_lock(&dentry->d_lock);
+	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		dentry->d_flags |= DCACHE_MAY_FREE;
+		spin_unlock(&dentry->d_lock);
+	} else {
+		spin_unlock(&dentry->d_lock);
+		dentry_free(dentry);
+	}
 	return parent;
 }
 
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3b9bfdb..3c7ec32 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -221,6 +221,8 @@ struct dentry_operations {
 #define DCACHE_SYMLINK_TYPE		0x00300000 /* Symlink */
 #define DCACHE_FILE_TYPE		0x00400000 /* Other file type */
 
+#define DCACHE_MAY_FREE			0x00800000
+
 extern seqlock_t rename_lock;
 
 static inline int dname_external(const struct dentry *dentry)
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 8:36 PM, Al Viro wrote:
> On Wed, Apr 30, 2014 at 07:33:38PM +0200, Miklos Szeredi wrote:
>> On Wed, Apr 30, 2014 at 6:03 PM, Al Viro wrote:
>> > On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote:
>> >> > FWIW, the first two are really straightforward expanding the function
>> >> > into its only callsite. The last needs more splitup. Not sure if the
>> >> > following is good enough, but it ought to be at least somewhat cleaner.
>> >> > Combined change is identical to the original, so it doesn't invalidate
>> >> > the testing so far...
>> >>
>> >> Hmm, patches look okay, but I'm wondering if we really need the morgue
>> >> list and the waiting. Why not just skip dentries that are presently
>> >> being handled by dentry_kill()?
>> >
>> > Who will be freeing them? If we do that from dentry_kill(), we are back to
>> > needing them removed from shrink list by something called by dput(). And
>> > if we do that from shrink_dentry_list(), we'd damn better wait for
>> > dentry_iput() et.al. to finish.
>>
>> We can do it from dput if the shrinker gets there first and from the
>> shrinker if dput manages to finish before. See the updated patch in
>> the previous mail.
>
> Er? The only patch I see is removal of RCU from shrink_dentry_list(), which
> is fine, but doesn't do anything of that sort. What was the Message-ID?

Message-ID: <20140430154958.gc3...@tucsk.piliscsaba.szeredi.hu>

> Let me see if I understand what you are proposing:
> 	dentry_kill(dentry, 0) seeing DCACHE_DENTRY_KILLED
> 		check DCACHE_MAY_FREE, free if it's been set
> 	dentry_kill(dentry, 1) seeing that we are on shrinker list
> 		leave on the list, do the work on killing, retake ->d_lock,
> 		if we are still on shrinker list
> 			set DCACHE_MAY_FREE,
> 		else
> 			free it
>
> That would probably work...

Exactly.

Thanks,
Miklos
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 07:33:38PM +0200, Miklos Szeredi wrote:
> On Wed, Apr 30, 2014 at 6:03 PM, Al Viro wrote:
> > On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote:
> >> > FWIW, the first two are really straightforward expanding the function
> >> > into its only callsite. The last needs more splitup. Not sure if the
> >> > following is good enough, but it ought to be at least somewhat cleaner.
> >> > Combined change is identical to the original, so it doesn't invalidate
> >> > the testing so far...
> >>
> >> Hmm, patches look okay, but I'm wondering if we really need the morgue
> >> list and the waiting. Why not just skip dentries that are presently
> >> being handled by dentry_kill()?
> >
> > Who will be freeing them? If we do that from dentry_kill(), we are back to
> > needing them removed from shrink list by something called by dput(). And
> > if we do that from shrink_dentry_list(), we'd damn better wait for
> > dentry_iput() et.al. to finish.
>
> We can do it from dput if the shrinker gets there first and from the
> shrinker if dput manages to finish before. See the updated patch in
> the previous mail.

Er? The only patch I see is removal of RCU from shrink_dentry_list(), which is fine, but doesn't do anything of that sort. What was the Message-ID?

Let me see if I understand what you are proposing:

	dentry_kill(dentry, 0) seeing DCACHE_DENTRY_KILLED
		check DCACHE_MAY_FREE, free if it's been set
	dentry_kill(dentry, 1) seeing that we are on shrinker list
		leave on the list, do the work on killing, retake ->d_lock,
		if we are still on shrinker list
			set DCACHE_MAY_FREE,
		else
			free it

That would probably work...
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 6:03 PM, Al Viro wrote:
> On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote:
>> > FWIW, the first two are really straightforward expanding the function
>> > into its only callsite. The last needs more splitup. Not sure if the
>> > following is good enough, but it ought to be at least somewhat cleaner.
>> > Combined change is identical to the original, so it doesn't invalidate
>> > the testing so far...
>>
>> Hmm, patches look okay, but I'm wondering if we really need the morgue
>> list and the waiting. Why not just skip dentries that are presently
>> being handled by dentry_kill()?
>
> Who will be freeing them? If we do that from dentry_kill(), we are back to
> needing them removed from shrink list by something called by dput(). And
> if we do that from shrink_dentry_list(), we'd damn better wait for
> dentry_iput() et.al. to finish.

We can do it from dput if the shrinker gets there first and from the shrinker if dput manages to finish before. See the updated patch in the previous mail.

Thanks,
Miklos
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote:
> > FWIW, the first two are really straightforward expanding the function
> > into its only callsite. The last needs more splitup. Not sure if the
> > following is good enough, but it ought to be at least somewhat cleaner.
> > Combined change is identical to the original, so it doesn't invalidate
> > the testing so far...
>
> Hmm, patches look okay, but I'm wondering if we really need the morgue
> list and the waiting. Why not just skip dentries that are presently
> being handled by dentry_kill()?

Who will be freeing them? If we do that from dentry_kill(), we are back to needing them removed from shrink list by something called by dput(). And if we do that from shrink_dentry_list(), we'd damn better wait for dentry_iput() et.al. to finish.
Re: dcache shrink list corruption?
And a followup patch for that, removing RCU locking from shrink_dentry_list(). I haven't tested any of this at this point, but I'll ask IBM to do some stress testing.

Thanks,
Miklos

---
From: Miklos Szeredi
Subject: dcache: don't need rcu in shrink_dentry_list()

Since now the shrink list is private and nobody can free the dentry while it is on the shrink list, we can remove RCU protection from this.

Signed-off-by: Miklos Szeredi
---
 fs/dcache.c | 29 -
 1 file changed, 4 insertions(+), 25 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -822,23 +822,9 @@ static void shrink_dentry_list(struct li
 {
 	struct dentry *dentry, *parent;
 
-	rcu_read_lock();
-	for (;;) {
-		dentry = list_entry_rcu(list->prev, struct dentry, d_lru);
-		if (&dentry->d_lru == list)
-			break; /* empty */
-
-		/*
-		 * Get the dentry lock, and re-verify that the dentry is
-		 * this on the shrinking list. If it is, we know that
-		 * DCACHE_SHRINK_LIST and DCACHE_LRU_LIST are set.
-		 */
+	while (!list_empty(list)) {
+		dentry = list_entry(list->prev, struct dentry, d_lru);
 		spin_lock(&dentry->d_lock);
-		if (dentry != list_entry(list->prev, struct dentry, d_lru)) {
-			spin_unlock(&dentry->d_lock);
-			continue;
-		}
-
 		/*
 		 * The dispose list is isolated and dentries are not accounted
 		 * to the LRU here, so we can simply remove it from the list
@@ -854,7 +840,6 @@ static void shrink_dentry_list(struct li
 			spin_unlock(&dentry->d_lock);
 			continue;
 		}
-		rcu_read_unlock();
 
 		if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
 			bool free_it;
@@ -870,8 +855,6 @@ static void shrink_dentry_list(struct li
 
 			if (free_it)
 				dentry_free(dentry);
-
-			rcu_read_lock();
 			continue;
 		}
 
@@ -879,17 +862,15 @@ static void shrink_dentry_list(struct li
 		/*
 		 * If dentry_kill returns NULL, we have nothing more to do.
 		 */
-		if (!parent) {
-			rcu_read_lock();
+		if (!parent)
 			continue;
-		}
+
 		if (unlikely(parent == dentry)) {
 			/*
 			 * trylocks have failed and d_lock has been held the
 			 * whole time, so it could not have been added to any
 			 * other lists. Just add it back to the shrink list.
 			 */
-			rcu_read_lock();
 			d_shrink_add(dentry, list);
 			spin_unlock(&dentry->d_lock);
 			continue;
@@ -903,9 +884,7 @@ static void shrink_dentry_list(struct li
 		dentry = parent;
 		while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
 			dentry = dentry_kill(dentry, 1);
-		rcu_read_lock();
 	}
-	rcu_read_unlock();
 }
 
 static enum lru_status
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 05:04:36AM +0100, Al Viro wrote:
> On Tue, Apr 29, 2014 at 07:56:13PM -0700, Linus Torvalds wrote:
> > On Tue, Apr 29, 2014 at 7:31 PM, Al Viro wrote:
> > >
> > > OK, aggregate diff follows, more readable splitup (3 commits) attached.
> > > It seems to survive beating here; testing, review and comments are
> > > welcome.
> >
> > Miklos, did you have some particular load that triggered this, or was
> > it just some reports? It would be really good to get this patch some
> > stress-testing.
> >
> > I like how the patch removes more lines than it adds, but apart from
> > that it's hard to read the patch (even the split-out ones) and say
> > anything more about it. I think this needs a *lot* of testing.
>
> FWIW, the first two are really straightforward expanding the function
> into its only callsite. The last needs more splitup. Not sure if the
> following is good enough, but it ought to be at least somewhat cleaner.
> Combined change is identical to the original, so it doesn't invalidate
> the testing so far...

Hmm, patches look okay, but I'm wondering if we really need the morgue list and the waiting. Why not just skip dentries that are presently being handled by dentry_kill()?

Thanks,
Miklos

---
From: Al Viro
Date: Tue, 29 Apr 2014 23:52:05 -0400
Subject: Don't try to remove from shrink list we don't own

So far it's just been local equivalent transformations. Now the meat of that thing: dentry_kill() has two kinds of call sites - ones that had just dropped the last reference (dput(), killing ancestors in shrink_dentry_list()) and one where the victim had sat on shrink list with zero refcount. We already have a flag telling which kind it is (unlock_on_failure). What we want to do is to avoid d_shrink_del() in the first kind of dentry_kill(). We do, however, want everything except the actual freeing still done as we would in mainline.

So we need to deal with two new situations - the first kind of dentry_kill() finding a dentry on shrink list (result of laziness in dget(); we had it on shrink list with refcount 1) and the second kind finding a dentry that is currently being killed by the first kind. What we do is this:

- in the first kind of dentry_kill() we don't remove the dentry from the shrink list, this makes the shrink list really private to the shrinker functions

- we proceed with the normal killing but only actually free the dentry if it's definitely not on the shrink list at this point. If it's still on the shrink list set DCACHE_MAY_FREE and defer the freeing to shrink_dentry_list()

- shrink_dentry_list() will detect if the dentry is killed with DCACHE_DENTRY_KILLED. If DCACHE_MAY_FREE isn't yet set, just take the dentry off the shrink list and let the still running dentry_kill() finish it off. If DCACHE_MAY_FREE is set, then free the dentry.

Signed-off-by: Al Viro
Signed-off-by: Miklos Szeredi
---
 fs/dcache.c            | 44 ++--
 include/linux/dcache.h |  2 ++
 2 files changed, 44 insertions(+), 2 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -469,6 +469,7 @@ dentry_kill(struct dentry *dentry, int u
 {
 	struct inode *inode;
 	struct dentry *parent;
+	bool can_free = true;
 
 	inode = dentry->d_inode;
 	if (inode && !spin_trylock(&inode->i_lock)) {
@@ -504,8 +505,10 @@ dentry_kill(struct dentry *dentry, int u
 	if (dentry->d_flags & DCACHE_LRU_LIST) {
 		if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
 			d_lru_del(dentry);
-		else
+		else if (!unlock_on_failure)
 			d_shrink_del(dentry);
+		else
+			can_free = false;
 	}
 	/* if it was on the hash then remove it */
 	__d_drop(dentry);
@@ -527,7 +530,25 @@ dentry_kill(struct dentry *dentry, int u
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
 
-	dentry_free(dentry);
+	if (unlikely(!can_free)) {
+		spin_lock(&dentry->d_lock);
+		/*
+		 * Could have gotten off the shrink list while not holding the
+		 * dcache lock.
+		 *
+		 * If still on the shrink list shrink_dentry_list will take care
+		 * of freeing it for us. Signal this by setting the
+		 * DCACHE_MAY_FREE flag.
+		 */
+		if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
+			can_free = true;
+		else
+			dentry->d_flags |= DCACHE_MAY_FREE;
+		spin_unlock(&dentry->d_lock);
+	}
+	if (likely(can_free))
+		dentry_free(dentry);
+
 	return parent;
 }
 
@@ -835,6 +856,25 @@ static void shrink_dentry_list(struct li
 		}
 		rcu_read_unlock();
 
+		if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+
Re: dcache shrink list corruption?
On Tue, Apr 29, 2014 at 07:56:13PM -0700, Linus Torvalds wrote:
> On Tue, Apr 29, 2014 at 7:31 PM, Al Viro wrote:
> >
> > OK, aggregate diff follows, more readable splitup (3 commits) attached.
> > It seems to survive beating here; testing, review and comments are
> > welcome.
>
> Miklos, did you have some particular load that triggered this, or was
> it just some reports? It would be really good to get this patch some
> stress-testing.
>
> I like how the patch removes more lines than it adds, but apart from
> that it's hard to read the patch (even the split-out ones) and say
> anything more about it. I think this needs a *lot* of testing.

IBM is triggering this with the host01 test from the LTP suite. I haven't yet tried to reproduce it.

Thanks,
Miklos
Re: dcache shrink list corruption?
On Tue, Apr 29, 2014 at 07:56:13PM -0700, Linus Torvalds wrote: On Tue, Apr 29, 2014 at 7:31 PM, Al Viro v...@zeniv.linux.org.uk wrote: OK, aggregate diff follows, more readable splitup (3 commits) attached. It seems to survive beating here; testing, review and comments are welcome. Miklos, did you have some particular load that triggered this, or was it just some reports? It would be really good to get this patch some stress-testing. I like how the patch removes more lines than it adds, but apart from that it's hard to read the patch (even the split-out ones) and say anything more about it. I think this needs a *lot* of testing. IBM is triggering this with the host01 test from the LTP suite. I haven't yet tried to reproduce it. Thanks, Miklos -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 05:04:36AM +0100, Al Viro wrote: On Tue, Apr 29, 2014 at 07:56:13PM -0700, Linus Torvalds wrote: On Tue, Apr 29, 2014 at 7:31 PM, Al Viro v...@zeniv.linux.org.uk wrote: OK, aggregate diff follows, more readable splitup (3 commits) attached. It seems to survive beating here; testing, review and comments are welcome. Miklos, did you have some particular load that triggered this, or was it just some reports? It would be really good to get this patch some stress-testing. I like how the patch removes more lines than it adds, but apart from that it's hard to read the patch (even the split-out ones) and say anything more about it. I think this needs a *lot* of testing. FWIW, the first two are really straightforward expanding the function into its only callsite. The last needs more splitup. Not sure if the following is good enough, but it ought to be at least somewhat cleaner. Combined change is identical to the original, so it doesn't invalidate the testing so far... Hmm, patches look okay, but I'm wondering if we really need the morgue list and the waiting. Why not just skip dentries that are presently being handled by dentry_kill()? Thanks, Miklos --- From: Al Viro v...@zeniv.linux.org.uk Date: Tue, 29 Apr 2014 23:52:05 -0400 Subject: Don't try to remove from shrink list we don't own So far it's just been local equivalent transformations. Now the meat of that thing: dentry_kill() has two kinds of call sites - ones that had just dropped the last reference (dput(), killing ancestors in shrink_dentry_list()) and one where the victim had sat on shrink list with zero refcount. We already have a flag telling which kind it is (unlock_on_failure). What we want to do is to avoid d_shrink_del() in the first kind of dentry_kill(). We do, however, want everything except the actual freeing still done as we would in mainline. 
So we need to deal with two new situations - the first kind of dentry_kill() finding a dentry on shrink list (result of laziness in dget(); we had it on shrink list with refcount 1) and the second kind finding a dentry that is currently being killed by the first kind. What we do is this: - in the first kind of dentry_kill() we don't remove the dentry from the shrink list, this makes the shrink list really private to the shrinker functions - we proceed with the normal killing but only actually free the dentry if it's definitely not on the shrink list at this point. If it's still on the shrink list set DCACHE_MAY_FREE and defer the freeing to shrink_dentry_list() - shrink_dentry_list() will detect if the dentry is killed with DCACHE_DENTRY_KILLED. If DCACHE_MAY_FREE isn't yet set, just take the dentry off the shrink list and let the still running dentry_kill() finish it off. If DCACHE_MAY_FREE is set, then free the dentry. Signed-off-by: Al Viro v...@zeniv.linux.org.uk Signed-off-by: Miklos Szeredi mszer...@suse.cz --- fs/dcache.c| 44 ++-- include/linux/dcache.h |2 ++ 2 files changed, 44 insertions(+), 2 deletions(-) --- a/fs/dcache.c +++ b/fs/dcache.c @@ -469,6 +469,7 @@ dentry_kill(struct dentry *dentry, int u { struct inode *inode; struct dentry *parent; + bool can_free = true; inode = dentry-d_inode; if (inode !spin_trylock(inode-i_lock)) { @@ -504,8 +505,10 @@ dentry_kill(struct dentry *dentry, int u if (dentry-d_flags DCACHE_LRU_LIST) { if (!(dentry-d_flags DCACHE_SHRINK_LIST)) d_lru_del(dentry); - else + else if (!unlock_on_failure) d_shrink_del(dentry); + else + can_free = false; } /* if it was on the hash then remove it */ __d_drop(dentry); @@ -527,7 +530,25 @@ dentry_kill(struct dentry *dentry, int u if (dentry-d_op dentry-d_op-d_release) dentry-d_op-d_release(dentry); - dentry_free(dentry); + if (unlikely(!can_free)) { + spin_lock(dentry-d_lock); + /* +* Could have gotten off the shrink list while not holding the +* dcache lock. 
+* +* If still on the shrink list shrink_dentry_list will take care +* of freeing it for us. Signal this by setting the +* DCACHE_MAY_FREE flag. +*/ + if (!(dentry-d_flags DCACHE_SHRINK_LIST)) + can_free = true; + else + dentry-d_flags |= DCACHE_MAY_FREE; + spin_unlock(dentry-d_lock); + } + if (likely(can_free)) + dentry_free(dentry); + return parent; } @@ -835,6 +856,25 @@ static void shrink_dentry_list(struct li } rcu_read_unlock(); + if
Re: dcache shrink list corruption?
And a followup patch for that, removing RCU locking from shrink_dentry_list(). I haven't tested any of this at this point, but I'll ask IBM to do some stress testing. Thanks, Miklos --- From: Miklos Szeredi mszer...@suse.cz Subject: dcache: don't need rcu in shrink_dentry_list() Since now the shrink list is private and nobody can free the dentry while it is on the shrink list, we can remove RCU protection from this. Signed-off-by: Miklos Szeredi mszer...@suse.cz --- fs/dcache.c | 29 - 1 file changed, 4 insertions(+), 25 deletions(-) --- a/fs/dcache.c +++ b/fs/dcache.c @@ -822,23 +822,9 @@ static void shrink_dentry_list(struct li { struct dentry *dentry, *parent; - rcu_read_lock(); - for (;;) { - dentry = list_entry_rcu(list-prev, struct dentry, d_lru); - if (dentry-d_lru == list) - break; /* empty */ - - /* -* Get the dentry lock, and re-verify that the dentry is -* this on the shrinking list. If it is, we know that -* DCACHE_SHRINK_LIST and DCACHE_LRU_LIST are set. -*/ + while (!list_empty(list)) { + dentry = list_entry(list-prev, struct dentry, d_lru); spin_lock(dentry-d_lock); - if (dentry != list_entry(list-prev, struct dentry, d_lru)) { - spin_unlock(dentry-d_lock); - continue; - } - /* * The dispose list is isolated and dentries are not accounted * to the LRU here, so we can simply remove it from the list @@ -854,7 +840,6 @@ static void shrink_dentry_list(struct li spin_unlock(dentry-d_lock); continue; } - rcu_read_unlock(); if (unlikely(dentry-d_flags DCACHE_DENTRY_KILLED)) { bool free_it; @@ -870,8 +855,6 @@ static void shrink_dentry_list(struct li if (free_it) dentry_free(dentry); - - rcu_read_lock(); continue; } @@ -879,17 +862,15 @@ static void shrink_dentry_list(struct li /* * If dentry_kill returns NULL, we have nothing more to do. 
*/ - if (!parent) { - rcu_read_lock(); + if (!parent) continue; - } + if (unlikely(parent == dentry)) { /* * trylocks have failed and d_lock has been held the * whole time, so it could not have been added to any * other lists. Just add it back to the shrink list. */ - rcu_read_lock(); d_shrink_add(dentry, list); spin_unlock(dentry-d_lock); continue; @@ -903,9 +884,7 @@ static void shrink_dentry_list(struct li dentry = parent; while (dentry !lockref_put_or_lock(dentry-d_lockref)) dentry = dentry_kill(dentry, 1); - rcu_read_lock(); } - rcu_read_unlock(); } static enum lru_status -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote: FWIW, the first two are really straightforward expanding the function into its only callsite. The last needs more splitup. Not sure if the following is good enough, but it ought to be at least somewhat cleaner. Combined change is identical to the original, so it doesn't invalidate the testing so far... Hmm, patches look okay, but I'm wondering if we really need the morgue list and the waiting. Why not just skip dentries that are presently being handled by dentry_kill()? Who will be freeing them? If we do that from dentry_kill(), we are back to needing them removed from shrink list by something called by dput(). And if we do that from shrink_dentry_list(), we'd damn better wait for dentry_iput() et.al. to finish. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 6:03 PM, Al Viro v...@zeniv.linux.org.uk wrote: On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote: FWIW, the first two are really straightforward expanding the function into its only callsite. The last needs more splitup. Not sure if the following is good enough, but it ought to be at least somewhat cleaner. Combined change is identical to the original, so it doesn't invalidate the testing so far... Hmm, patches look okay, but I'm wondering if we really need the morgue list and the waiting. Why not just skip dentries that are presently being handled by dentry_kill()? Who will be freeing them? If we do that from dentry_kill(), we are back to needing them removed from shrink list by something called by dput(). And if we do that from shrink_dentry_list(), we'd damn better wait for dentry_iput() et.al. to finish. We can do it from dput if the shrinker gets there first and from the shrinker if dput manages to finish before. See the updated patch in the previous mail. Thanks, Miklos -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 07:33:38PM +0200, Miklos Szeredi wrote: On Wed, Apr 30, 2014 at 6:03 PM, Al Viro v...@zeniv.linux.org.uk wrote: On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote: FWIW, the first two are really straightforward expanding the function into its only callsite. The last needs more splitup. Not sure if the following is good enough, but it ought to be at least somewhat cleaner. Combined change is identical to the original, so it doesn't invalidate the testing so far... Hmm, patches look okay, but I'm wondering if we really need the morgue list and the waiting. Why not just skip dentries that are presently being handled by dentry_kill()? Who will be freeing them? If we do that from dentry_kill(), we are back to needing them removed from shrink list by something called by dput(). And if we do that from shrink_dentry_list(), we'd damn better wait for dentry_iput() et.al. to finish. We can do it from dput if the shrinker gets there first and from the shrinker if dput manages to finish before. See the updated patch in the previous mail. Er? The only patch I see is removal of RCU from shrink_dentry_list(), which is fine, but doesn't do anything of that sort. What was the Message-ID? Let me see if I understand what you are proposing: dentry_kill(dentry, 0) seeing DCACHE_DENTRY_KILLED check DCACHE_MAY_FREE, free it's been set dentry_kill(dentry, 1) seeing that we are on shrinker list leave on the list, do the work on killing, retake -d_lock, if we are still on shrinker list set DCACHE_MAY_FREE, else free it That would probably work... -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 8:36 PM, Al Viro v...@zeniv.linux.org.uk wrote: On Wed, Apr 30, 2014 at 07:33:38PM +0200, Miklos Szeredi wrote: On Wed, Apr 30, 2014 at 6:03 PM, Al Viro v...@zeniv.linux.org.uk wrote: On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote: FWIW, the first two are really straightforward expanding the function into its only callsite. The last needs more splitup. Not sure if the following is good enough, but it ought to be at least somewhat cleaner. Combined change is identical to the original, so it doesn't invalidate the testing so far... Hmm, patches look okay, but I'm wondering if we really need the morgue list and the waiting. Why not just skip dentries that are presently being handled by dentry_kill()? Who will be freeing them? If we do that from dentry_kill(), we are back to needing them removed from shrink list by something called by dput(). And if we do that from shrink_dentry_list(), we'd damn better wait for dentry_iput() et.al. to finish. We can do it from dput if the shrinker gets there first and from the shrinker if dput manages to finish before. See the updated patch in the previous mail. Er? The only patch I see is removal of RCU from shrink_dentry_list(), which is fine, but doesn't do anything of that sort. What was the Message-ID? Message-ID: 20140430154958.gc3...@tucsk.piliscsaba.szeredi.hu Let me see if I understand what you are proposing: dentry_kill(dentry, 0) seeing DCACHE_DENTRY_KILLED check DCACHE_MAY_FREE, free it's been set dentry_kill(dentry, 1) seeing that we are on shrinker list leave on the list, do the work on killing, retake -d_lock, if we are still on shrinker list set DCACHE_MAY_FREE, else free it That would probably work... Exactly. Thanks, Miklos -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 08:42:01PM +0200, Miklos Szeredi wrote:
> Message-ID: 20140430154958.gc3...@tucsk.piliscsaba.szeredi.hu

I see... Several points:

* I still think that it's better to do the handling of "we see DCACHE_DENTRY_KILLED already set" in dentry_kill() itself.

* in dentry_kill(dentry, 0) we *know* that it's not on a shrink list - the caller has just removed it from there and we'd kept ->d_lock since that point. What's more, with that scheme we don't need to bother with can_free at all - just grab ->d_lock after the ->d_release() call and check DCACHE_SHRINK_LIST. No sense trying to avoid that - in the case where we could just go ahead and free the sucker, there's neither contention nor anybody else interested in that cacheline, so spin_lock(&dentry->d_lock) is pretty much free.

IOW, I'd prefer to do the following (completely untested) on top of 1/6--4/6:

diff --git a/fs/dcache.c b/fs/dcache.c
index e482775..82d8eb4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -489,6 +489,17 @@ relock:
 		goto relock;
 	}
 
+	if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+		if (parent)
+			spin_unlock(&parent->d_lock);
+		if (dentry->d_flags & DCACHE_MAY_FREE) {
+			spin_unlock(&dentry->d_lock);
+			dentry_free(dentry);
+		} else {
+			spin_unlock(&dentry->d_lock);
+		}
+		return parent;
+	}
 	/*
 	 * The dentry is now unrecoverably dead to the world.
 	 */
@@ -504,8 +515,6 @@ relock:
 	if (dentry->d_flags & DCACHE_LRU_LIST) {
 		if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
 			d_lru_del(dentry);
-		else
-			d_shrink_del(dentry);
 	}
 	/* if it was on the hash then remove it */
 	__d_drop(dentry);
@@ -527,7 +536,14 @@ relock:
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
 
-	dentry_free(dentry);
+	spin_lock(&dentry->d_lock);
+	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		dentry->d_flags |= DCACHE_MAY_FREE;
+		spin_unlock(&dentry->d_lock);
+	} else {
+		spin_unlock(&dentry->d_lock);
+		dentry_free(dentry);
+	}
 	return parent;
 }
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3b9bfdb..3c7ec32 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -221,6 +221,8 @@ struct dentry_operations {
 #define DCACHE_SYMLINK_TYPE		0x00300000 /* Symlink */
 #define DCACHE_FILE_TYPE		0x00400000 /* Other file type */
 
+#define DCACHE_MAY_FREE			0x00800000
+
 extern seqlock_t rename_lock;
 
 static inline int dname_external(const struct dentry *dentry)
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 08:02:27PM +0100, Al Viro wrote:
> On Wed, Apr 30, 2014 at 08:42:01PM +0200, Miklos Szeredi wrote:
>> Message-ID: 20140430154958.gc3...@tucsk.piliscsaba.szeredi.hu
>
> I see... Several points:
>
> * I still think that it's better to do the handling of "we see DCACHE_DENTRY_KILLED already set" in dentry_kill() itself.
>
> * in dentry_kill(dentry, 0) we *know* that it's not on a shrink list - the caller has just removed it from there and we'd kept ->d_lock since that point. What's more, with that scheme we don't need to bother with can_free at all - just grab ->d_lock after the ->d_release() call and check DCACHE_SHRINK_LIST. No sense trying to avoid that - in the case where we could just go ahead and free the sucker, there's neither contention nor anybody else interested in that cacheline, so spin_lock(&dentry->d_lock) is pretty much free.
>
> IOW, I'd prefer to do the following (completely untested) on top of 1/6--4/6:

Sigh... One more problem:

	/*
	 * We found an inuse dentry which was not removed from
	 * the LRU because of laziness during lookup. Do not free it.
	 */
	if (dentry->d_lockref.count) {
		spin_unlock(&dentry->d_lock);
		continue;
	}

should become

	if (dentry->d_lockref.count > 0) {

instead - lockref_mark_dead() slaps -128 into it, so we'll just leak all dentries where dput() has come first and decided to leave the suckers for us.

Another thing: I don't like what's going on with freeing vs. ->d_lock there. Had that been a mutex, we'd definitely get a repeat of "vfs: fix subtle use-after-free of pipe_inode_info". The question is, can spin_unlock(p) dereference p after another CPU gets through spin_lock(p)? Linus?

It can be dealt with by setting DCACHE_RCUACCESS in the "let another guy free it" cases and playing with rcu_read_lock a bit, but I wonder whether we need to bother - a quick look through alpha/sparc/ppc shows that on those we do not, and the same is true for the non-paravirt case on x86. I hadn't checked what the paravirt one does, though, and I certainly hadn't done an exhaustive search for architectures doing something weird...
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 12:59 PM, Al Viro <v...@zeniv.linux.org.uk> wrote:
> Another thing: I don't like what's going on with freeing vs. ->d_lock there. Had that been a mutex, we'd definitely get a repeat of "vfs: fix subtle use-after-free of pipe_inode_info". The question is, can spin_unlock(p) dereference p after another CPU gets through spin_lock(p)? Linus?

spin_unlock() *should* be safe wrt that issue.

But I have to say, I think paravirtualized spinlocks may break that. They do all kinds of "kick waiters after releasing the lock".

Doesn't the RCU protection solve that, though? Nobody should be releasing the dentry under us, afaik..

Linus
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 01:23:26PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 12:59 PM, Al Viro <v...@zeniv.linux.org.uk> wrote:
>> Another thing: I don't like what's going on with freeing vs. ->d_lock there. Had that been a mutex, we'd definitely get a repeat of "vfs: fix subtle use-after-free of pipe_inode_info". The question is, can spin_unlock(p) dereference p after another CPU gets through spin_lock(p)? Linus?
>
> spin_unlock() *should* be safe wrt that issue.
>
> But I have to say, I think paravirtualized spinlocks may break that. They do all kinds of "kick waiters after releasing the lock".
>
> Doesn't the RCU protection solve that, though? Nobody should be releasing the dentry under us, afaik..

We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can trigger any amount of IO, for one thing. We can take it around the couple of places where we do that spin_unlock(&dentry->d_lock) (along with setting DCACHE_RCUACCESS) - that's what I'd been referring to. Then this sucker (tests still running, so far everything seems to survive) becomes the following (again, on top of 1/6..4/6).

BTW, is there any convenient way to tell git commit --amend to update the commit date? Something like --date=now would be nice, but it isn't accepted...

commit 797ff22681dc969b478ed837787d24dfd2dd2132
Author: Al Viro <v...@zeniv.linux.org.uk>
Date:   Tue Apr 29 23:52:05 2014 -0400

    dentry_kill(): don't try to remove from shrink list
    
    If the victim is on the shrink list, don't remove it from there.
    If shrink_dentry_list() manages to remove it from the list before
    we are done - fine, we'll just free it as usual.  If not - mark it
    with a new flag (DCACHE_MAY_FREE) and leave it there.  Eventually,
    shrink_dentry_list() will get to it, remove the sucker from the
    shrink list and call dentry_kill(dentry, 0).  Which is where we'll
    deal with freeing.
    
    Since now dentry_kill(dentry, 0) may happen after or during
    dentry_kill(dentry, 1), we need to recognize that (by seeing
    DCACHE_DENTRY_KILLED already set), unlock everything and either
    free the sucker (in case DCACHE_MAY_FREE has been set) or leave it
    for the ongoing dentry_kill(dentry, 1) to deal with.
    
    Signed-off-by: Al Viro <v...@zeniv.linux.org.uk>

diff --git a/fs/dcache.c b/fs/dcache.c
index e482775..fa40d26 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -489,6 +489,20 @@ relock:
 		goto relock;
 	}
 
+	if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+		if (parent)
+			spin_unlock(&parent->d_lock);
+		if (dentry->d_flags & DCACHE_MAY_FREE) {
+			spin_unlock(&dentry->d_lock);
+			dentry_free(dentry);
+		} else {
+			dentry->d_flags |= DCACHE_RCUACCESS;
+			rcu_read_lock();
+			spin_unlock(&dentry->d_lock);
+			rcu_read_unlock();
+		}
+		return parent;
+	}
 	/*
 	 * The dentry is now unrecoverably dead to the world.
 	 */
@@ -504,8 +518,6 @@ relock:
 	if (dentry->d_flags & DCACHE_LRU_LIST) {
 		if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
 			d_lru_del(dentry);
-		else
-			d_shrink_del(dentry);
 	}
 	/* if it was on the hash then remove it */
 	__d_drop(dentry);
@@ -527,7 +539,16 @@ relock:
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
 
-	dentry_free(dentry);
+	spin_lock(&dentry->d_lock);
+	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		dentry->d_flags |= DCACHE_MAY_FREE | DCACHE_RCUACCESS;
+		rcu_read_lock();
+		spin_unlock(&dentry->d_lock);
+		rcu_read_unlock();
+	} else {
+		spin_unlock(&dentry->d_lock);
+		dentry_free(dentry);
+	}
 	return parent;
 }
 
@@ -829,7 +850,7 @@ static void shrink_dentry_list(struct list_head *list)
 		 * We found an inuse dentry which was not removed from
 		 * the LRU because of laziness during lookup. Do not free it.
 		 */
-		if (dentry->d_lockref.count) {
+		if (dentry->d_lockref.count > 0) {
 			spin_unlock(&dentry->d_lock);
 			continue;
 		}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3b9bfdb..3c7ec32 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -221,6 +221,8 @@ struct dentry_operations {
 #define DCACHE_SYMLINK_TYPE		0x00300000 /* Symlink */
 #define DCACHE_FILE_TYPE		0x00400000 /* Other file type */
 
+#define DCACHE_MAY_FREE			0x00800000
+
 extern seqlock_t rename_lock;
 
 static inline int dname_external(const struct dentry *dentry)
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 1:38 PM, Al Viro <v...@zeniv.linux.org.uk> wrote:
> We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can trigger any amount of IO, for one thing. We can take it around the couple of places where we do that spin_unlock(&dentry->d_lock) (along with setting DCACHE_RCUACCESS) - that's what I'd been referring to.

Just the last spin_unlock() would be the case that matters, if the spin_unlock() is done on something that could be freed immediately and the lock protects and is inside the entity that gets freed.

> BTW, is there any convenient way to tell git commit --amend to update the commit date? Something like --date=now would be nice, but it isn't accepted...

--date="$(date)" works.

The --date thing doesn't take the nice git approxidate strings, maybe it should. So you have to give it a real date.

Linus
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 01:57:05PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 1:38 PM, Al Viro <v...@zeniv.linux.org.uk> wrote:
>> We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can trigger any amount of IO, for one thing. We can take it around the couple of places where we do that spin_unlock(&dentry->d_lock) (along with setting DCACHE_RCUACCESS) - that's what I'd been referring to.
>
> Just the last spin_unlock() would be the case that matters, if the spin_unlock() is done on something that could be freed immediately and the lock protects and is inside the entity that gets freed.

*nod* There are two such spin_unlock (the handover from shrink_dentry_list() to dput() and the opposite one), but they are all that needs protection - the ->d_flags update is outside the rcu-critical area. I really wonder if we *can* get there without DCACHE_RCUACCESS having been set, though; the dentry would have to be
* picked into the shrink list (i.e. have had zero refcount at some point)
* never had been through __d_rehash()
shrink_dentry_list() definitely counts on that being impossible, and it probably is, but I'm feeling seriously paranoid about the whole area. I'll finish grepping through the tree and probably drop setting DCACHE_RCUACCESS from the patch - either that, or set it in d_shrink_add() if it turns out that it is possible and shrink_dentry_list() is fucked... Tests seem to be running fine so far...

>> BTW, is there any convenient way to tell git commit --amend to update the commit date? Something like --date=now would be nice, but it isn't accepted...
>
> --date="$(date)" works.

Thanks...
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 10:12:06PM +0100, Al Viro wrote:
> On Wed, Apr 30, 2014 at 01:57:05PM -0700, Linus Torvalds wrote:
>> On Wed, Apr 30, 2014 at 1:38 PM, Al Viro <v...@zeniv.linux.org.uk> wrote:
>>> We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can trigger any amount of IO, for one thing. We can take it around the couple of places where we do that spin_unlock(&dentry->d_lock) (along with setting DCACHE_RCUACCESS) - that's what I'd been referring to.
>>
>> Just the last spin_unlock() would be the case that matters, if the spin_unlock() is done on something that could be freed immediately and the lock protects and is inside the entity that gets freed.
>
> *nod* There are two such spin_unlock (the handover from shrink_dentry_list() to dput() and the opposite one), but they are all that needs protection - the ->d_flags update is outside the rcu-critical area. I really wonder if we *can* get there without DCACHE_RCUACCESS having been set, though; the dentry would have to be
> * picked into the shrink list (i.e. have had zero refcount at some point)
> * never had been through __d_rehash()
> shrink_dentry_list() definitely counts on that being impossible, and it probably is, but I'm feeling seriously paranoid about the whole area. I'll finish grepping through the tree and probably drop setting DCACHE_RCUACCESS from the patch - either that, or set it in d_shrink_add() if it turns out that it is possible and shrink_dentry_list() is fucked...

OK, it really can't happen. The proof is more convoluted than I'd like, but it's solid enough, so setting that flag in the dentry_kill() handover cases wasn't needed. I've just pushed the whole thing to vfs.git#for-linus; review and testing would be very welcome. I can repost it one more time, but the only difference compared to the last variant in this thread is not bothering with DCACHE_RCUACCESS. It has survived LTP tests, going through xfstests now...
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 3:12 PM, Al Viro <v...@zeniv.linux.org.uk> wrote:
> I've just pushed the whole thing to vfs.git#for-linus; review and testing would be very welcome.

I have no half-way relevant test-case for this, so I'm hoping people who have good VFS stress-tests (preferably under memory pressure so that we get that whole shrinking path) will test. But it looks fine.

That said, I do hate that RCU read-lock around the final spin-unlock. Let me go talk to the paravirt people. Maybe they don't need this, and I don't know exactly *how* they use that lock pointer after the unlock in the "kick waiters" part. Maybe it's all good.

Linus
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 4:04 PM, Linus Torvalds <torva...@linux-foundation.org> wrote:
> Let me go talk to the paravirt people. Maybe they don't need this, and I don't know exactly *how* they use that lock pointer after the unlock in the "kick waiters" part. Maybe it's all good.

.. looking at the current users (xen and kvm) it does in fact look all good. Yes, we kick possible waiters after the unlock, but the lock itself is not touched by that; it just uses the pointer to the lock as a way to figure out who to kick.

In fact, I kind of knew that, but had forgotten. We very much depend on spin_unlock being safe wrt immediate deletion already: the completion code very much depends on it. It does a spin_unlock() to release the completion, and it can get reused immediately (it might be on the stack, it might be in some data structure that gets freed).

So I'd suggest removing that whole RCU thing, because it should be safe to unlock something that can go away immediately. We'd have huge problems elsewhere if it wasn't safe.

Linus
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 04:04:58PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 3:12 PM, Al Viro <v...@zeniv.linux.org.uk> wrote:
>> I've just pushed the whole thing to vfs.git#for-linus; review and testing would be very welcome.
>
> I have no half-way relevant test-case for this, so I'm hoping people who have good VFS stress-tests (preferably under memory pressure so that we get that whole shrinking path) will test. But it looks fine.
>
> That said, I do hate that RCU read-lock around the final spin-unlock.

So do I, obviously... After looking through the rest of the arch_spin_unlock() implementations, it looks like the only suspicious one except the paravirt one on x86 is blackfin. And that might be a misreading - I'm not familiar enough with the architecture to tell...
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 04:14:14PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 4:04 PM, Linus Torvalds <torva...@linux-foundation.org> wrote:
>> Let me go talk to the paravirt people. Maybe they don't need this, and I don't know exactly *how* they use that lock pointer after the unlock in the "kick waiters" part. Maybe it's all good.
>
> .. looking at the current users (xen and kvm) it does in fact look all good. Yes, we kick possible waiters after the unlock, but the lock itself is not touched by that; it just uses the pointer to the lock as a way to figure out who to kick.
>
> In fact, I kind of knew that, but had forgotten. We very much depend on spin_unlock being safe wrt immediate deletion already: the completion code very much depends on it. It does a spin_unlock() to release the completion, and it can get reused immediately (it might be on the stack, it might be in some data structure that gets freed).
>
> So I'd suggest removing that whole RCU thing, because it should be safe to unlock something that can go away immediately. We'd have huge problems elsewhere if it wasn't safe.

OK, done and force-pushed. Should propagate in a few...
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 4:43 PM, Al Viro <v...@zeniv.linux.org.uk> wrote:
> OK, done and force-pushed. Should propagate in a few...

That made it more obvious how the DCACHE_MAY_FREE case ends up working. And in particular, mind rewriting this:

	if (dentry->d_flags & DCACHE_MAY_FREE) {
		spin_unlock(&dentry->d_lock);
		dentry_free(dentry);
	} else {
		spin_unlock(&dentry->d_lock);
	}
	return parent;

as just

	bool free = dentry->d_flags & DCACHE_MAY_FREE;
	spin_unlock(&dentry->d_lock);
	if (free)
		dentry_free(dentry);
	return parent;

instead? In fact, I get the feeling that the other case later on really fits the same model:

	spin_lock(&dentry->d_lock);
	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
		dentry->d_flags |= DCACHE_MAY_FREE;
		spin_unlock(&dentry->d_lock);
	} else {
		spin_unlock(&dentry->d_lock);
		dentry_free(dentry);
	}

ends up really being better as

	spin_lock(&dentry->d_lock);
	free = 1;
	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
		dentry->d_flags |= DCACHE_MAY_FREE;
		free = 0;
	}
	spin_unlock(&dentry->d_lock);
	if (free)
		dentry_free(dentry);
	return parent;

and then suddenly it looks like we have a common exit sequence from that dentry_kill() function, no? (The earlier unlock_on_failure exit case is altogether a different case.)

I dunno. Maybe not a big deal, but one reason I prefer doing that "free" flag is because I really tend to prefer the simple case of lock-unlock pairing cleanly at the same level. NOT the pattern where you have one lock at one indentation level, paired with multiple unlocks for all the different cases.

Linus
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 05:18:23PM -0700, Linus Torvalds wrote:
> and then suddenly it looks like we have a common exit sequence from that dentry_kill() function, no? (The earlier unlock_on_failure exit case is altogether a different case.)
>
> I dunno. Maybe not a big deal, but one reason I prefer doing that "free" flag is because I really tend to prefer the simple case of lock-unlock pairing cleanly at the same level. NOT the pattern where you have one lock at one indentation level, paired with multiple unlocks for all the different cases.

Yeah, but... I have such a variant, but the best I could get still generated code that wasn't particularly nice. Part might be gcc code generation sucking for bool, part - extra register pressure... It has a slightly lower i-cache footprint, though, so it might be worth doing. Hell knows; that's a fairly hot codepath, so let's do it that way - I've just pushed an alternative branch with the bool can_free variant; branches in question: vfs.git#for-linus and vfs.git#dentry_kill-2.

Help with profiling is needed; the loads to watch are the ones where dentry_kill() + dentry_free() are sufficiently high in the profiles for the differences to matter. Any takers?
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 7:51 PM, Al Viro <v...@zeniv.linux.org.uk> wrote:
> Help with profiling is needed; the loads to watch are the ones where dentry_kill() + dentry_free() are sufficiently high in the profiles for the differences to matter. Any takers?

I really hope there are no such loads; my lock/unlock pairing suggestion was mostly so that the pairing is clearer, not necessarily for performance.

That said, I'd assume that it might be worth testing at least lots of concurrent lookups of 'simple_dentry_operations' dentries. So most of /proc, most uses of simple_lookup(). That at least triggers the dentry_kill() path in dput(), so it should be fairly easy to get profiles.

But real loads with real filesystems? That sounds harder.

Linus
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 07:59:43PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 7:51 PM, Al Viro <v...@zeniv.linux.org.uk> wrote:
>> Help with profiling is needed; the loads to watch are the ones where dentry_kill() + dentry_free() are sufficiently high in the profiles for the differences to matter. Any takers?
>
> I really hope there are no such loads; my lock/unlock pairing suggestion was mostly so that the pairing is clearer, not necessarily for performance.
>
> That said, I'd assume that it might be worth testing at least lots of concurrent lookups of 'simple_dentry_operations' dentries. So most of /proc, most uses of simple_lookup(). That at least triggers the dentry_kill() path in dput(), so it should be fairly easy to get profiles.
>
> But real loads with real filesystems? That sounds harder.

Well, the simplest way to do that is a bunch of open/unlink/close. Another one is cp -rl followed by rm -rf of a large tree (rmdir(2) does shrink_dcache_parent()). For that matter, unmapping an anon shared mapping will trigger the same path.
Re: dcache shrink list corruption?
On Tue, Apr 29, 2014 at 07:56:13PM -0700, Linus Torvalds wrote:
> On Tue, Apr 29, 2014 at 7:31 PM, Al Viro wrote:
>> OK, aggregate diff follows, more readable splitup (3 commits) attached. It seems to survive beating here; testing, review and comments are welcome.
>
> Miklos, did you have some particular load that triggered this, or was it just some reports? It would be really good to get this patch some stress-testing.
>
> I like how the patch removes more lines than it adds, but apart from that it's hard to read the patch (even the split-out ones) and say anything more about it. I think this needs a *lot* of testing.

FWIW, the first two are really straightforward - expanding the function into its only callsite. The last needs more splitup. Not sure if the following is good enough, but it ought to be at least somewhat cleaner. Combined change is identical to the original, so it doesn't invalidate the testing so far...

>From 895aeb48465bbf78803fd11dee2556d010ada23a Mon Sep 17 00:00:00 2001
From: Al Viro
Date: Tue, 29 Apr 2014 15:45:28 -0400
Subject: [PATCH 1/6] fold d_kill() and d_free()

Signed-off-by: Al Viro
---
 fs/dcache.c |   76 +++++++++++++++++------------------------------
 1 file changed, 24 insertions(+), 52 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 494a9def..9b15c5c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -246,23 +246,6 @@ static void __d_free(struct rcu_head *head)
 	kmem_cache_free(dentry_cache, dentry);
 }
 
-/*
- * no locks, please.
- */
-static void d_free(struct dentry *dentry)
-{
-	BUG_ON((int)dentry->d_lockref.count > 0);
-	this_cpu_dec(nr_dentry);
-	if (dentry->d_op && dentry->d_op->d_release)
-		dentry->d_op->d_release(dentry);
-
-	/* if dentry was never visible to RCU, immediate free is OK */
-	if (!(dentry->d_flags & DCACHE_RCUACCESS))
-		__d_free(&dentry->d_u.d_rcu);
-	else
-		call_rcu(&dentry->d_u.d_rcu, __d_free);
-}
-
 /**
  * dentry_rcuwalk_barrier - invalidate in-progress rcu-walk lookups
  * @dentry: the target dentry
@@ -420,40 +403,6 @@ static void dentry_lru_del(struct dentry *dentry)
 }
 
 /**
- * d_kill - kill dentry and return parent
- * @dentry: dentry to kill
- * @parent: parent dentry
- *
- * The dentry must already be unhashed and removed from the LRU.
- *
- * If this is the root of the dentry tree, return NULL.
- *
- * dentry->d_lock and parent->d_lock must be held by caller, and are dropped by
- * d_kill.
- */
-static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
-	__releases(dentry->d_lock)
-	__releases(parent->d_lock)
-	__releases(dentry->d_inode->i_lock)
-{
-	list_del(&dentry->d_u.d_child);
-	/*
-	 * Inform d_walk() that we are no longer attached to the
-	 * dentry tree
-	 */
-	dentry->d_flags |= DCACHE_DENTRY_KILLED;
-	if (parent)
-		spin_unlock(&parent->d_lock);
-	dentry_iput(dentry);
-	/*
-	 * dentry_iput drops the locks, at which point nobody (except
-	 * transient RCU lookups) can reach this dentry.
-	 */
-	d_free(dentry);
-	return parent;
-}
-
-/**
  * d_drop - drop a dentry
  * @dentry: dentry to drop
  *
@@ -546,7 +495,30 @@ relock:
 	dentry_lru_del(dentry);
 	/* if it was on the hash then remove it */
 	__d_drop(dentry);
-	return d_kill(dentry, parent);
+	list_del(&dentry->d_u.d_child);
+	/*
+	 * Inform d_walk() that we are no longer attached to the
+	 * dentry tree
+	 */
+	dentry->d_flags |= DCACHE_DENTRY_KILLED;
+	if (parent)
+		spin_unlock(&parent->d_lock);
+	dentry_iput(dentry);
+	/*
+	 * dentry_iput drops the locks, at which point nobody (except
+	 * transient RCU lookups) can reach this dentry.
+	 */
+	BUG_ON((int)dentry->d_lockref.count > 0);
+	this_cpu_dec(nr_dentry);
+	if (dentry->d_op && dentry->d_op->d_release)
+		dentry->d_op->d_release(dentry);
+
+	/* if dentry was never visible to RCU, immediate free is OK */
+	if (!(dentry->d_flags & DCACHE_RCUACCESS))
+		__d_free(&dentry->d_u.d_rcu);
+	else
+		call_rcu(&dentry->d_u.d_rcu, __d_free);
+	return parent;
 }
-- 
1.7.10.4

>From aff934c47717be0216c9e2c10a2e8ca0f829bb54 Mon Sep 17 00:00:00 2001
From: Al Viro
Date: Tue, 29 Apr 2014 16:13:18 -0400
Subject: [PATCH 2/6] fold try_prune_one_dentry()

Signed-off-by: Al Viro
---
 fs/dcache.c |   75 ++++++++++++++++-------------------------------
 1 file changed, 25 insertions(+), 50 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 9b15c5c..a5540d4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -787,47 +787,9 @@ restart:
 }
 EXPORT_SYMBOL(d_prune_aliases);
 
-/*
- * Try to throw away a dentry - free the inode, dput the parent.
- * Requires dentry->d_lock is held, and dentry->d_count == 0.
- * Releases dentry->d_lock.
- *
Re: dcache shrink list corruption?
On Tue, Apr 29, 2014 at 7:31 PM, Al Viro wrote:
> OK, aggregate diff follows, more readable splitup (3 commits) attached. It seems to survive beating here; testing, review and comments are welcome.

Miklos, did you have some particular load that triggered this, or was it just some reports? It would be really good to get this patch some stress-testing.

I like how the patch removes more lines than it adds, but apart from that it's hard to read the patch (even the split-out ones) and say anything more about it. I think this needs a *lot* of testing.

Linus
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 12:20:13AM +0100, Al Viro wrote:
> On Tue, Apr 29, 2014 at 04:04:11PM -0700, Linus Torvalds wrote:
> > But at a minimum, we have "d_op->d_prune()" that would now be possibly
> > be called for the old dentry *after* a new dentry has been allocated.
> > Not to mention the inode not having been dropped. So it looks like a
> > disaster where the filesystem now ends up having concurrent "live"
> > dentries for the same entry. Even if one of them is on its way out,
> > it's still visible to the filesystem. That makes me very
> > uncomfortable.
> >
> > I'm starting to think that Miklos' original patch (perhaps with the
> > spinlock split to at least be one per superblock) is the most
> > straightforward approach after all. It's annoying having to add locks
> > here, but the whole pruning path should not be a hotpath, so maybe
> > it's not actually a big deal.
>
> FWIW, my solution is more or less working; I'll give more local beating
> and post it...

OK, aggregate diff follows, more readable splitup (3 commits) attached.
It seems to survive beating here; testing, review and comments are welcome.

diff --git a/fs/dcache.c b/fs/dcache.c
index 494a9def..a83b933 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -246,16 +246,8 @@ static void __d_free(struct rcu_head *head)
 	kmem_cache_free(dentry_cache, dentry);
 }
 
-/*
- * no locks, please.
- */
-static void d_free(struct dentry *dentry)
+static void dentry_free(struct dentry *dentry)
 {
-	BUG_ON((int)dentry->d_lockref.count > 0);
-	this_cpu_dec(nr_dentry);
-	if (dentry->d_op && dentry->d_op->d_release)
-		dentry->d_op->d_release(dentry);
-
 	/* if dentry was never visible to RCU, immediate free is OK */
 	if (!(dentry->d_flags & DCACHE_RCUACCESS))
 		__d_free(&dentry->d_u.d_rcu);
@@ -420,40 +412,6 @@ static void dentry_lru_del(struct dentry *dentry)
 }
 
 /**
- * d_kill - kill dentry and return parent
- * @dentry: dentry to kill
- * @parent: parent dentry
- *
- * The dentry must already be unhashed and removed from the LRU.
- *
- * If this is the root of the dentry tree, return NULL.
- *
- * dentry->d_lock and parent->d_lock must be held by caller, and are dropped by
- * d_kill.
- */
-static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
-	__releases(dentry->d_lock)
-	__releases(parent->d_lock)
-	__releases(dentry->d_inode->i_lock)
-{
-	list_del(&dentry->d_u.d_child);
-	/*
-	 * Inform d_walk() that we are no longer attached to the
-	 * dentry tree
-	 */
-	dentry->d_flags |= DCACHE_DENTRY_KILLED;
-	if (parent)
-		spin_unlock(&parent->d_lock);
-	dentry_iput(dentry);
-	/*
-	 * dentry_iput drops the locks, at which point nobody (except
-	 * transient RCU lookups) can reach this dentry.
-	 */
-	d_free(dentry);
-	return parent;
-}
-
-/**
  * d_drop - drop a dentry
  * @dentry: dentry to drop
  *
@@ -499,6 +457,8 @@ void d_drop(struct dentry *dentry)
 }
 EXPORT_SYMBOL(d_drop);
 
+static DECLARE_WAIT_QUEUE_HEAD(free_wq);
+
 /*
  * Finish off a dentry we've decided to kill.
  * dentry->d_lock must be held, returns with it unlocked.
@@ -506,16 +466,17 @@ EXPORT_SYMBOL(d_drop);
  * Returns dentry requiring refcount drop, or NULL if we're done.
  */
 static struct dentry *
-dentry_kill(struct dentry *dentry, int unlock_on_failure)
+dentry_kill(struct dentry *dentry, struct list_head *morgue)
 	__releases(dentry->d_lock)
 {
 	struct inode *inode;
 	struct dentry *parent;
+	bool can_free = true;
 
 	inode = dentry->d_inode;
 	if (inode && !spin_trylock(&inode->i_lock)) {
 relock:
-		if (unlock_on_failure) {
+		if (!morgue) {
 			spin_unlock(&dentry->d_lock);
 			cpu_relax();
 		}
@@ -531,6 +492,15 @@ relock:
 		goto relock;
 	}
 
+	if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+		/* morgue must be non-NULL */
+		list_move(&dentry->d_lru, morgue);
+		if (parent)
+			spin_unlock(&parent->d_lock);
+		/* inode must be NULL */
+		spin_unlock(&dentry->d_lock);
+		return parent;
+	}
 	/*
 	 * The dentry is now unrecoverably dead to the world.
 	 */
@@ -543,10 +513,43 @@ relock:
 	if ((dentry->d_flags & DCACHE_OP_PRUNE) && !d_unhashed(dentry))
 		dentry->d_op->d_prune(dentry);
 
-	dentry_lru_del(dentry);
+	if (dentry->d_flags & DCACHE_LRU_LIST) {
+		if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
+			d_lru_del(dentry);
+		else if (morgue)
+			d_shrink_del(dentry);
+		else
+			can_free = false;
+	}
 	/* if it was on the hash then remove it */
 	__d_drop(dentry);
-	return d_kill(dentry, parent);
+
Re: dcache shrink list corruption?
On Tue, Apr 29, 2014 at 04:04:11PM -0700, Linus Torvalds wrote:
> But at a minimum, we have "d_op->d_prune()" that would now be possibly
> be called for the old dentry *after* a new dentry has been allocated.
> Not to mention the inode not having been dropped. So it looks like a
> disaster where the filesystem now ends up having concurrent "live"
> dentries for the same entry. Even if one of them is on its way out,
> it's still visible to the filesystem. That makes me very
> uncomfortable.
>
> I'm starting to think that Miklos' original patch (perhaps with the
> spinlock split to at least be one per superblock) is the most
> straightforward approach after all. It's annoying having to add locks
> here, but the whole pruning path should not be a hotpath, so maybe
> it's not actually a big deal.

FWIW, my solution is more or less working; I'll give more local beating
and post it...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: dcache shrink list corruption?
On Tue, Apr 29, 2014 at 2:48 PM, Al Viro wrote:
>
> Ummm... You mean, have d_lookup() et.al. fail on something that is on
> a shrink list?

So I tried to see if that would work: just consider it dead by the time
it hits the shrink list, and if somebody does a lookup on the dentry,
fail on it and just allocate a new dentry and do a real lookup.

But at a minimum, we have "d_op->d_prune()" that would now possibly be
called for the old dentry *after* a new dentry has been allocated. Not
to mention the inode not having been dropped. So it looks like a
disaster where the filesystem now ends up having concurrent "live"
dentries for the same entry. Even if one of them is on its way out,
it's still visible to the filesystem. That makes me very uncomfortable.

I'm starting to think that Miklos' original patch (perhaps with the
spinlock split to at least be one per superblock) is the most
straightforward approach after all. It's annoying having to add locks
here, but the whole pruning path should not be a hotpath, so maybe
it's not actually a big deal.

             Linus
Re: dcache shrink list corruption?
On Wed, Apr 30, 2014 at 07:18:51AM +1000, Dave Chinner wrote:
> Seems like it would work, but it seems fragile to me - I'm
> wondering how we can ensure that the private shrink list
> manipulations can be kept private.
>
> We have a similar situation with the inode cache (private shrink
> list) but the I_FREEING flag is set the entire time the inode is on
> the shrink list. Any new hash lookup or attempt to grab the inode
> that occurs while I_FREEING is set fails, so perhaps dentries also
> need a well defined "being torn down and freed" state where new
> references cannot be taken even though the dentry can still be
> found...

Ummm... You mean, have d_lookup() et.al. fail on something that is on
a shrink list?