Re: dcache shrink list corruption?

2014-05-06 Thread Al Viro
On Tue, May 06, 2014 at 10:01:21AM -0700, Linus Torvalds wrote:
> On Tue, May 6, 2014 at 9:52 AM, Al Viro  wrote:
> >
> > OK...  There's one more thing I would like to put there, if you are going to
> > be away for the week.  It has sat in -next for a while, and it could stay
> > there, except that there's a _lot_ of followups touching stuff all over the
> > tree and I'd obviously prefer those to go into subsystem trees.  Which
> > means inter-tree dependencies ;-/  Would you be OK if I included that one
> > into pull request?  It just turns kvfree() into inline and takes it to
> > mm.h, next to is_vmalloc_addr();
> 
> Is there any particular reason for inlining this? I'd actually rather
> not, _especially_ if it means that a lot of "is_vmalloc_addr()" usage
> goes away and we may end up with this as a real interface.

OK, I can live with that.  dcache fixes + kvfree() (uninlined, exported by
mm/util.c) + posix_acl bugfix from hch.  Please, pull from
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus

Shortlog:
Al Viro (9):
  fix races between __d_instantiate() and checks of dentry flags
  fold d_kill() and d_free()
  fold try_prune_one_dentry()
  new helper: dentry_free()
  expand the call of dentry_lru_del() in dentry_kill()
  dentry_kill(): don't try to remove from shrink list
  don't remove from shrink list in select_collect()
  more graceful recovery in umount_collect()
  nick kvfree() from apparmor

Christoph Hellwig (1):
  posix_acl: handle NULL ACL in posix_acl_equiv_mode

Miklos Szeredi (1):
  dcache: don't need rcu in shrink_dentry_list()

Diffstat:
 fs/dcache.c  |  318 ---
 fs/namei.c   |6 +-
 fs/posix_acl.c   |6 ++
 include/linux/dcache.h   |2 +
 include/linux/mm.h   |2 +
 mm/util.c|   10 +++
 security/apparmor/include/apparmor.h |1 -
 security/apparmor/lib.c  |   14 
 8 files changed, 125 insertions(+), 234 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dcache shrink list corruption?

2014-05-06 Thread Linus Torvalds
On Tue, May 6, 2014 at 9:52 AM, Al Viro  wrote:
>
> OK...  There's one more thing I would like to put there, if you are going to
> be away for the week.  It has sat in -next for a while, and it could stay
> there, except that there's a _lot_ of followups touching stuff all over the
> tree and I'd obviously prefer those to go into subsystem trees.  Which
> means inter-tree dependencies ;-/  Would you be OK if I included that one
> into pull request?  It just turns kvfree() into inline and takes it to
> mm.h, next to is_vmalloc_addr();

Is there any particular reason for inlining this? I'd actually rather
not, _especially_ if it means that a lot of "is_vmalloc_addr()" usage
goes away and we may end up with this as a real interface.

But no, I don't hate the patch.

  Linus


Re: dcache shrink list corruption?

2014-05-06 Thread Al Viro
On Tue, May 06, 2014 at 07:53:19AM -0700, Linus Torvalds wrote:
> On Tue, May 6, 2014 at 3:17 AM, Miklos Szeredi  wrote:
> >
> > Patches look okay to me.
> >
> > Reviewed-by: Miklos Szeredi 
> >
> >>   dentry_kill(): don't try to remove from shrink list
> >
> > Backport of this to 3.12 was tested by IBM and apparently fixes the
> > issue for them (I didn't backport the cleanup patches, only the actual
> > fix)
> 
> Ok, good.
> 
> >>   don't remove from shrink list in select_collect()
> >
> > I've also asked them to test this, although I think this is even
> > harder to trigger.  But at least the non-racy codepaths need to be
> > tested.
> 
> I'll be incommunicado all next week, so I think I should merge this
> all now rather than later.
> 
> Al, mind doing a real pull request?

OK...  There's one more thing I would like to put there, if you are going to
be away for the week.  It has sat in -next for a while, and it could stay
there, except that there's a _lot_ of followups touching stuff all over the
tree and I'd obviously prefer those to go into subsystem trees.  Which
means inter-tree dependencies ;-/  Would you be OK if I included that one
into pull request?  It just turns kvfree() into inline and takes it to
mm.h, next to is_vmalloc_addr(); we have *lots* of open-coded instances of
that all over the place.  As the matter of fact, more than a half of
is_vmalloc_addr() call sites are of that sort...

commit 3c91dc1ce40f973bbceded3d8ce96bda7c4d480c
Author: Al Viro 
Date:   Wed Apr 23 10:13:03 2014 -0400

nick kvfree() from apparmor/lib

too many open-coded instances

Signed-off-by: Al Viro 

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf9811e..a784964 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -370,6 +370,17 @@ static inline int is_vmalloc_or_module_addr(const void *x)
 }
 #endif
 
+static inline void kvfree(const void *x)
+{
+   /* include order mess... */
+   extern void kfree(const void *);
+   extern void vfree(const void *);
+   if (is_vmalloc_addr(x))
+   vfree(x);
+   else
+   kfree(x);
+}
+
 static inline void compound_lock(struct page *page)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/security/apparmor/include/apparmor.h b/security/apparmor/include/apparmor.h
index 8fb1488..97130f8 100644
--- a/security/apparmor/include/apparmor.h
+++ b/security/apparmor/include/apparmor.h
@@ -66,7 +66,6 @@ extern int apparmor_initialized __initdata;
 char *aa_split_fqname(char *args, char **ns_name);
 void aa_info_message(const char *str);
 void *__aa_kvmalloc(size_t size, gfp_t flags);
-void kvfree(void *buffer);
 
 static inline void *kvmalloc(size_t size)
 {
diff --git a/security/apparmor/lib.c b/security/apparmor/lib.c
index 6968992..c1827e0 100644
--- a/security/apparmor/lib.c
+++ b/security/apparmor/lib.c
@@ -104,17 +104,3 @@ void *__aa_kvmalloc(size_t size, gfp_t flags)
}
return buffer;
 }
-
-/**
- * kvfree - free an allocation do by kvmalloc
- * @buffer: buffer to free (MAYBE_NULL)
- *
- * Free a buffer allocated by kvmalloc
- */
-void kvfree(void *buffer)
-{
-   if (is_vmalloc_addr(buffer))
-   vfree(buffer);
-   else
-   kfree(buffer);
-}


Re: dcache shrink list corruption?

2014-05-06 Thread Linus Torvalds
On Tue, May 6, 2014 at 3:17 AM, Miklos Szeredi  wrote:
>
> Patches look okay to me.
>
> Reviewed-by: Miklos Szeredi 
>
>>   dentry_kill(): don't try to remove from shrink list
>
> Backport of this to 3.12 was tested by IBM and apparently fixes the
> issue for them (I didn't backport the cleanup patches, only the actual
> fix)

Ok, good.

>>   don't remove from shrink list in select_collect()
>
> I've also asked them to test this, although I think this is even
> harder to trigger.  But at least the non-racy codepaths need to be
> tested.

I'll be incommunicado all next week, so I think I should merge this
all now rather than later.

Al, mind doing a real pull request?

Linus


Re: dcache shrink list corruption?

2014-05-06 Thread Miklos Szeredi
On Sun, May 4, 2014 at 8:29 AM, Al Viro  wrote:
> On Sat, May 03, 2014 at 07:21:11PM +0100, Al Viro wrote:
>> On Sat, May 03, 2014 at 05:26:04AM +0100, Al Viro wrote:
>>
>> > See vfs.git#dentry_kill-3; warning - this is completely untested and I would
>> > really like comments on spinning case there (i.e. the one where select_collect()
>> > finds some stuff already on some other shrink list and nothing with zero
>> > refcount that wouldn't be there).  In that case (and it's basically "somebody
>> > else is evicting stuff in our subtree and they'd already picked everything
>> > we want evicted") I just let the loop in check_submounts_and_drop() repeat
>> > (we do have cond_resched() there).  Any better suggestions would be welcome...
>>
>> Hmm...  As the matter of fact, the whole shrink_dcache_for_umount() could
>> be made a lot saner.  What we need is to reuse shrink_dcache_parent()
>> and follow it with d_walk() that would just go through whatever remains and
>> complain about the leaves of that.  For anon roots we'll obviously need to
>> wrap that into dget and d_drop/dput.
>>
>> I'm testing that right now; everything seems to be working so far and if
>> it survives, I'll push that sucker out.  Total since the beginning of
>> the whole series:
>>  fs/dcache.c|  310 -
>>  include/linux/dcache.h |2 +
>>  2 files changed, 112 insertions(+), 200 deletions(-)
>
> No regressions compared to mainline; force-pushed into vfs#dentry_kill-3.
> Review and testing would be very welcome...  Summary for that branch:
> (it's *not* a pull request yet; the thing really needs review)

Patches look okay to me.

Reviewed-by: Miklos Szeredi 

>   dentry_kill(): don't try to remove from shrink list

Backport of this to 3.12 was tested by IBM and apparently fixes the
issue for them (I didn't backport the cleanup patches, only the actual
fix)

>   don't remove from shrink list in select_collect()

I've also asked them to test this, although I think this is even
harder to trigger.  But at least the non-racy codepaths need to be
tested.

Thanks,
Miklos


Re: dcache shrink list corruption?

2014-05-04 Thread Al Viro
On Sat, May 03, 2014 at 07:21:11PM +0100, Al Viro wrote:
> On Sat, May 03, 2014 at 05:26:04AM +0100, Al Viro wrote:
> 
> > See vfs.git#dentry_kill-3; warning - this is completely untested and I would
> > really like comments on spinning case there (i.e. the one where select_collect()
> > finds some stuff already on some other shrink list and nothing with zero
> > refcount that wouldn't be there).  In that case (and it's basically "somebody
> > else is evicting stuff in our subtree and they'd already picked everything
> > we want evicted") I just let the loop in check_submounts_and_drop() repeat
> > (we do have cond_resched() there).  Any better suggestions would be welcome...
> 
> Hmm...  As the matter of fact, the whole shrink_dcache_for_umount() could
> be made a lot saner.  What we need is to reuse shrink_dcache_parent()
> and follow it with d_walk() that would just go through whatever remains and
> complain about the leaves of that.  For anon roots we'll obviously need to
> wrap that into dget and d_drop/dput.
> 
> I'm testing that right now; everything seems to be working so far and if
> it survives, I'll push that sucker out.  Total since the beginning of
> the whole series:
>  fs/dcache.c|  310 -
>  include/linux/dcache.h |2 +
>  2 files changed, 112 insertions(+), 200 deletions(-)

No regressions compared to mainline; force-pushed into vfs#dentry_kill-3.
Review and testing would be very welcome...  Summary for that branch:
(it's *not* a pull request yet; the thing really needs review)
Shortlog:
Al Viro (8):
  fix races between __d_instantiate() and checks of dentry flags
  fold d_kill() and d_free()
  fold try_prune_one_dentry()
  new helper: dentry_free()
  expand the call of dentry_lru_del() in dentry_kill()
  dentry_kill(): don't try to remove from shrink list
  don't remove from shrink list in select_collect()
  more graceful recovery in umount_collect()

Miklos Szeredi (1):
  dcache: don't need rcu in shrink_dentry_list()

Diffstat:
 fs/dcache.c|  318 +
 fs/namei.c |6 +-
 include/linux/dcache.h |2 +
 3 files changed, 107 insertions(+), 219 deletions(-)



Re: dcache shrink list corruption?

2014-05-03 Thread Al Viro
On Sat, May 03, 2014 at 11:07:57AM -0700, Linus Torvalds wrote:
> Sure, umount itself should be serialized by the sb lock, so there
> should be only one umount dentry collector. But why wouldn't there be
> shrinkers active due to memory pressure?
> 
> generic_unmount_super() is called by ->kill_sb(), which is done
> *before* the superblock shrinker is unregistered.  So any memory
> pressure during that will cause dentries to be shrunk other ways.
> 
> What am I missing?

This:
        if (!grab_super_passive(sb))
                return SHRINK_STOP;
before calling prune_dcache_sb().  grab_super_passive() returns with
->s_umount held shared on success (with down_read_trylock()) and ->kill_sb()
is called only with ->s_umount held exclusive.


Re: dcache shrink list corruption?

2014-05-03 Thread Al Viro
On Sat, May 03, 2014 at 05:26:04AM +0100, Al Viro wrote:

> See vfs.git#dentry_kill-3; warning - this is completely untested and I would
> really like comments on spinning case there (i.e. the one where select_collect()
> finds some stuff already on some other shrink list and nothing with zero
> refcount that wouldn't be there).  In that case (and it's basically "somebody
> else is evicting stuff in our subtree and they'd already picked everything
> we want evicted") I just let the loop in check_submounts_and_drop() repeat
> (we do have cond_resched() there).  Any better suggestions would be welcome...

Hmm...  As the matter of fact, the whole shrink_dcache_for_umount() could
be made a lot saner.  What we need is to reuse shrink_dcache_parent()
and follow it with d_walk() that would just go through whatever remains and
complain about the leaves of that.  For anon roots we'll obviously need to
wrap that into dget and d_drop/dput.

I'm testing that right now; everything seems to be working so far and if
it survives, I'll push that sucker out.  Total since the beginning of
the whole series:
 fs/dcache.c|  310 -
 include/linux/dcache.h |2 +
 2 files changed, 112 insertions(+), 200 deletions(-)

Anyway, I'm off to finish with acct.c races, then it's time for Eric's
lazy umount on unlink series; this work is getting unpleasantly close
to the areas he's touching, so I'd better get it on top of this stuff
and into -next...


Re: dcache shrink list corruption?

2014-05-03 Thread Linus Torvalds
On Fri, May 2, 2014 at 9:26 PM, Al Viro  wrote:
>
> See vfs.git#dentry_kill-3; warning - this is completely untested and I would
> really like comments on spinning case there (i.e. the one where select_collect()
> finds some stuff already on some other shrink list and nothing with zero
> refcount that wouldn't be there).

I'm not seeing why you say that there can be no other shrinkers active
during umount (that whole "buggered" thing).

Sure, umount itself should be serialized by the sb lock, so there
should be only one umount dentry collector. But why wouldn't there be
shrinkers active due to memory pressure?

generic_unmount_super() is called by ->kill_sb(), which is done
*before* the superblock shrinker is unregistered.  So any memory
pressure during that will cause dentries to be shrunk other ways.

What am I missing?

Linus


Re: dcache shrink list corruption?

2014-05-03 Thread Linus Torvalds
On Fri, May 2, 2014 at 9:26 PM, Al Viro v...@zeniv.linux.org.uk wrote:

 See vfs.git#dentry_kill-3; warning - this is completely untested and I would
 really like comments on spinning case there (i.e. the one where 
 select_collect()
 finds some stuff already on some other shrink list and nothing with zero
 refcount that wouldn't be there).

I'm not seeing why you say that there can be no other shrinkers active
during umount (that whole buggered thing).

Sure, umount itself should be serialized by the sb lock, so there
should be only one umount dentry collector. But why wouldn't there be
shrinkers active due to memory pressure?

generic_unmount_super() is called by -kill_sb(), which is done
*before* the superblock shrinker is unregistered  So any memory
pressure during that will cause dentries to be shrunk other ways.

What am I missing?

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dcache shrink list corruption?

2014-05-03 Thread Al Viro
On Sat, May 03, 2014 at 05:26:04AM +0100, Al Viro wrote:

 See vfs.git#dentry_kill-3; warning - this is completely untested and I would
 really like comments on spinning case there (i.e. the one where 
 select_collect()
 finds some stuff already on some other shrink list and nothing with zero
 refcount that wouldn't be there).  In that case (and it's basically somebody
 else is evicting stuff in our subtree and they'd already picked everything
 we want evicted) I just let the loop in check_submounts_and_drop() repeat
 (we do have cond_resched() there).  Any better suggestions would be welcome...

Hmm...  As the matter of fact, the whole shrink_dcache_for_umount() could
be made a lot saner.  What we need is to reuse shrink_dcache_parent()
and follow it with d_walk() that would just go through whatever remains and
complain about the leaves of that.  For anon roots we'll obviously need to
wrap that into dget and d_drop/dput.

I'm testing that right now; everything seems to be working so far and if
it survives, I'll push that sucker out.  Total since the beginning of
the whole series:
 fs/dcache.c|  310 
-
 include/linux/dcache.h |2 +
 2 files changed, 112 insertions(+), 200 deletions(-)

Anyway, I'm off to finish with acct.c races, then it's time for Eric's
lazy umount on unlink series; this work is getting unpleasantly close
to the areas he's touching, so I'd better get it on top of this stuff
and into -next...
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dcache shrink list corruption?

2014-05-03 Thread Al Viro
On Sat, May 03, 2014 at 11:07:57AM -0700, Linus Torvalds wrote:
> Sure, umount itself should be serialized by the sb lock, so there
> should be only one umount dentry collector. But why wouldn't there be
> shrinkers active due to memory pressure?
>
> generic_unmount_super() is called by ->kill_sb(), which is done
> *before* the superblock shrinker is unregistered. So any memory
> pressure during that will cause dentries to be shrunk other ways.
>
> What am I missing?

This:
	if (!grab_super_passive(sb))
		return SHRINK_STOP;
before calling prune_dcache_sb().  grab_super_passive() returns with
->s_umount held shared on success (with down_read_trylock()) and ->kill_sb()
is called only with ->s_umount held exclusive.


Re: dcache shrink list corruption?

2014-05-02 Thread Al Viro
On Fri, May 02, 2014 at 11:40:22PM +0100, Al Viro wrote:
> On Fri, May 02, 2014 at 02:18:43PM -0700, Linus Torvalds wrote:
> > On Fri, May 2, 2014 at 2:08 PM, Miklos Szeredi  wrote:
> > > There's more of the "delete from shrink list not owned by us" in select 
> > > parent.
> > > Proposed patch appended.
> > 
> > Ahh. Clearly this needs more work before I pull.
> 
> *nod*
> 
> Besides, I want to put Miklos' "don't bother with RCU in shrink_dentry_list()"
> in there as soon as select_collect() has been dealt with.  I don't think
> that the currently posted patch for select_collect() is right, though -
> see my reply to parent posting.  Basically, I think we should treat "it's
> on the shrink list already" as "increment data->found and keep going".  IOW,
>   if (on shrink list) {
>   data->found++;
>   } else {
>   if (on lru list)
>   d_lru_del
>   if (refcount is zero) {
>   d_shrink_add
>   data->found++;
>   }
>   }
>   if (data->found)
>   ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;

See vfs.git#dentry_kill-3; warning - this is completely untested and I would
really like comments on spinning case there (i.e. the one where select_collect()
finds some stuff already on some other shrink list and nothing with zero
refcount that wouldn't be there).  In that case (and it's basically "somebody
else is evicting stuff in our subtree and they'd already picked everything
we want evicted") I just let the loop in check_submounts_and_drop() repeat
(we do have cond_resched() there).  Any better suggestions would be welcome...


Re: dcache shrink list corruption?

2014-05-02 Thread Al Viro
On Fri, May 02, 2014 at 11:40:22PM +0100, Al Viro wrote:
> On Fri, May 02, 2014 at 02:18:43PM -0700, Linus Torvalds wrote:
> > On Fri, May 2, 2014 at 2:08 PM, Miklos Szeredi  wrote:
> > > There's more of the "delete from shrink list not owned by us" in select 
> > > parent.
> > > Proposed patch appended.
> > 
> > Ahh. Clearly this needs more work before I pull.
> 
> *nod*
> 
> Besides, I want to put Miklos' "don't bother with RCU in shrink_dentry_list()"
> in there as soon as select_collect() has been dealt with.  I don't think
> that the currently posted patch for select_collect() is right, though -
> see my reply to parent posting.  Basically, I think we should treat "it's
> on the shrink list already" as "increment data->found and keep going".  IOW,
>   if (on shrink list) {
>   data->found++;
>   } else {
>   if (on lru list)
>   d_lru_del
>   if (refcount is zero) {
>   d_shrink_add
>   data->found++;
>   }
>   }
>   if (data->found)
>   ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;

While we are at it - BUG() in umount_collect() is probably a bad idea.
At that point we are holding ->s_umount, so it guarantees that a lot
of stuff from that point on will get stuck.  Starting with sync(2).
And I really doubt that damage from WARN() instead will be more...


Re: dcache shrink list corruption?

2014-05-02 Thread Al Viro
On Fri, May 02, 2014 at 02:18:43PM -0700, Linus Torvalds wrote:
> On Fri, May 2, 2014 at 2:08 PM, Miklos Szeredi  wrote:
> > There's more of the "delete from shrink list not owned by us" in select 
> > parent.
> > Proposed patch appended.
> 
> Ahh. Clearly this needs more work before I pull.

*nod*

Besides, I want to put Miklos' "don't bother with RCU in shrink_dentry_list()"
in there as soon as select_collect() has been dealt with.  I don't think
that the currently posted patch for select_collect() is right, though -
see my reply to parent posting.  Basically, I think we should treat "it's
on the shrink list already" as "increment data->found and keep going".  IOW,
if (on shrink list) {
data->found++;
} else {
if (on lru list)
d_lru_del
if (refcount is zero) {
d_shrink_add
data->found++;
}
}
if (data->found)
ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;


Re: dcache shrink list corruption?

2014-05-02 Thread Al Viro
On Fri, May 02, 2014 at 11:08:13PM +0200, Miklos Szeredi wrote:
> There's more of the "delete from shrink list not owned by us" in select 
> parent.
> Proposed patch appended.

While it certainly looks like dentry_lru_del() should die, I really wonder if
"let's pretend that dentry isn't there if it's on some other shrink list"
is the right approach.  You've already noticed one problem (check-and-drop
giving false busy indications), but shrink_dcache_parent() has similar
issues.  How about incrementing data->found instead?  That'll end up
rescanning the tree in case if it's not ours; so what?  If it's another
process doing the same shrinking in a subtree, we want to let it do its
job anyway.  Sure, somebody doing mount -o remount in a loop might be
able to stall the living hell out of us, for as long as new non-busy
dentries are being added in our subtree, but the second part in itself
is sufficient; we will keep picking those new non-busy dentries as long
as they keep coming.  And if they do not, eventually they will be all
taken out by us or by those other shrinkers and we'll be done.
 
> And I'm not sure what umount_collect() is supposed to do.  Can other shrinkers
> still be active at that point?  That would present other problems, no?

No other shrinkers - prune_dcache_sb() and shrink_dcache_sb() are also
serialized by ->s_umount, shrink_dcache_parent() and check_submounts_and_drop()
are called only when an active reference is held, which obviously prevents
fs shutdown.

> Also somewhat related is the question: how check_submounts_and_drop() could be
> guaranteed correctness (timely removal of all unsed dentries) in the presence 
> of
> other shrinkers?

Another interesting question is what the hell are shrink_dcache_parent()
callers expecting.  E.g. what's going on in fuse_reverse_inval_entry()?
And what is nilfs_tree_is_busy() about?

FWIW, I'm not at all sure that vfs_rmdir() and vfs_rename() have any reason
to call it these days, and dentry_unhash() one simply ought to die - it's used
only by hpfs unlink() in case it wants to truncate in order to work around
-ENOSPC.  And _that_ won't have any subdirectories to deal with anyway, so
shrink_dcache_parent() there is a no-op, even if we keep the damn helper alive.
The rest of callers also look dubious...


Re: dcache shrink list corruption?

2014-05-02 Thread Linus Torvalds
On Fri, May 2, 2014 at 2:08 PM, Miklos Szeredi  wrote:
> There's more of the "delete from shrink list not owned by us" in select 
> parent.
> Proposed patch appended.

Ahh. Clearly this needs more work before I pull.

 Linus


Re: dcache shrink list corruption?

2014-05-02 Thread Miklos Szeredi
There's more of the "delete from shrink list not owned by us" in select parent.
Proposed patch appended.

And I'm not sure what umount_collect() is supposed to do.  Can other shrinkers
still be active at that point?  That would present other problems, no?

Also somewhat related is the question: how check_submounts_and_drop() could be
guaranteed correctness (timely removal of all unused dentries) in the presence of
other shrinkers?

Thanks,
Miklos



From: Miklos Szeredi 
Subject: dcache: select_collect(): don't remove from shrink list

Shrink lists are not protected by any lock, so don't remove from an unknown
one.

Signed-off-by: Miklos Szeredi 
---
 fs/dcache.c |   23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1241,18 +1241,19 @@ static enum d_walk_ret select_collect(vo
 		 * loop in shrink_dcache_parent() might not make any progress
 		 * and loop forever.
 		 */
-		if (dentry->d_lockref.count) {
-			dentry_lru_del(dentry);
-		} else if (!(dentry->d_flags & DCACHE_SHRINK_LIST)) {
-			/*
-			 * We can't use d_lru_shrink_move() because we
-			 * need to get the global LRU lock and do the
-			 * LRU accounting.
-			 */
+		if ((dentry->d_flags & (DCACHE_SHRINK_LIST | DCACHE_LRU_LIST)) ==
+		    DCACHE_LRU_LIST) {
 			d_lru_del(dentry);
-			d_shrink_add(dentry, &data->dispose);
-			data->found++;
-			ret = D_WALK_NORETRY;
+			if (!dentry->d_lockref.count) {
+				/*
+				 * We can't use d_lru_shrink_move() because we
+				 * need to get the global LRU lock and do the
+				 * LRU accounting.
+				 */
+				d_shrink_add(dentry, &data->dispose);
+				data->found++;
+				ret = D_WALK_NORETRY;
+			}
 		}
 		/*
 		 * We can return to the caller if we have found some (this


Re: dcache shrink list corruption?

2014-05-02 Thread Linus Torvalds
On Thu, May 1, 2014 at 7:34 AM, Al Viro  wrote:
>
> OK, fixed and pushed (both branches).

Al, can you send a real pull request (the "both branches" part in
particular makes me worry about which one you think is right), because
I suspect by now we just need to get this wider testing.

  Linus


Re: dcache shrink list corruption?

2014-05-02 Thread Miklos Szeredi
On Fri, May 02, 2014 at 11:00:04AM +0200, Szeredi Miklos wrote:

> The bug is private, but I'll ask if I can repost it.  The first thing
> is a warning from the D_FLAG_VERIFY() in d_shrink_del() from
> shrink_dentry_list().  They added printks that show that the dentry
> has DCACHE_DENTRY_KILLED.
> 
> We could ask for a dump, but this is the only rational explanation I
> could find for this (and a shrink list with two dentries, with racing
> dput on both nicely explains the case where the shrink list's prev
> pointer still points to the already killed dentry).

Bug details:

==
WARNING: at
/home/abuild/rpmbuild/BUILD/kernel-default-3.12.15/linux-3.12/fs/dcache.c:392
 Modules linked in: iptable_filter ip_tables x_tables rpcsec_gss_krb5
auth_rpcgss nfsv4 dns_resolver nfs lockd sunrpc fscache af_packet xfs
libcrc32c dm_mod autofs4 btrfs xor sr_mod cdrom usb_storage raid6_pq sd_mod
crc_t10dif ibmvscsi(X) scsi_transport_srp scsi_tgt ibmveth(X) ohci_pci
ohci_hcd ehci_hcd usbcore usb_common sg scsi_mod
 Supported: Yes
 CPU: 7 PID: 25685 Comm: host01 Tainted: G  X 3.12.15-3-default #1
 task: c003f7fb68f0 ti: c003f31f4000 task.ti: c003f31f4000
 NIP: c024552c LR: c02462b8 CTR: c02aad40
 REGS: c003f31f7530 TRAP: 0700   Tainted: G  X 
(3.12.15-3-default)
 MSR: 800100029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24242448  XER: 2000
 SOFTE: 1
 CFAR: c02454b0

 GPR00: c02462b8 c003f31f77b0 c0f10fb0 c003dfa7f028
 GPR04: 0001 c0e4fe00  0600
 GPR08: c0e30fb0 0001  
 GPR12: 24242442 cfe41880  
 GPR16:  010013556de0 6458 0002
 GPR20:  c084e940 0004 c003f8afa400
 GPR24: 6457 c003fde09e68  
 GPR28: c003dfa7f028 c003dfa7f028 c003f31f7948 c003dfa7f0a8
 NIP [c024552c] .d_shrink_del+0x9c/0xc0
 LR [c02462b8] .shrink_dentry_list+0xc8/0x160
 PACATMSCRATCH [80019033]
 Call Trace:
 [c003f31f77b0] [c003f8f1ef80] 0xc003f8f1ef80 (unreliable)
 [c003f31f7820] [c02462b8] .shrink_dentry_list+0xc8/0x160
 [c003f31f78d0] [c0246754] .shrink_dcache_parent+0x44/0xa0
 [c003f31f7980] [c02b0a0c] .proc_flush_task+0xbc/0x1f0
 [c003f31f7a70] [c008e554] .release_task+0x94/0x530
 [c003f31f7b50] [c008efb8] .wait_task_zombie+0x5c8/0x750
 [c003f31f7c10] [c008fa80] .do_wait+0x120/0x2c0
 [c003f31f7cd0] [c0091040] .SyS_wait4+0x90/0x130
 [c003f31f7dc0] [c00910fc] .SyS_waitpid+0x1c/0x30
 [c003f31f7e30] [c0009dfc] syscall_exit+0x0/0x7c
 Instruction dump:
 7d09502a 3908 7d09512a 4bdcaf19 6000 38210070 e8010010 7c0803a6
 4e800020 3d02fff2 8928dfd6 69290001 <0b09> 2fa9 41feff80 3921
 ---[ end trace 56e8481827564dc3 ]---
==
I forgot to mention that the TCP test used in the recreates is from the LTP
suite from http://ltp.sourceforge.net

It has been observed that each time we run into it so far, the host01 test is
the one running.
==
After installing the kernel with some conditional printk statements in it, the
condition hit and I caught a glimpse of it. I then dropped the system into xmon
to dump out some further info:

When we first detect that the dentry->d_flags is missing these flags
DCACHE_SHRINK_LIST | DCACHE_LRU_LIST we start to dump some information about
the dentry from within shrink_dentry_list() and got:

dentry c003d6240ce8 name 5 lockcnt -128 flags 1048780
post-shrink-del: dentry c003d6240ce8 name 5 lockcnt -128
dentry c003d6240ce8 lockcnt -128
dentry c003d6240ce8 name 5 lockcnt -128 flags 1048780
post-shrink-del: dentry c003d6240ce8 name 5 lockcnt -128
dentry c003d6240ce8 lockcnt -128
dentry c003d6240ce8 name 5 lockcnt -128 flags 1048780

and this repeats over and over...

The code (with printks) looks like this:

 899 for (;;) {
 900 dentry = list_entry_rcu(list->prev, struct dentry, d_lru);
 901 if (&dentry->d_lru == list)
 902 break; /* empty */
 903 
 904 if (print_more)
 905 printk(KERN_WARNING "dentry %p lockcnt %d\n",
 906 dentry, dentry->d_lockref.count);
 907 
 908 /*
 909  * Get the dentry lock, and re-verify that the dentry is
 910  * this on the shrinking list. If it is, we know that
 911  * DCACHE_SHRINK_LIST and DCACHE_LRU_LIST are set.
 912  */
 

Re: dcache shrink list corruption?

2014-05-02 Thread Szeredi Miklos
On Fri, May 2, 2014 at 7:51 AM, Al Viro  wrote:
> On Wed, Apr 30, 2014 at 11:15:15AM +0200, Miklos Szeredi wrote:

>> IBM is triggering this with the host01 test from the LTP suite.  I haven't yet
>> tried to reproduce it.
>
> Could you repost their report?  So far I hadn't managed to get actual list
> corruption on mainline kernel - it definitely is possible, but on all
> testcases so far the race window is too narrow.  So if they have a reproducer
> that doesn't take an insane amount of time, I'd really like to see it...

The only reproducer they have is LTP/host01, and it can trigger the bug
reliably within hours.

The bug is private, but I'll ask if I can repost it.  The first thing
is a warning from the D_FLAG_VERIFY() in d_shrink_del() from
shrink_dentry_list().  They added printks that show that the dentry
has DCACHE_DENTRY_KILLED.

We could ask for a dump, but this is the only rational explanation I
could find for this (and a shrink list with two dentries, with racing
dput on both nicely explains the case where the shrink list's prev
pointer still points to the already killed dentry).

Thanks,
Miklos


Re: dcache shrink list corruption?

2014-05-02 Thread Szeredi Miklos
On Thu, May 1, 2014 at 4:34 PM, Al Viro  wrote:
> On Thu, May 01, 2014 at 11:42:52AM +0200, Miklos Szeredi wrote:

>> - "bool foo = flag & FLAG" looks suspicious.  Is this guaranteed not to
>>   overflow?
>
> What do you mean, overflow?  It's not a 1-bit unsigned int; conversion to
> _Bool is different (which is the only reason why it's more than mere
> syntax sugar).  See C99 6.3.2.1 ("When any scalar value is converted
> to _Bool, the result is 0 if the value compares equal to 0; otherwise,
> the result is 1").

Ah, didn't know that.

Thanks,
Miklos


Re: dcache shrink list corruption?

2014-05-02 Thread Szeredi Miklos
On Thu, May 1, 2014 at 4:34 PM, Al Viro v...@zeniv.linux.org.uk wrote:
 On Thu, May 01, 2014 at 11:42:52AM +0200, Miklos Szeredi wrote:

 - bool foo = flag  FLAG looks suspicious.  Is this guaranteed not to
   overflow?

 What do you mean, overflow?  It's not a 1-bit unsigned int; conversion to
 _Bool is different (which is the only reason why it's more than mere
 syntax sugar).  See C99 6.3.2.1 (When any scalar value is converted
 to _Bool, the result is 0 if the value compares equal to 0; otherwise,
 the result is 1).

Ah, didn't know that.

Thanks,
Miklos
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dcache shrink list corruption?

2014-05-02 Thread Szeredi Miklos
On Fri, May 2, 2014 at 7:51 AM, Al Viro v...@zeniv.linux.org.uk wrote:
 On Wed, Apr 30, 2014 at 11:15:15AM +0200, Miklos Szeredi wrote:

 IBM is triggering this with the host01 test from the LTP suite.  I haven't 
 yet
 tried to reproduce it.

 Could you repost their report?  So far I hadn't managed to get actual list
 corruption on mainline kernel - it definitely is possible, but on all
 testcases so far the race window is too narrow.  So if they have a reproducer
 that doesn't take an insane amount of time, I'd really like to see it...

The only reproducer they have is LTP/host01, and can trigger the bug
reliably within hours.

The bug is private, but I'll ask if I can repost it.  The first thing
is a a warning from the D_FLAG_VERIFY() in d_shrink_del() from
shrink_dentry_list().  They added printks that show that the dentry
has DCACHE_DENTRY_KILLED.

We could ask for a dump, but this is the only rational explanation I
could find for this (and a shrink list with two dentries, with racing
dput on both nicely explains the case where the shrink list's prev
pointer still points to the already killed dentry).

Thanks,
Miklos
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dcache shrink list corruption?

2014-05-02 Thread Miklos Szeredi
On Fri, May 02, 2014 at 11:00:04AM +0200, Szeredi Miklos wrote:

 The bug is private, but I'll ask if I can repost it.  The first thing
 is a a warning from the D_FLAG_VERIFY() in d_shrink_del() from
 shrink_dentry_list().  They added printks that show that the dentry
 has DCACHE_DENTRY_KILLED.
 
 We could ask for a dump, but this is the only rational explanation I
 could find for this (and a shrink list with two dentries, with racing
 dput on both nicely explains the case where the shrink list's prev
 pointer still points to the already killed dentry).

Bug details:

==
WARNING: at
/home/abuild/rpmbuild/BUILD/kernel-default-3.12.15/linux-3.12/fs/dcache.c:392
 Modules linked in: iptable_filter ip_tables x_tables rpcsec_gss_krb5
auth_rpcgss nfsv4 dns_resolver nfs lockd sunrpc fscache af_packet xfs
libcrc32c dm_mod autofs4 btrfs xor sr_mod cdrom usb_storage raid6_pq sd_mod
crc_t10dif ibmvscsi(X) scsi_transport_srp scsi_tgt ibmveth(X) ohci_pci
ohci_hcd ehci_hcd usbcore usb_common sg scsi_mod
 Supported: Yes
 CPU: 7 PID: 25685 Comm: host01 Tainted: G  X 3.12.15-3-default #1
 task: c003f7fb68f0 ti: c003f31f4000 task.ti: c003f31f4000
 NIP: c024552c LR: c02462b8 CTR: c02aad40
 REGS: c003f31f7530 TRAP: 0700   Tainted: G  X 
(3.12.15-3-default)
 MSR: 800100029033 SF,EE,ME,IR,DR,RI,LE  CR: 24242448  XER: 2000
 SOFTE: 1
 CFAR: c02454b0

 GPR00: c02462b8 c003f31f77b0 c0f10fb0 c003dfa7f028
 GPR04: 0001 c0e4fe00  0600
 GPR08: c0e30fb0 0001  
 GPR12: 24242442 cfe41880  
 GPR16:  010013556de0 6458 0002
 GPR20:  c084e940 0004 c003f8afa400
 GPR24: 6457 c003fde09e68  
 GPR28: c003dfa7f028 c003dfa7f028 c003f31f7948 c003dfa7f0a8
 NIP [c024552c] .d_shrink_del+0x9c/0xc0
 LR [c02462b8] .shrink_dentry_list+0xc8/0x160
 PACATMSCRATCH [80019033]
 Call Trace:
 [c003f31f77b0] [c003f8f1ef80] 0xc003f8f1ef80 (unreliable)
 [c003f31f7820] [c02462b8] .shrink_dentry_list+0xc8/0x160
 [c003f31f78d0] [c0246754] .shrink_dcache_parent+0x44/0xa0
 [c003f31f7980] [c02b0a0c] .proc_flush_task+0xbc/0x1f0
 [c003f31f7a70] [c008e554] .release_task+0x94/0x530
 [c003f31f7b50] [c008efb8] .wait_task_zombie+0x5c8/0x750
 [c003f31f7c10] [c008fa80] .do_wait+0x120/0x2c0
 [c003f31f7cd0] [c0091040] .SyS_wait4+0x90/0x130
 [c003f31f7dc0] [c00910fc] .SyS_waitpid+0x1c/0x30
 [c003f31f7e30] [c0009dfc] syscall_exit+0x0/0x7c
 Instruction dump:
 7d09502a 3908 7d09512a 4bdcaf19 6000 38210070 e8010010 7c0803a6
 4e800020 3d02fff2 8928dfd6 69290001 0b09 2fa9 41feff80 3921
 ---[ end trace 56e8481827564dc3 ]---
==
I forgot to mention that the TCP test used in the recreates is from the LTP
suite from http://ltp.sourceforge.net

It has been observed that each time we run into it so far, the host01 test is
the one running.
==
After installing the kernel with some conditional printk statements in it, the
condition hit and I caught a glimpse of it. I then dropped the system into xmon
to dump out some further info:

When we first detect that the dentry-d_flags is missing these flags
DCACHE_SHRINK_LIST | DCACHE_LRU_LIST we start to dump some information about
the dentry from within shrink_dentry_list() and got:

dentry c003d6240ce8 name 5 lockcnt -128 flags 1048780
post-shrink-del: dentry c003d6240ce8 name 5 lockcnt -128
dentry c003d6240ce8 lockcnt -128
dentry c003d6240ce8 name 5 lockcnt -128 flags 1048780
post-shrink-del: dentry c003d6240ce8 name 5 lockcnt -128
dentry c003d6240ce8 lockcnt -128
dentry c003d6240ce8 name 5 lockcnt -128 flags 1048780

and this repeats over and over...

The code (with printks) looks like this:

 899 for (;;) {
 900 dentry = list_entry_rcu(list-prev, struct dentry, d_lru);
 901 if (dentry-d_lru == list)
 902 break; /* empty */
 903 
 904 if (print_more)
 905 printk(KERN_WARNING dentry %p lockcnt %d\n,
 906 dentry, dentry-d_lockref.count);
 907 
 908 /*
 909  * Get the dentry lock, and re-verify that the dentry is
 910  * this on the shrinking list. If it is, we know that
 911  * DCACHE_SHRINK_LIST and DCACHE_LRU_LIST are set.
 912  

Re: dcache shrink list corruption?

2014-05-02 Thread Linus Torvalds
On Thu, May 1, 2014 at 7:34 AM, Al Viro v...@zeniv.linux.org.uk wrote:

 OK, fixed and pushed (both branches).

Al, can you send a real pull request (the both branches part in
particular makes me worry about which one you think is right), because
I suspect by now we just need to get this wider testing.

  Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dcache shrink list corruption?

2014-05-02 Thread Miklos Szeredi
There's more of the delete from shrink list not owned by us in select parent.
Proposed patch appended.

And I'm not sure what umount_collect() is supposed to do.  Can other shrinkers
still be active at that point?  That would present other problems, no?

Also somewhat related is the question: how check_submounts_and_drop() could be
guaranteed correctness (timely removal of all unsed dentries) in the presence of
other shrinkers?

Thanks,
Miklos



From: Miklos Szeredi mszer...@suse.cz
Subject: dcache: select_collect(): don't remove from shrink list

Shrink lists are not protected by any lock, so don't remove from an unknown
one.

Signed-off-by: Miklos Szeredi mszer...@suse.cz
---
 fs/dcache.c |   23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1241,18 +1241,19 @@ static enum d_walk_ret select_collect(vo
 * loop in shrink_dcache_parent() might not make any progress
 * and loop forever.
 */
-   if (dentry-d_lockref.count) {
-   dentry_lru_del(dentry);
-   } else if (!(dentry-d_flags  DCACHE_SHRINK_LIST)) {
-   /*
-* We can't use d_lru_shrink_move() because we
-* need to get the global LRU lock and do the
-* LRU accounting.
-*/
+   if (dentry-d_flags  (DCACHE_SHRINK_LIST | DCACHE_LRU_LIST) ==
+   DCACHE_LRU_LIST) {
d_lru_del(dentry);
-   d_shrink_add(dentry, data-dispose);
-   data-found++;
-   ret = D_WALK_NORETRY;
+   if (!dentry-d_lockref.count) {
+   /*
+* We can't use d_lru_shrink_move() because we
+* need to get the global LRU lock and do the
+* LRU accounting.
+*/
+   d_shrink_add(dentry, data-dispose);
+   data-found++;
+   ret = D_WALK_NORETRY;
+   }
}
/*
 * We can return to the caller if we have found some (this
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dcache shrink list corruption?

2014-05-02 Thread Linus Torvalds
On Fri, May 2, 2014 at 2:08 PM, Miklos Szeredi mik...@szeredi.hu wrote:
 There's more of the delete from shrink list not owned by us in select 
 parent.
 Proposed patch appended.

Ahh. Clearly this needs more work before I pull.

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dcache shrink list corruption?

2014-05-02 Thread Al Viro
On Fri, May 02, 2014 at 11:08:13PM +0200, Miklos Szeredi wrote:
 There's more of the delete from shrink list not owned by us in select 
 parent.
 Proposed patch appended.

While it certainly looks like dentry_lru_del() should die, I really wonder if
let's pretend that dentry isn't there if it's on some other shrink list
is the right approach.  You've already noticed one problem (check-and-drop
giving false busy indications), but shrink_dcache_parent() has similar
issues.  How about incrementing data-found instead?  That'll end up
rescanning the tree in case if it's not ours; so what?  If it's another
process doing the same shrinking in a subtree, we want to let it do its
job anyway.  Sure, somebody doing mount -o remount in a loop might be
able to stall the living hell out of us, for as long as new non-busy
dentries are being added in our subtree, but the second part in itself
is sufficient; we will keep picking those new non-busy dentries as long
as they keep coming.  And if they do not, eventually they will be all
taken out by us or by those other shrinkers and we'll be done.
 
 And I'm not sure what umount_collect() is supposed to do.  Can other shrinkers
 still be active at that point?  That would present other problems, no?

No other shrinkers - prune_dcache_sb() and shrink_dcache_sb() are also
serialized by -s_umount, shrink_dcache_parent() and check_submounts_and_drop()
are called only when an active reference is held, which obviously prevents
fs shutdown.

 Also somewhat related is the question: how check_submounts_and_drop() could be
 guaranteed correctness (timely removal of all unsed dentries) in the presence 
 of
 other shrinkers?

Another interesting question is what the hell are shrink_dcache_parent()
callers expecting.  E.g. what's going on in fuse_reverse_inval_entry()?
And what is nilfs_tree_is_busy() about?

FWIW, I'm not at all sure that vfs_rmdir() and vfs_rename() have any reason
to call it these days, and dentry_unhash() one simply ought to die - it's used
only by hpfs unlink() in case it wants to truncate in order to work around
-ENOSPC.  And _that_ won't have any subdirectories to deal with anyway, so
shrink_dcache_parent() there is a no-op, even if we keep the damn helper alive.
The rest of callers also look dubious...
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dcache shrink list corruption?

2014-05-02 Thread Al Viro
On Fri, May 02, 2014 at 02:18:43PM -0700, Linus Torvalds wrote:
 On Fri, May 2, 2014 at 2:08 PM, Miklos Szeredi mik...@szeredi.hu wrote:
  There's more of the delete from shrink list not owned by us in select 
  parent.
  Proposed patch appended.
 
 Ahh. Clearly this needs more work before I pull.

*nod*

Besides, I want to put Miklos' don't bother with RCU in shrink_dentry_list()
in there as soon as select_collect() has been dealt with.  I don't think
that the currently posted patch for select_collect() is right, though -
see my reply to parent posting.  Basically, I think we should treat it's
on the shrink list already as increment data-found and keep going.  IOW,
if (on shrink list) {
data-found++;
} else {
if (on lru list)
d_lru_del
if (refcount is zero) {
d_shrink_add
data-found++;
}
}
if (data-found)
ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dcache shrink list corruption?

2014-05-02 Thread Al Viro
On Fri, May 02, 2014 at 11:40:22PM +0100, Al Viro wrote:
> On Fri, May 02, 2014 at 02:18:43PM -0700, Linus Torvalds wrote:
> > On Fri, May 2, 2014 at 2:08 PM, Miklos Szeredi mik...@szeredi.hu wrote:
> > > There's more of the "delete from shrink list not owned by us" in select
> > > parent.
> > > Proposed patch appended.
> >
> > Ahh. Clearly this needs more work before I pull.
>
> *nod*
>
> Besides, I want to put Miklos' "don't bother with RCU in shrink_dentry_list()"
> in there as soon as select_collect() has been dealt with.  I don't think
> that the currently posted patch for select_collect() is right, though -
> see my reply to parent posting.  Basically, I think we should treat "it's
> on the shrink list already" as "increment data->found and keep going".  IOW,
> 	if (on shrink list) {
> 		data->found++;
> 	} else {
> 		if (on lru list)
> 			d_lru_del
> 		if (refcount is zero) {
> 			d_shrink_add
> 			data->found++;
> 		}
> 	}
> 	if (data->found)
> 		ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;

While we are at it - BUG() in umount_collect() is probably a bad idea.
At that point we are holding ->s_umount, so it guarantees that a lot
of stuff from that point on will get stuck.  Starting with sync(2).
And I really doubt that damage from WARN() instead will be more...


Re: dcache shrink list corruption?

2014-05-02 Thread Al Viro
On Fri, May 02, 2014 at 11:40:22PM +0100, Al Viro wrote:
> On Fri, May 02, 2014 at 02:18:43PM -0700, Linus Torvalds wrote:
> > On Fri, May 2, 2014 at 2:08 PM, Miklos Szeredi mik...@szeredi.hu wrote:
> > > There's more of the "delete from shrink list not owned by us" in select
> > > parent.
> > > Proposed patch appended.
> >
> > Ahh. Clearly this needs more work before I pull.
>
> *nod*
>
> Besides, I want to put Miklos' "don't bother with RCU in shrink_dentry_list()"
> in there as soon as select_collect() has been dealt with.  I don't think
> that the currently posted patch for select_collect() is right, though -
> see my reply to parent posting.  Basically, I think we should treat "it's
> on the shrink list already" as "increment data->found and keep going".  IOW,
> 	if (on shrink list) {
> 		data->found++;
> 	} else {
> 		if (on lru list)
> 			d_lru_del
> 		if (refcount is zero) {
> 			d_shrink_add
> 			data->found++;
> 		}
> 	}
> 	if (data->found)
> 		ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;

See vfs.git#dentry_kill-3; warning - this is completely untested and I would
really like comments on the spinning case there (i.e. the one where
select_collect() finds some stuff already on some other shrink list and
nothing with zero refcount that wouldn't be there).  In that case (and it's
basically "somebody else is evicting stuff in our subtree and they'd already
picked everything we want evicted") I just let the loop in
check_submounts_and_drop() repeat (we do have cond_resched() there).  Any
better suggestions would be welcome...


Re: dcache shrink list corruption?

2014-05-01 Thread Al Viro
On Wed, Apr 30, 2014 at 11:15:15AM +0200, Miklos Szeredi wrote:
> On Tue, Apr 29, 2014 at 07:56:13PM -0700, Linus Torvalds wrote:
> > On Tue, Apr 29, 2014 at 7:31 PM, Al Viro  wrote:
> > >
> > > OK, aggregate diff follows, more readable splitup (3 commits) attached.
> > > It seems to survive beating here; testing, review and comments are
> > > welcome.
> > 
> > Miklos, did you have some particular load that triggered this, or was
> > it just some reports? It would be really good to get this patch some
> > stress-testing.
> > 
> > I like how the patch removes more lines than it adds, but apart from
> > that it's hard to read the patch (even the split-out ones) and say
> > anything more about it. I think this needs a *lot* of testing.
> 
> IBM is triggering this with the host01 test from the LTP suite.  I haven't yet
> tried to reproduce it.

Could you repost their report?  So far I hadn't managed to get actual list
corruption on mainline kernel - it definitely is possible, but on all
testcases so far the race window is too narrow.  So if they have a reproducer
that doesn't take an insane amount of time, I'd really like to see it...


Re: dcache shrink list corruption?

2014-05-01 Thread Linus Torvalds
On Thu, May 1, 2014 at 2:05 PM, Al Viro  wrote:
>
> ... and so has this one, the bounce being

I actually got all three messages despite the bounces. But I seem to
have gotten at least the first one through lkml.

Your email is lacking SPF or any other authentication, so it may be
gmail being pissy about that, though:

  Received-SPF: none (google.com: v...@ftp.linux.org.uk does not
designate permitted sender hosts) client-ip=195.92.253.2;

especially if your IP is in a block where other people send spam,
gmail has been known to just say "that ISP is clearly bad".

  Linus


Re: dcache shrink list corruption?

2014-05-01 Thread Al Viro
On Thu, May 01, 2014 at 10:02:33PM +0100, Al Viro wrote:
> *grumble*
> 
> Looks like gmail filtering is buggered again.  The previous mail got
> bounced by gmail, both for Miklos and for Linus ;-/  Hell knows if
> this one gets through; the original is at http://lkml.org/lkml/2014/5/1/235,
> bounce being bloody vague, as usual.

... and so has this one, the bounce being

[quote]
A message that you sent could not be delivered to one or more of its
recipients. This is a permanent error. The following address(es) failed:

  torva...@linux-foundation.org
SMTP error from remote mail server after end of data:
host ASPMX.L.GOOGLE.COM [173.194.67.27]: 550-5.7.1 [195.92.253.2 12]
Our system has detected that this message is
550-5.7.1 likely unsolicited mail. To reduce the amount of spam sent to Gmail,
550-5.7.1 this message has been blocked. Please visit
550-5.7.1 http://support.google.com/mail/bin/answer.py?hl=en&answer=188131 for
550 5.7.1 more information. cv1si45381wib.71 - gsmtp
  mik...@szeredi.hu
SMTP error from remote mail server after end of data:
host ASPMX.L.GOOGLE.COM [173.194.67.27]: 550-5.7.1 [195.92.253.2 12]
Our system has detected that this message is
550-5.7.1 likely unsolicited mail. To reduce the amount of spam sent to Gmail,
550-5.7.1 this message has been blocked. Please visit
550-5.7.1 http://support.google.com/mail/bin/answer.py?hl=en&answer=188131 for
550 5.7.1 more information. bb7si53528wib.20 - gsmtp
[endquote]

What the hell is going on?


Re: dcache shrink list corruption?

2014-05-01 Thread Al Viro
*grumble*

Looks like gmail filtering is buggered again.  The previous mail got
bounced by gmail, both for Miklos and for Linus ;-/  Hell knows if
this one gets through; the original is at http://lkml.org/lkml/2014/5/1/235,
bounce being bloody vague, as usual.

This is starting to get really annoying ;-/  Does anybody have any idea
what might be going on?


Re: dcache shrink list corruption?

2014-05-01 Thread Al Viro
On Thu, May 01, 2014 at 11:42:52AM +0200, Miklos Szeredi wrote:
> Two points about latest version (dentry_kill-2):
> 
> - Doing anything with dentry->d_parent in case of DCACHE_DENTRY_KILLED looks
>   seriously wrong.  Parent has been dealt with, at that point, by the other
>   caller, no?

In both branches, actually - we should bugger off earlier *and* return NULL
in that case.  Nice catch.

Hmm... I see why it failed to blow up on that.  It *did* trigger, all right -
udev is stepping into that right on boot.  The thing is, check should be
if ((int)dentry->d_lockref.count > 0)
since the damn thing is unsigned int.  IOW, they did go through handover
and shrink_dentry_list() proceeded to lose them.  And with that braino
fixed, it steps into the extra dput crap just fine.

OK, fixed and pushed (both branches).

> - "bool foo = flag & FLAG" looks suspicious.  Is this guaranteed not to
>   overflow?

What do you mean, overflow?  It's not a 1-bit unsigned int; conversion to
_Bool is different (which is the only reason why it's more than mere
syntax sugar).  See C99 6.3.2.1 ("When any scalar value is converted
to _Bool, the result is 0 if the value compares equal to 0; otherwise,
the result is 1").

That, BTW, is also the reason why _Bool bitfields exist - try
struct {
_Bool a:1;
unsigned b:1;
} x;
x.a = 2;
x.b = 2;
if (x.a)
printf("A");
if (x.b)
printf("B");
and see what it does.  The first test triggers, same as if (2) would.
The second does not, since conversion to unsigned integer type other
than _Bool gives the value in range of that type comparable to original
modulo (maximal representable + 1).  And 2 modulo 2 is 0...


Re: dcache shrink list corruption?

2014-05-01 Thread Miklos Szeredi
Two points about latest version (dentry_kill-2):

- Doing anything with dentry->d_parent in case of DCACHE_DENTRY_KILLED looks
  seriously wrong.  Parent has been dealt with, at that point, by the other
  caller, no?

- "bool foo = flag & FLAG" looks suspicious.  Is this guaranteed not to
  overflow?

Thanks,
Miklos




Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 07:59:43PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 7:51 PM, Al Viro  wrote:
> >
> >   Help with profiling is needed; the loads to watch are
> > the ones where dentry_kill() + dentry_free() are sufficiently high in
> > profiles for the differences to matter.  Any takers?
> 
> I really hope there are no such loads, my "lock/unlock pairing"
> suggestion was mostly so that the pairing is clearer, not necessarily
> for performance.
> 
> That said, I'd assume that it migth be worth testing at least the
> "lots of concurrent lookups of 'simple_dentry_operations' dentries".
> So most of /proc, most uses of simple_lookup(). That at least triggers
> the dentry_kill() path in dput(), so it should be fairly easy to get
> profiles.
> 
> But real loads with real filesystems? That sounds harder.

Well, the simplest way to do that is a bunch of open/unlink/close.  Another
one is cp -rl / rm -rf of a large tree (rmdir(2) does shrink_dcache_parent()). 
For that matter, unmapping an anon shared mapping will trigger the same path.


Re: dcache shrink list corruption?

2014-04-30 Thread Linus Torvalds
On Wed, Apr 30, 2014 at 7:51 PM, Al Viro  wrote:
>
>   Help with profiling is needed; the loads to watch are
> the ones where dentry_kill() + dentry_free() are sufficiently high in profiles
> for the differences to matter.  Any takers?

I really hope there are no such loads, my "lock/unlock pairing"
suggestion was mostly so that the pairing is clearer, not necessarily
for performance.

That said, I'd assume that it migth be worth testing at least the
"lots of concurrent lookups of 'simple_dentry_operations' dentries".
So most of /proc, most uses of simple_lookup(). That at least triggers
the dentry_kill() path in dput(), so it should be fairly easy to get
profiles.

But real loads with real filesystems? That sounds harder.

  Linus


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 05:18:23PM -0700, Linus Torvalds wrote:

> and then suddenly it looks like we have a common exit sequence from
> that dentry_kill() function, no?
> 
> (The earlier "unlock_on_failure" exit case is altogether a different case).
> 
> I dunno. Maybe not a big deal, but one reason I prefer doing that
> "free" flag is because I really tend to prefer the simple case of
> lock-unlock pairing cleanly at the same level. NOT the pattern where
> you have one lock at one indentation level, paired with multiple
> unlocks for all the different cases.

Yeah, but... I have such variant, but the best I could get still generated
the code that wasn't particulary nice.  Part might be gcc code generation
sucking for bool, part - extra register pressure...  It has slightly lower
i-cache footprint, though, so it might be worth doing.  Hell knows; that's a
fairly hot codepath, so let's do it that way - I've just pushed an alternative
branch with bool can_free variant; branches in question: vfs.git#for-linus and
vfs.git#dentry_kill-2. Help with profiling is needed; the loads to watch are
the ones where dentry_kill() + dentry_free() are sufficiently high in profiles
for the differences to matter.  Any takers?


Re: dcache shrink list corruption?

2014-04-30 Thread Linus Torvalds
On Wed, Apr 30, 2014 at 4:43 PM, Al Viro  wrote:
>
> OK, done and force-pushed.  Should propagate in a few...

That made it more obvious how the DCACHE_MAY_FREE case ends up
working. And in particular, mind rewriting this:

	if (dentry->d_flags & DCACHE_MAY_FREE) {
		spin_unlock(&dentry->d_lock);
		dentry_free(dentry);
	} else {
		spin_unlock(&dentry->d_lock);
	}
	return parent;

as just

	bool free = dentry->d_flags & DCACHE_MAY_FREE;
	spin_unlock(&dentry->d_lock);
	if (free)
		dentry_free(dentry);
	return parent;

instead? In fact, I get the feeling that the other case later on
really fits the same model:

	spin_lock(&dentry->d_lock);
	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
		dentry->d_flags |= DCACHE_MAY_FREE;
		spin_unlock(&dentry->d_lock);
	} else {
		spin_unlock(&dentry->d_lock);
		dentry_free(dentry);
	}

ends up really being better as

	spin_lock(&dentry->d_lock);
	free = 1;
	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
		dentry->d_flags |= DCACHE_MAY_FREE;
		free = 0;
	}
	spin_unlock(&dentry->d_lock);
	if (free)
		dentry_free(dentry);
	return parent;

and then suddenly it looks like we have a common exit sequence from
that dentry_kill() function, no?

(The earlier "unlock_on_failure" exit case is altogether a different case).

I dunno. Maybe not a big deal, but one reason I prefer doing that
"free" flag is because I really tend to prefer the simple case of
lock-unlock pairing cleanly at the same level. NOT the pattern where
you have one lock at one indentation level, paired with multiple
unlocks for all the different cases.

  Linus


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 04:14:14PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 4:04 PM, Linus Torvalds
>  wrote:
> >
> > Let me go talk to the paravirt people. Maybe they don't need this, and
> > I don't know exactly *how* they use that lock pointer after the unlock
> > in the "kick waiters" part. Maybe it's all good.
> 
> .. looking at current users (xen and kvm) it does in fact look all
> good. Yes, we "kick" possible waiters after the unlock, but the lock
> itself is not touched by that, it just uses the pointer to the lock as
> a way to figure out who to kick.
> 
> In fact, I kind of knew that, but had forgotten. We very much depend
> on spin_unlock being safe wrt immediate deleton already: the
> "completion" code very much depends on it. It does a "spin_unlock()"
> to release the completion, and it can get reused immediately (it might
> be on the stack, it might be in some data structure that gets freed).
> 
> So I'd suggest removing that whole RCU thing, because it should be
> safe to unlock something that can go away immediately. We'd have huge
> problems elsewhere if it wasn't safe.

OK, done and force-pushed.  Should propagate in a few...


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 04:04:58PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 3:12 PM, Al Viro  wrote:
> >
> > I've just pushed the whole thing to vfs.git#for-linus;
> > review and testing would be very welcome.
> 
> I have no half-way relevant test-case for this, so I'm hoping people
> who have good VFS stress-tests (preferably under memory pressure so
> that we get that whole shrinking path) will test.
> 
> But it looks fine.
> 
> That said, I do hate that RCU read-lock around the final spin-unlock.

So do I, obviously...  After looking through the rest of arch_spin_unlock(),
it looks like the only suspicious one except the paravirt on x86 is
blackfin.  And that might be misreading - I'm not familiar enough with the
architecture to tell...


Re: dcache shrink list corruption?

2014-04-30 Thread Linus Torvalds
On Wed, Apr 30, 2014 at 4:04 PM, Linus Torvalds
 wrote:
>
> Let me go talk to the paravirt people. Maybe they don't need this, and
> I don't know exactly *how* they use that lock pointer after the unlock
> in the "kick waiters" part. Maybe it's all good.

.. looking at current users (xen and kvm) it does in fact look all
good. Yes, we "kick" possible waiters after the unlock, but the lock
itself is not touched by that, it just uses the pointer to the lock as
a way to figure out who to kick.

In fact, I kind of knew that, but had forgotten. We very much depend
on spin_unlock being safe wrt immediate deleton already: the
"completion" code very much depends on it. It does a "spin_unlock()"
to release the completion, and it can get reused immediately (it might
be on the stack, it might be in some data structure that gets freed).

So I'd suggest removing that whole RCU thing, because it should be
safe to unlock something that can go away immediately. We'd have huge
problems elsewhere if it wasn't safe.

   Linus


Re: dcache shrink list corruption?

2014-04-30 Thread Linus Torvalds
On Wed, Apr 30, 2014 at 3:12 PM, Al Viro  wrote:
>
> I've just pushed the whole thing to vfs.git#for-linus;
> review and testing would be very welcome.

I have no half-way relevant test-case for this, so I'm hoping people
who have good VFS stress-tests (preferably under memory pressure so
that we get that whole shrinking path) will test.

But it looks fine.

That said, I do hate that RCU read-lock around the final spin-unlock.

Let me go talk to the paravirt people. Maybe they don't need this, and
I don't know exactly *how* they use that lock pointer after the unlock
in the "kick waiters" part. Maybe it's all good.

 Linus


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 10:12:06PM +0100, Al Viro wrote:
> On Wed, Apr 30, 2014 at 01:57:05PM -0700, Linus Torvalds wrote:
> > On Wed, Apr 30, 2014 at 1:38 PM, Al Viro  wrote:
> > >
> > > We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can
> > > trigger any amount of IO, for one thing.  We can take it around the
> > > couple of places where we do that spin_unlock(&dentry->d_lock) (along with
> > > setting DCACHE_RCUACCESS) - that's what I'd been referring to.
> > 
> > Just the last spin_unlock() would be the case that matters, if the
> > spin_unlock() is done on something that could be freed immediately and
> > the lock protects and is inside the entity that gets freed.
> 
> *nod*
> 
> There are two such spin_unlock (handover from shrink_dentry_list() to
> dput() and the opposite one), but they are all that needs protection -
> ->d_flags update is outside the rcu-critical area.  I really wonder
> if we *can* get there without DCACHE_RCUACCESS having been set, though;
> dentry would have to be
>   * picked into shrink list (i.e. have had zero refcount at some point)
>   * never had been through __d_rehash()
> shrink_dentry_list() definitely counts on that being impossible, and it
> probably is, but I'm feeling seriously paranoid about the whole area.
> I'll finish grepping through the tree and probably drop setting
> DCACHE_RCUACCESS from the patch - either that, or set it in d_shrink_add()
> if it turns out that it is possible and shrink_dentry_list() is fucked...

OK, it really can't happen.  The proof is more convoluted than I'd like it,
but it's solid enough, so setting that flag in dentry_kill() handover cases
wasn't needed.  I've just pushed the whole thing to vfs.git#for-linus;
review and testing would be very welcome.  I can repost it one more time,
but the only difference compared to the last variant in this thread is not
bothering with DCACHE_RCUACCESS.

It has survived LTP tests, going through xfstests now...


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 01:57:05PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 1:38 PM, Al Viro  wrote:
> >
> > We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can
> > trigger any amount of IO, for one thing.  We can take it around the
> > couple of places where we do that spin_unlock(&dentry->d_lock) (along with
> > setting DCACHE_RCUACCESS) - that's what I'd been referring to.
> 
> Just the last spin_unlock() would be the case that matters, if the
> spin_unlock() is done on something that could be freed immediately and
> the lock protects and is inside the entity that gets freed.

*nod*

There are two such spin_unlock (handover from shrink_dentry_list() to
dput() and the opposite one), but they are all that needs protection -
->d_flags update is outside the rcu-critical area.  I really wonder
if we *can* get there without DCACHE_RCUACCESS having been set, though;
dentry would have to be
* picked into shrink list (i.e. have had zero refcount at some point)
* never had been through __d_rehash()
shrink_dentry_list() definitely counts on that being impossible, and it
probably is, but I'm feeling seriously paranoid about the whole area.
I'll finish grepping through the tree and probably drop setting
DCACHE_RCUACCESS from the patch - either that, or set it in d_shrink_add()
if it turns out that it is possible and shrink_dentry_list() is fucked...

Tests seem to be running fine so far...

> >   BTW, is there any convenient
> > way to tell git commit --amend to update the commit date?  Something
> > like --date=now would be nice, but it isn't accepted...
> 
> --date="$(date)" works.

Thanks...


Re: dcache shrink list corruption?

2014-04-30 Thread Linus Torvalds
On Wed, Apr 30, 2014 at 1:38 PM, Al Viro  wrote:
>
> We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can
> trigger any amount of IO, for one thing.  We can take it around the
> couple of places where we do that spin_unlock(&dentry->d_lock) (along with
> setting DCACHE_RCUACCESS) - that's what I'd been referring to.

Just the last spin_unlock() would be the case that matters, if the
spin_unlock() is done on something that could be freed immediately and
the lock protects and is inside the entity that gets freed.

>   BTW, is there any convenient
> way to tell git commit --amend to update the commit date?  Something
> like --date=now would be nice, but it isn't accepted...

--date="$(date)" works.

The "--date" thing doesn't take the nice "git approxidate" strings,
maybe it should. So you have to give it a "real" date.

  Linus


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 01:23:26PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 12:59 PM, Al Viro  wrote:
> >
> > Another thing: I don't like what's going on with freeing vs. ->d_lock there.
> > Had that been a mutex, we'd definitely get a repeat of "vfs: fix subtle
> > use-after-free of pipe_inode_info".  The question is, can spin_unlock(p)
> > dereference p after another CPU gets through spin_lock(p)?  Linus?
> 
> spin_unlock() *should* be safe wrt that issue.
> 
> But I have to say, I think paravirtualized spinlocks may break that.
> They do all kinds of "kick waiters" after releasing the lock.
> 
> Doesn't the RCU protection solve that, though? Nobody should be
> releasing the dentry under us, afaik..

We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can
trigger any amount of IO, for one thing.  We can take it around the
couple of places where we do that spin_unlock(&dentry->d_lock) (along with
setting DCACHE_RCUACCESS) - that's what I'd been referring to.  Then this
sucker (tests still running, so far everything seems to survive) becomes
the following (again, on top of 1/6..4/6).  BTW, is there any convenient
way to tell git commit --amend to update the commit date?  Something
like --date=now would be nice, but it isn't accepted...

commit 797ff22681dc969b478ed837787d24dfd2dd2132
Author: Al Viro 
Date:   Tue Apr 29 23:52:05 2014 -0400

dentry_kill(): don't try to remove from shrink list

If the victim in on the shrink list, don't remove it from there.
If shrink_dentry_list() manages to remove it from the list before
we are done - fine, we'll just free it as usual.  If not - mark
it with new flag (DCACHE_MAY_FREE) and leave it there.

Eventually, shrink_dentry_list() will get to it, remove the sucker
from shrink list and call dentry_kill(dentry, 0).  Which is where
we'll deal with freeing.

Since now dentry_kill(dentry, 0) may happen after or during
dentry_kill(dentry, 1), we need to recognize that (by seeing
DCACHE_DENTRY_KILLED already set), unlock everything
and either free the sucker (in case DCACHE_MAY_FREE has been
set) or leave it for ongoing dentry_kill(dentry, 1) to deal with.

Signed-off-by: Al Viro 

diff --git a/fs/dcache.c b/fs/dcache.c
index e482775..fa40d26 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -489,6 +489,20 @@ relock:
goto relock;
}
 
+   if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+   if (parent)
+   spin_unlock(&parent->d_lock);
+   if (dentry->d_flags & DCACHE_MAY_FREE) {
+   spin_unlock(&dentry->d_lock);
+   dentry_free(dentry);
+   } else {
+   dentry->d_flags |= DCACHE_RCUACCESS;
+   rcu_read_lock();
+   spin_unlock(&dentry->d_lock);
+   rcu_read_unlock();
+   }
+   return parent;
+   }
/*
 * The dentry is now unrecoverably dead to the world.
 */
@@ -504,8 +518,6 @@ relock:
if (dentry->d_flags & DCACHE_LRU_LIST) {
if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
d_lru_del(dentry);
-   else
-   d_shrink_del(dentry);
}
/* if it was on the hash then remove it */
__d_drop(dentry);
@@ -527,7 +539,16 @@ relock:
if (dentry->d_op && dentry->d_op->d_release)
dentry->d_op->d_release(dentry);
 
-   dentry_free(dentry);
+   spin_lock(&dentry->d_lock);
+   if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+   dentry->d_flags |= DCACHE_MAY_FREE | DCACHE_RCUACCESS;
+   rcu_read_lock();
+   spin_unlock(&dentry->d_lock);
+   rcu_read_unlock();
+   } else {
+   spin_unlock(&dentry->d_lock);
+   dentry_free(dentry);
+   }
return parent;
 }
 
@@ -829,7 +850,7 @@ static void shrink_dentry_list(struct list_head *list)
 * We found an inuse dentry which was not removed from
 * the LRU because of laziness during lookup. Do not free it.
 */
-   if (dentry->d_lockref.count) {
+   if (dentry->d_lockref.count > 0) {
spin_unlock(&dentry->d_lock);
continue;
}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3b9bfdb..3c7ec32 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -221,6 +221,8 @@ struct dentry_operations {
 #define DCACHE_SYMLINK_TYPE 0x0030 /* Symlink */
 #define DCACHE_FILE_TYPE    0x0040 /* Other file type */
 
+#define DCACHE_MAY_FREE     0x0080
+
 extern seqlock_t rename_lock;
 
 static inline int dname_external(const struct dentry *dentry)

Re: dcache shrink list corruption?

2014-04-30 Thread Linus Torvalds
On Wed, Apr 30, 2014 at 12:59 PM, Al Viro  wrote:
>
> Another thing: I don't like what's going on with freeing vs. ->d_lock there.
> Had that been a mutex, we'd definitely get a repeat of "vfs: fix subtle
> use-after-free of pipe_inode_info".  The question is, can spin_unlock(p)
> dereference p after another CPU gets through spin_lock(p)?  Linus?

spin_unlock() *should* be safe wrt that issue.

But I have to say, I think paravirtualized spinlocks may break that.
They do all kinds of "kick waiters" after releasing the lock.

Doesn't the RCU protection solve that, though? Nobody should be
releasing the dentry under us, afaik..

Linus


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 08:02:27PM +0100, Al Viro wrote:
> On Wed, Apr 30, 2014 at 08:42:01PM +0200, Miklos Szeredi wrote:
> 
> > Message-ID: <20140430154958.gc3...@tucsk.piliscsaba.szeredi.hu>
> 
> I see...  Several points:
>   * I still think that it's better to do handling of "we see
> DCACHE_DENTRY_KILLED already set" in dentry_kill() itself.
>   * in dentry_kill(dentry, 0) we *know* that it's not on a shrink
> list - the caller has just removed it from there and we'd kept ->d_lock
> since that point.   What's more, with that scheme we don't need to bother
> with can_free at all - just grab ->d_lock after ->d_release() call and check
> DCACHE_SHRINK_LIST.  No sense trying to avoid that - in case where we
> could just go ahead and free the sucker, there's neither contention nor
> anybody else interested in that cacheline, so spin_lock(&dentry->d_lock)
> is pretty much free.
> 
> IOW, I'd prefer to do the following (completely untested) on top of 1/6--4/6:

Sigh...  One more problem:
/*
 * We found an inuse dentry which was not removed from
 * the LRU because of laziness during lookup. Do not free it.
 */
if (dentry->d_lockref.count) {
spin_unlock(&dentry->d_lock);
continue;
}
should become
if (dentry->d_lockref.count > 0) {

instead - lockref_mark_dead() slaps -128 into it, so we'll just leak all
dentries where dput() has come first and decided to leave the suckers
for us.

Another thing: I don't like what's going on with freeing vs. ->d_lock there.
Had that been a mutex, we'd definitely get a repeat of "vfs: fix subtle
use-after-free of pipe_inode_info".  The question is, can spin_unlock(p)
dereference p after another CPU gets through spin_lock(p)?  Linus?

It can be dealt with by setting DCACHE_RCUACCESS in "let another guy free
it" cases and playing with rcu_read_lock a bit, but I wonder whether we
need to bother - quick look through alpha/sparc/ppc shows that on those
we do not and the same is true for non-paravirt case on x86.  I hadn't
checked what paravirt one does, though, and I certainly hadn't done
exhaustive search for architectures doing something weird...


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 08:42:01PM +0200, Miklos Szeredi wrote:

> Message-ID: <20140430154958.gc3...@tucsk.piliscsaba.szeredi.hu>

I see...  Several points:
* I still think that it's better to do handling of "we see
DCACHE_DENTRY_KILLED already set" in dentry_kill() itself.
* in dentry_kill(dentry, 0) we *know* that it's not on a shrink
list - the caller has just removed it from there and we'd kept ->d_lock
since that point.   What's more, with that scheme we don't need to bother
with can_free at all - just grab ->d_lock after ->d_release() call and check
DCACHE_SHRINK_LIST.  No sense trying to avoid that - in case where we
could just go ahead and free the sucker, there's neither contention nor
anybody else interested in that cacheline, so spin_lock(&dentry->d_lock)
is pretty much free.

IOW, I'd prefer to do the following (completely untested) on top of 1/6--4/6:

diff --git a/fs/dcache.c b/fs/dcache.c
index e482775..82d8eb4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -489,6 +489,17 @@ relock:
goto relock;
}
 
+   if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+   if (parent)
+   spin_unlock(&parent->d_lock);
+   if (dentry->d_flags & DCACHE_MAY_FREE) {
+   spin_unlock(&dentry->d_lock);
+   dentry_free(dentry);
+   } else {
+   spin_unlock(&dentry->d_lock);
+   }
+   return parent;
+   }
/*
 * The dentry is now unrecoverably dead to the world.
 */
@@ -504,8 +515,6 @@ relock:
if (dentry->d_flags & DCACHE_LRU_LIST) {
if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
d_lru_del(dentry);
-   else
-   d_shrink_del(dentry);
}
/* if it was on the hash then remove it */
__d_drop(dentry);
@@ -527,7 +536,14 @@ relock:
if (dentry->d_op && dentry->d_op->d_release)
dentry->d_op->d_release(dentry);
 
-   dentry_free(dentry);
+   spin_lock(&dentry->d_lock);
+   if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+   dentry->d_flags |= DCACHE_MAY_FREE;
+   spin_unlock(&dentry->d_lock);
+   } else {
+   spin_unlock(&dentry->d_lock);
+   dentry_free(dentry);
+   }
return parent;
 }
 
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3b9bfdb..3c7ec32 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -221,6 +221,8 @@ struct dentry_operations {
 #define DCACHE_SYMLINK_TYPE 0x0030 /* Symlink */
 #define DCACHE_FILE_TYPE    0x0040 /* Other file type */
 
+#define DCACHE_MAY_FREE     0x0080
+
 extern seqlock_t rename_lock;
 
 static inline int dname_external(const struct dentry *dentry)


Re: dcache shrink list corruption?

2014-04-30 Thread Miklos Szeredi
On Wed, Apr 30, 2014 at 8:36 PM, Al Viro  wrote:
> On Wed, Apr 30, 2014 at 07:33:38PM +0200, Miklos Szeredi wrote:
>> On Wed, Apr 30, 2014 at 6:03 PM, Al Viro  wrote:
>> > On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote:
>> >> > FWIW, the first two are really straightforward expanding the function
>> >> > into its only callsite.  The last needs more splitup.  Not sure if the
>> >> > following is good enough, but it ought to be at least somewhat cleaner.
>> >> > Combined change is identical to the original, so it doesn't invalidate
>> >> > the testing so far...
>> >>
>> >> Hmm, patches look okay, but I'm wondering if we really need the morgue 
>> >> list and
>> >> the waiting.  Why not just skip dentries that are presently being handled 
>> >> by
>> >> dentry_kill()?
>> >
>> > Who will be freeing them?  If we do that from dentry_kill(), we are back to
>> > needing them removed from shrink list by something called by dput().  And
>> > if we do that from shrink_dentry_list(), we'd damn better wait for
>> > dentry_iput() et.al. to finish.
>>
>> We can do it from dput if the shrinker gets there first and from the
>> shrinker if dput manages to finish before.  See the updated patch in
>> the previous mail.
>
> Er?  The only patch I see is removal of RCU from shrink_dentry_list(), which
> is fine, but doesn't do anything of that sort.  What was the Message-ID?

Message-ID: <20140430154958.gc3...@tucsk.piliscsaba.szeredi.hu>

>
> Let me see if I understand what you are proposing:
> dentry_kill(dentry, 0) seeing DCACHE_DENTRY_KILLED
> check DCACHE_MAY_FREE, free if it's been set
> dentry_kill(dentry, 1) seeing that we are on shrinker list
> leave on the list, do the work on killing, retake ->d_lock,
> if we are still on shrinker list
> set DCACHE_MAY_FREE,
> else
> free it
>
> That would probably work...

Exactly.

Thanks,
Miklos


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 07:33:38PM +0200, Miklos Szeredi wrote:
> On Wed, Apr 30, 2014 at 6:03 PM, Al Viro  wrote:
> > On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote:
> >> > FWIW, the first two are really straightforward expanding the function
> >> > into its only callsite.  The last needs more splitup.  Not sure if the
> >> > following is good enough, but it ought to be at least somewhat cleaner.
> >> > Combined change is identical to the original, so it doesn't invalidate
> >> > the testing so far...
> >>
> >> Hmm, patches look okay, but I'm wondering if we really need the morgue 
> >> list and
> >> the waiting.  Why not just skip dentries that are presently being handled 
> >> by
> >> dentry_kill()?
> >
> > Who will be freeing them?  If we do that from dentry_kill(), we are back to
> > needing them removed from shrink list by something called by dput().  And
> > if we do that from shrink_dentry_list(), we'd damn better wait for
> > dentry_iput() et.al. to finish.
> 
> We can do it from dput if the shrinker gets there first and from the
> shrinker if dput manages to finish before.  See the updated patch in
> the previous mail.

Er?  The only patch I see is removal of RCU from shrink_dentry_list(), which
is fine, but doesn't do anything of that sort.  What was the Message-ID?

Let me see if I understand what you are proposing:
dentry_kill(dentry, 0) seeing DCACHE_DENTRY_KILLED
check DCACHE_MAY_FREE, free if it's been set
dentry_kill(dentry, 1) seeing that we are on shrinker list
leave on the list, do the work on killing, retake ->d_lock,
if we are still on shrinker list
set DCACHE_MAY_FREE,
else
free it

That would probably work...


Re: dcache shrink list corruption?

2014-04-30 Thread Miklos Szeredi
On Wed, Apr 30, 2014 at 6:03 PM, Al Viro  wrote:
> On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote:
>> > FWIW, the first two are really straightforward expanding the function
>> > into its only callsite.  The last needs more splitup.  Not sure if the
>> > following is good enough, but it ought to be at least somewhat cleaner.
>> > Combined change is identical to the original, so it doesn't invalidate
>> > the testing so far...
>>
>> Hmm, patches look okay, but I'm wondering if we really need the morgue list 
>> and
>> the waiting.  Why not just skip dentries that are presently being handled by
>> dentry_kill()?
>
> Who will be freeing them?  If we do that from dentry_kill(), we are back to
> needing them removed from shrink list by something called by dput().  And
> if we do that from shrink_dentry_list(), we'd damn better wait for
> dentry_iput() et.al. to finish.

We can do it from dput if the shrinker gets there first and from the
shrinker if dput manages to finish before.  See the updated patch in
the previous mail.

Thanks,
Miklos


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote:
> > FWIW, the first two are really straightforward expanding the function
> > into its only callsite.  The last needs more splitup.  Not sure if the
> > following is good enough, but it ought to be at least somewhat cleaner.
> > Combined change is identical to the original, so it doesn't invalidate
> > the testing so far...
> 
> Hmm, patches look okay, but I'm wondering if we really need the morgue list 
> and
> the waiting.  Why not just skip dentries that are presently being handled by
> dentry_kill()?

Who will be freeing them?  If we do that from dentry_kill(), we are back to
needing them removed from shrink list by something called by dput().  And
if we do that from shrink_dentry_list(), we'd damn better wait for
dentry_iput() et.al. to finish.


Re: dcache shrink list corruption?

2014-04-30 Thread Miklos Szeredi
And a followup patch for that, removing RCU locking from shrink_dentry_list().

I haven't tested any of this at this point, but I'll ask IBM to do some stress
testing.

Thanks,
Miklos
---

From: Miklos Szeredi 
Subject: dcache: don't need rcu in shrink_dentry_list()

Since now the shrink list is private and nobody can free the dentry while
it is on the shrink list, we can remove RCU protection from this.

Signed-off-by: Miklos Szeredi 
---
 fs/dcache.c |   29 -
 1 file changed, 4 insertions(+), 25 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -822,23 +822,9 @@ static void shrink_dentry_list(struct li
 {
struct dentry *dentry, *parent;
 
-   rcu_read_lock();
-   for (;;) {
-   dentry = list_entry_rcu(list->prev, struct dentry, d_lru);
-   if (&dentry->d_lru == list)
-   break; /* empty */
-
-   /*
-* Get the dentry lock, and re-verify that the dentry is
-* this on the shrinking list. If it is, we know that
-* DCACHE_SHRINK_LIST and DCACHE_LRU_LIST are set.
-*/
+   while (!list_empty(list)) {
+   dentry = list_entry(list->prev, struct dentry, d_lru);
spin_lock(&dentry->d_lock);
-   if (dentry != list_entry(list->prev, struct dentry, d_lru)) {
-   spin_unlock(&dentry->d_lock);
-   continue;
-   }
-
/*
 * The dispose list is isolated and dentries are not accounted
 * to the LRU here, so we can simply remove it from the list
@@ -854,7 +840,6 @@ static void shrink_dentry_list(struct li
spin_unlock(&dentry->d_lock);
continue;
}
-   rcu_read_unlock();
 
if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
bool free_it;
@@ -870,8 +855,6 @@ static void shrink_dentry_list(struct li
 
if (free_it)
dentry_free(dentry);
-
-   rcu_read_lock();
continue;
}
 
@@ -879,17 +862,15 @@ static void shrink_dentry_list(struct li
/*
 * If dentry_kill returns NULL, we have nothing more to do.
 */
-   if (!parent) {
-   rcu_read_lock();
+   if (!parent)
continue;
-   }
+
if (unlikely(parent == dentry)) {
/*
 * trylocks have failed and d_lock has been held the
 * whole time, so it could not have been added to any
 * other lists. Just add it back to the shrink list.
 */
-   rcu_read_lock();
d_shrink_add(dentry, list);
spin_unlock(&dentry->d_lock);
continue;
@@ -903,9 +884,7 @@ static void shrink_dentry_list(struct li
dentry = parent;
while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
dentry = dentry_kill(dentry, 1);
-   rcu_read_lock();
}
-   rcu_read_unlock();
 }
 
 static enum lru_status


Re: dcache shrink list corruption?

2014-04-30 Thread Miklos Szeredi
On Wed, Apr 30, 2014 at 05:04:36AM +0100, Al Viro wrote:
> On Tue, Apr 29, 2014 at 07:56:13PM -0700, Linus Torvalds wrote:
> > On Tue, Apr 29, 2014 at 7:31 PM, Al Viro  wrote:
> > >
> > > OK, aggregate diff follows, more readable splitup (3 commits) attached.
> > > It seems to survive beating here; testing, review and comments are
> > > welcome.
> > 
> > Miklos, did you have some particular load that triggered this, or was
> > it just some reports? It would be really good to get this patch some
> > stress-testing.
> > 
> > I like how the patch removes more lines than it adds, but apart from
> > that it's hard to read the patch (even the split-out ones) and say
> > anything more about it. I think this needs a *lot* of testing.
> 
> FWIW, the first two are really straightforward expanding the function
> into its only callsite.  The last needs more splitup.  Not sure if the
> following is good enough, but it ought to be at least somewhat cleaner.
> Combined change is identical to the original, so it doesn't invalidate
> the testing so far...

Hmm, patches look okay, but I'm wondering if we really need the morgue list and
the waiting.  Why not just skip dentries that are presently being handled by
dentry_kill()?

Thanks,
Miklos
---


From: Al Viro 
Date: Tue, 29 Apr 2014 23:52:05 -0400
Subject: Don't try to remove from shrink list we don't own

So far it's just been local equivalent transformations.  Now the meat of
that thing: dentry_kill() has two kinds of call sites - ones that had just
dropped the last reference (dput(), killing ancestors in
shrink_dentry_list()) and one where the victim had sat on shrink list with
zero refcount.  We already have a flag telling which kind it is
(unlock_on_failure).  What we want to do is to avoid d_shrink_del() in the
first kind of dentry_kill().  We do, however, want everything except the
actual freeing still done as we would in mainline.  So we need to deal with
two new situations - the first kind of dentry_kill() finding a dentry on
shrink list (result of laziness in dget(); we had it on shrink list with
refcount 1) and the second kind finding a dentry that is currently being
killed by the first kind.  What we do is this:

 - in the first kind of dentry_kill() we don't remove the dentry from the
   shrink list, this makes the shrink list really private to the shrinker
   functions

 - we proceed with the normal killing but only actually free the dentry if
   it's definitely not on the shrink list at this point.  If it's still on
   the shrink list set DCACHE_MAY_FREE and defer the freeing to
   shrink_dentry_list()

 - shrink_dentry_list() will detect if the dentry is killed with
   DCACHE_DENTRY_KILLED.  If DCACHE_MAY_FREE isn't yet set, just take the
   dentry off the shrink list and let the still running dentry_kill()
   finish it off.  If DCACHE_MAY_FREE is set, then free the dentry.

Signed-off-by: Al Viro 
Signed-off-by: Miklos Szeredi 
---
 fs/dcache.c|   44 ++--
 include/linux/dcache.h |2 ++
 2 files changed, 44 insertions(+), 2 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -469,6 +469,7 @@ dentry_kill(struct dentry *dentry, int u
 {
struct inode *inode;
struct dentry *parent;
+   bool can_free = true;
 
inode = dentry->d_inode;
if (inode && !spin_trylock(&inode->i_lock)) {
@@ -504,8 +505,10 @@ dentry_kill(struct dentry *dentry, int u
if (dentry->d_flags & DCACHE_LRU_LIST) {
if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
d_lru_del(dentry);
-   else
+   else if (!unlock_on_failure)
d_shrink_del(dentry);
+   else
+   can_free = false;
}
/* if it was on the hash then remove it */
__d_drop(dentry);
@@ -527,7 +530,25 @@ dentry_kill(struct dentry *dentry, int u
if (dentry->d_op && dentry->d_op->d_release)
dentry->d_op->d_release(dentry);
 
-   dentry_free(dentry);
+   if (unlikely(!can_free)) {
+   spin_lock(&dentry->d_lock);
+   /*
+* Could have gotten off the shrink list while not holding the
+* dcache lock.
+*
+* If still on the shrink list shrink_dentry_list will take care
+* of freeing it for us.  Signal this by setting the
+* DCACHE_MAY_FREE flag.
+*/
+   if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
+   can_free = true;
+   else
+   dentry->d_flags |= DCACHE_MAY_FREE;
+   spin_unlock(&dentry->d_lock);
+   }
+   if (likely(can_free))
+   dentry_free(dentry);
+
return parent;
 }
 
@@ -835,6 +856,25 @@ static void shrink_dentry_list(struct li
}
rcu_read_unlock();
 
+   if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+

Re: dcache shrink list corruption?

2014-04-30 Thread Miklos Szeredi
On Tue, Apr 29, 2014 at 07:56:13PM -0700, Linus Torvalds wrote:
> On Tue, Apr 29, 2014 at 7:31 PM, Al Viro  wrote:
> >
> > OK, aggregate diff follows, more readable splitup (3 commits) attached.
> > It seems to survive beating here; testing, review and comments are
> > welcome.
> 
> Miklos, did you have some particular load that triggered this, or was
> it just some reports? It would be really good to get this patch some
> stress-testing.
> 
> I like how the patch removes more lines than it adds, but apart from
> that it's hard to read the patch (even the split-out ones) and say
> anything more about it. I think this needs a *lot* of testing.

IBM is triggering this with the host01 test from the LTP suite.  I haven't yet
tried to reproduce it.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dcache shrink list corruption?

2014-04-30 Thread Miklos Szeredi
On Tue, Apr 29, 2014 at 07:56:13PM -0700, Linus Torvalds wrote:
 On Tue, Apr 29, 2014 at 7:31 PM, Al Viro v...@zeniv.linux.org.uk wrote:
 
  OK, aggregate diff follows, more readable splitup (3 commits) attached.
  It seems to survive beating here; testing, review and comments are
  welcome.
 
 Miklos, did you have some particular load that triggered this, or was
 it just some reports? It would be really good to get this patch some
 stress-testing.
 
 I like how the patch removes more lines than it adds, but apart from
 that it's hard to read the patch (even the split-out ones) and say
 anything more about it. I think this needs a *lot* of testing.

IBM is triggering this with the host01 test from the LTP suite.  I haven't yet
tried to reproduce it.

Thanks,
Miklos
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dcache shrink list corruption?

2014-04-30 Thread Miklos Szeredi
On Wed, Apr 30, 2014 at 05:04:36AM +0100, Al Viro wrote:
 On Tue, Apr 29, 2014 at 07:56:13PM -0700, Linus Torvalds wrote:
  On Tue, Apr 29, 2014 at 7:31 PM, Al Viro v...@zeniv.linux.org.uk wrote:
  
   OK, aggregate diff follows, more readable splitup (3 commits) attached.
   It seems to survive beating here; testing, review and comments are
   welcome.
  
  Miklos, did you have some particular load that triggered this, or was
  it just some reports? It would be really good to get this patch some
  stress-testing.
  
  I like how the patch removes more lines than it adds, but apart from
  that it's hard to read the patch (even the split-out ones) and say
  anything more about it. I think this needs a *lot* of testing.
 
 FWIW, the first two are really straightforward expanding the function
 into its only callsite.  The last needs more splitup.  Not sure if the
 following is good enough, but it ought to be at least somewhat cleaner.
 Combined change is identical to the original, so it doesn't invalidate
 the testing so far...

Hmm, patches look okay, but I'm wondering if we really need the morgue list and
the waiting.  Why not just skip dentries that are presently being handled by
dentry_kill()?

Thanks,
Miklos
---


From: Al Viro v...@zeniv.linux.org.uk
Date: Tue, 29 Apr 2014 23:52:05 -0400
Subject: Don't try to remove from shrink list we don't own

So far it's just been local equivalent transformations.  Now the meat of
that thing: dentry_kill() has two kinds of call sites - ones that had just
dropped the last reference (dput(), killing ancestors in
shrink_dentry_list()) and one where the victim had sat on shrink list with
zero refcount.  We already have a flag telling which kind it is
(unlock_on_failure).  What we want to do is to avoid d_shrink_del() in the
first kind of dentry_kill().  We do, however, want everything except the
actual freeing still done as we would in mainline.  So we need to deal with
two new situations - the first kind of dentry_kill() finding a dentry on
shrink list (result of laziness in dget(); we had it on shrink list with
refcount 1) and the second kind finding a dentry that is currently being
killed by the first kind.  What we do is this:

 - in the first kind of dentry_kill() we don't remove the dentry from the
   shrink list, this makes the shrink list really private to the shrinker
   functions

 - we proceed with the normal killing but only actually free the dentry if
   it's definitely not on the shrink list at this point.  If it's still on
   the shrink list set DCACHE_MAY_FREE and defer the freeing to
   shrink_dentry_list()

 - shrink_dentry_list() will detect if the dentry is killed with
   DCACHE_DENTRY_KILLED.  If DCACHE_MAY_FREE isn't yet set, just take the
   dentry off the shrink list and let the still running dentry_kill()
   finish it off.  If DCACHE_MAY_FREE is set, then free the dentry.

Signed-off-by: Al Viro v...@zeniv.linux.org.uk
Signed-off-by: Miklos Szeredi mszer...@suse.cz
---
 fs/dcache.c|   44 ++--
 include/linux/dcache.h |2 ++
 2 files changed, 44 insertions(+), 2 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -469,6 +469,7 @@ dentry_kill(struct dentry *dentry, int u
 {
struct inode *inode;
struct dentry *parent;
+   bool can_free = true;
 
inode = dentry-d_inode;
if (inode  !spin_trylock(inode-i_lock)) {
@@ -504,8 +505,10 @@ dentry_kill(struct dentry *dentry, int u
if (dentry-d_flags  DCACHE_LRU_LIST) {
if (!(dentry-d_flags  DCACHE_SHRINK_LIST))
d_lru_del(dentry);
-   else
+   else if (!unlock_on_failure)
d_shrink_del(dentry);
+   else
+   can_free = false;
}
/* if it was on the hash then remove it */
__d_drop(dentry);
@@ -527,7 +530,25 @@ dentry_kill(struct dentry *dentry, int u
if (dentry-d_op  dentry-d_op-d_release)
dentry-d_op-d_release(dentry);
 
-   dentry_free(dentry);
+   if (unlikely(!can_free)) {
+   spin_lock(dentry-d_lock);
+   /*
+* Could have gotten off the shrink list while not holding the
+* dcache lock.
+*
+* If still on the shrink list shrink_dentry_list will take care
+* of freeing it for us.  Signal this by setting the
+* DCACHE_MAY_FREE flag.
+*/
+   if (!(dentry-d_flags  DCACHE_SHRINK_LIST))
+   can_free = true;
+   else
+   dentry-d_flags |= DCACHE_MAY_FREE;
+   spin_unlock(dentry-d_lock);
+   }
+   if (likely(can_free))
+   dentry_free(dentry);
+
return parent;
 }
 
@@ -835,6 +856,25 @@ static void shrink_dentry_list(struct li
}
rcu_read_unlock();
 
+   if 

Re: dcache shrink list corruption?

2014-04-30 Thread Miklos Szeredi
And a followup patch for that, removing RCU locking from shrink_dentry_list().

I haven't tested any of this at this point, but I'll ask IBM to do some stress
testing.

Thanks,
Miklos
---

From: Miklos Szeredi mszer...@suse.cz
Subject: dcache: don't need rcu in shrink_dentry_list()

Since now the shrink list is private and nobody can free the dentry while
it is on the shrink list, we can remove RCU protection from this.

Signed-off-by: Miklos Szeredi mszer...@suse.cz
---
 fs/dcache.c |   29 -
 1 file changed, 4 insertions(+), 25 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -822,23 +822,9 @@ static void shrink_dentry_list(struct li
 {
struct dentry *dentry, *parent;
 
-   rcu_read_lock();
-   for (;;) {
-   dentry = list_entry_rcu(list->prev, struct dentry, d_lru);
-   if (&dentry->d_lru == list)
-   break; /* empty */
-
-   /*
-* Get the dentry lock, and re-verify that the dentry is
-* this on the shrinking list. If it is, we know that
-* DCACHE_SHRINK_LIST and DCACHE_LRU_LIST are set.
-*/
+   while (!list_empty(list)) {
+   dentry = list_entry(list->prev, struct dentry, d_lru);
spin_lock(&dentry->d_lock);
-   if (dentry != list_entry(list->prev, struct dentry, d_lru)) {
-   spin_unlock(&dentry->d_lock);
-   continue;
-   }
-
/*
 * The dispose list is isolated and dentries are not accounted
 * to the LRU here, so we can simply remove it from the list
@@ -854,7 +840,6 @@ static void shrink_dentry_list(struct li
spin_unlock(&dentry->d_lock);
continue;
}
-   rcu_read_unlock();
 
if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
bool free_it;
@@ -870,8 +855,6 @@ static void shrink_dentry_list(struct li
 
if (free_it)
dentry_free(dentry);
-
-   rcu_read_lock();
continue;
}
 
@@ -879,17 +862,15 @@ static void shrink_dentry_list(struct li
/*
 * If dentry_kill returns NULL, we have nothing more to do.
 */
-   if (!parent) {
-   rcu_read_lock();
+   if (!parent)
continue;
-   }
+
if (unlikely(parent == dentry)) {
/*
 * trylocks have failed and d_lock has been held the
 * whole time, so it could not have been added to any
 * other lists. Just add it back to the shrink list.
 */
-   rcu_read_lock();
d_shrink_add(dentry, list);
spin_unlock(&dentry->d_lock);
continue;
@@ -903,9 +884,7 @@ static void shrink_dentry_list(struct li
dentry = parent;
while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
dentry = dentry_kill(dentry, 1);
-   rcu_read_lock();
}
-   rcu_read_unlock();
 }
 
 static enum lru_status
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote:
  FWIW, the first two are really straightforward expanding the function
  into its only callsite.  The last needs more splitup.  Not sure if the
  following is good enough, but it ought to be at least somewhat cleaner.
  Combined change is identical to the original, so it doesn't invalidate
  the testing so far...
 
 Hmm, patches look okay, but I'm wondering if we really need the morgue list
 and the waiting.  Why not just skip dentries that are presently being handled
 by dentry_kill()?

Who will be freeing them?  If we do that from dentry_kill(), we are back to
needing them removed from shrink list by something called by dput().  And
if we do that from shrink_dentry_list(), we'd damn better wait for
dentry_iput() et.al. to finish.


Re: dcache shrink list corruption?

2014-04-30 Thread Miklos Szeredi
On Wed, Apr 30, 2014 at 6:03 PM, Al Viro v...@zeniv.linux.org.uk wrote:
 On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote:
  FWIW, the first two are really straightforward expanding the function
  into its only callsite.  The last needs more splitup.  Not sure if the
  following is good enough, but it ought to be at least somewhat cleaner.
  Combined change is identical to the original, so it doesn't invalidate
  the testing so far...

 Hmm, patches look okay, but I'm wondering if we really need the morgue list
 and the waiting.  Why not just skip dentries that are presently being handled
 by dentry_kill()?

 Who will be freeing them?  If we do that from dentry_kill(), we are back to
 needing them removed from shrink list by something called by dput().  And
 if we do that from shrink_dentry_list(), we'd damn better wait for
 dentry_iput() et.al. to finish.

We can do it from dput if the shrinker gets there first and from the
shrinker if dput manages to finish before.  See the updated patch in
the previous mail.

Thanks,
Miklos


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 07:33:38PM +0200, Miklos Szeredi wrote:
 On Wed, Apr 30, 2014 at 6:03 PM, Al Viro v...@zeniv.linux.org.uk wrote:
  On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote:
   FWIW, the first two are really straightforward expanding the function
   into its only callsite.  The last needs more splitup.  Not sure if the
   following is good enough, but it ought to be at least somewhat cleaner.
   Combined change is identical to the original, so it doesn't invalidate
   the testing so far...
 
  Hmm, patches look okay, but I'm wondering if we really need the morgue
  list and the waiting.  Why not just skip dentries that are presently being
  handled by dentry_kill()?
 
  Who will be freeing them?  If we do that from dentry_kill(), we are back to
  needing them removed from shrink list by something called by dput().  And
  if we do that from shrink_dentry_list(), we'd damn better wait for
  dentry_iput() et.al. to finish.
 
 We can do it from dput if the shrinker gets there first and from the
 shrinker if dput manages to finish before.  See the updated patch in
 the previous mail.

Er?  The only patch I see is removal of RCU from shrink_dentry_list(), which
is fine, but doesn't do anything of that sort.  What was the Message-ID?

Let me see if I understand what you are proposing:
	dentry_kill(dentry, 0) seeing DCACHE_DENTRY_KILLED
		check DCACHE_MAY_FREE, free if it's been set
	dentry_kill(dentry, 1) seeing that we are on shrinker list
		leave on the list, do the work on killing, retake ->d_lock,
		if we are still on shrinker list
			set DCACHE_MAY_FREE,
		else
			free it

That would probably work...


Re: dcache shrink list corruption?

2014-04-30 Thread Miklos Szeredi
On Wed, Apr 30, 2014 at 8:36 PM, Al Viro v...@zeniv.linux.org.uk wrote:
 On Wed, Apr 30, 2014 at 07:33:38PM +0200, Miklos Szeredi wrote:
 On Wed, Apr 30, 2014 at 6:03 PM, Al Viro v...@zeniv.linux.org.uk wrote:
  On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote:
   FWIW, the first two are really straightforward expanding the function
   into its only callsite.  The last needs more splitup.  Not sure if the
   following is good enough, but it ought to be at least somewhat cleaner.
   Combined change is identical to the original, so it doesn't invalidate
   the testing so far...
 
  Hmm, patches look okay, but I'm wondering if we really need the morgue
  list and the waiting.  Why not just skip dentries that are presently being
  handled by dentry_kill()?
 
  Who will be freeing them?  If we do that from dentry_kill(), we are back to
  needing them removed from shrink list by something called by dput().  And
  if we do that from shrink_dentry_list(), we'd damn better wait for
  dentry_iput() et.al. to finish.

 We can do it from dput if the shrinker gets there first and from the
 shrinker if dput manages to finish before.  See the updated patch in
 the previous mail.

 Er?  The only patch I see is removal of RCU from shrink_dentry_list(), which
 is fine, but doesn't do anything of that sort.  What was the Message-ID?

Message-ID: 20140430154958.gc3...@tucsk.piliscsaba.szeredi.hu


 Let me see if I understand what you are proposing:
 	dentry_kill(dentry, 0) seeing DCACHE_DENTRY_KILLED
 		check DCACHE_MAY_FREE, free if it's been set
 	dentry_kill(dentry, 1) seeing that we are on shrinker list
 		leave on the list, do the work on killing, retake ->d_lock,
 		if we are still on shrinker list
 			set DCACHE_MAY_FREE,
 		else
 			free it

 That would probably work...

Exactly.

Thanks,
Miklos


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 08:42:01PM +0200, Miklos Szeredi wrote:

 Message-ID: 20140430154958.gc3...@tucsk.piliscsaba.szeredi.hu

I see...  Several points:
* I still think that it's better to do handling of we see
DCACHE_DENTRY_KILLED already set in dentry_kill() itself.
* in dentry_kill(dentry, 0) we *know* that it's not on a shrink
list - the caller has just removed it from there and we'd kept ->d_lock
since that point.   What's more, with that scheme we don't need to bother
with can_free at all - just grab ->d_lock after ->d_release() call and check
DCACHE_SHRINK_LIST.  No sense trying to avoid that - in case where we
could just go ahead and free the sucker, there's neither contention nor
anybody else interested in that cacheline, so spin_lock(&dentry->d_lock)
is pretty much free.

IOW, I'd prefer to do the following (completely untested) on top of 1/6--4/6:

diff --git a/fs/dcache.c b/fs/dcache.c
index e482775..82d8eb4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -489,6 +489,17 @@ relock:
goto relock;
}
 
+   if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+   if (parent)
+   spin_unlock(&parent->d_lock);
+   if (dentry->d_flags & DCACHE_MAY_FREE) {
+   spin_unlock(&dentry->d_lock);
+   dentry_free(dentry);
+   } else {
+   spin_unlock(&dentry->d_lock);
+   }
+   return parent;
+   }
/*
 * The dentry is now unrecoverably dead to the world.
 */
@@ -504,8 +515,6 @@ relock:
if (dentry->d_flags & DCACHE_LRU_LIST) {
if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
d_lru_del(dentry);
-   else
-   d_shrink_del(dentry);
}
/* if it was on the hash then remove it */
__d_drop(dentry);
@@ -527,7 +536,14 @@ relock:
if (dentry->d_op && dentry->d_op->d_release)
dentry->d_op->d_release(dentry);
 
-   dentry_free(dentry);
+   spin_lock(&dentry->d_lock);
+   if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+   dentry->d_flags |= DCACHE_MAY_FREE;
+   spin_unlock(&dentry->d_lock);
+   } else {
+   spin_unlock(&dentry->d_lock);
+   dentry_free(dentry);
+   }
return parent;
 }
 
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3b9bfdb..3c7ec32 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -221,6 +221,8 @@ struct dentry_operations {
 #define DCACHE_SYMLINK_TYPE	0x0030 /* Symlink */
 #define DCACHE_FILE_TYPE	0x0040 /* Other file type */
 
+#define DCACHE_MAY_FREE	0x0080
+
 extern seqlock_t rename_lock;
 
 static inline int dname_external(const struct dentry *dentry)


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 08:02:27PM +0100, Al Viro wrote:
 On Wed, Apr 30, 2014 at 08:42:01PM +0200, Miklos Szeredi wrote:
 
  Message-ID: 20140430154958.gc3...@tucsk.piliscsaba.szeredi.hu
 
 I see...  Several points:
   * I still think that it's better to do handling of we see
 DCACHE_DENTRY_KILLED already set in dentry_kill() itself.
   * in dentry_kill(dentry, 0) we *know* that it's not on a shrink
 list - the caller has just removed it from there and we'd kept ->d_lock
 since that point.   What's more, with that scheme we don't need to bother
 with can_free at all - just grab ->d_lock after ->d_release() call and check
 DCACHE_SHRINK_LIST.  No sense trying to avoid that - in case where we
 could just go ahead and free the sucker, there's neither contention nor
 anybody else interested in that cacheline, so spin_lock(&dentry->d_lock)
 is pretty much free.
 
 IOW, I'd prefer to do the following (completely untested) on top of 1/6--4/6:

Sigh...  One more problem:
/*
 * We found an inuse dentry which was not removed from
 * the LRU because of laziness during lookup. Do not free it.
 */
if (dentry->d_lockref.count) {
spin_unlock(&dentry->d_lock);
continue;
}
should become
if (dentry->d_lockref.count > 0) {

instead - lockref_mark_dead() slaps -128 into it, so we'll just leak all
dentries where dput() has come first and decided to leave the suckers
for us.

Another thing: I don't like what's going on with freeing vs. ->d_lock there.
Had that been a mutex, we'd definitely get a repeat of "vfs: fix subtle
use-after-free of pipe_inode_info".  The question is, can spin_unlock(p)
dereference p after another CPU gets through spin_lock(p)?  Linus?

It can be dealt with by setting DCACHE_RCUACCESS in "let another guy free
it cases and playing with rcu_read_lock a bit, but I wonder whether we
need to bother - quick look through alpha/sparc/ppc shows that on those
we do not and the same is true for non-paravirt case on x86.  I hadn't
checked what paravirt one does, though, and I certainly hadn't done
exhaustive search for architectures doing something weird...


Re: dcache shrink list corruption?

2014-04-30 Thread Linus Torvalds
On Wed, Apr 30, 2014 at 12:59 PM, Al Viro v...@zeniv.linux.org.uk wrote:

 Another thing: I don't like what's going on with freeing vs. ->d_lock there.
 Had that been a mutex, we'd definitely get a repeat of "vfs: fix subtle
 use-after-free of pipe_inode_info".  The question is, can spin_unlock(p)
 dereference p after another CPU gets through spin_lock(p)?  Linus?

spin_unlock() *should* be safe wrt that issue.

But I have to say, I think paravirtualized spinlocks may break that.
They do all kinds of "kick waiters" after releasing the lock.

Doesn't the RCU protection solve that, though? Nobody should be
releasing the dentry under us, afaik..

Linus


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 01:23:26PM -0700, Linus Torvalds wrote:
 On Wed, Apr 30, 2014 at 12:59 PM, Al Viro v...@zeniv.linux.org.uk wrote:
 
  Another thing: I don't like what's going on with freeing vs. ->d_lock there.
  Had that been a mutex, we'd definitely get a repeat of "vfs: fix subtle
  use-after-free of pipe_inode_info".  The question is, can spin_unlock(p)
  dereference p after another CPU gets through spin_lock(p)?  Linus?
 
 spin_unlock() *should* be safe wrt that issue.
 
 But I have to say, I think paravirtualized spinlocks may break that.
 They do all kinds of "kick waiters" after releasing the lock.
 
 Doesn't the RCU protection solve that, though? Nobody should be
 releasing the dentry under us, afaik..

We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can
trigger any amount of IO, for one thing.  We can take it around the
couple of places where we do that spin_unlock(&dentry->d_lock) (along with
setting DCACHE_RCUACCESS) - that's what I'd been referring to.  Then this
sucker (tests still running, so far everything seems to survive) becomes
the following (again, on top of 1/6..4/6).  BTW, is there any convenient
way to tell git commit --amend to update the commit date?  Something
like --date=now would be nice, but it isn't accepted...

commit 797ff22681dc969b478ed837787d24dfd2dd2132
Author: Al Viro v...@zeniv.linux.org.uk
Date:   Tue Apr 29 23:52:05 2014 -0400

dentry_kill(): don't try to remove from shrink list

If the victim in on the shrink list, don't remove it from there.
If shrink_dentry_list() manages to remove it from the list before
we are done - fine, we'll just free it as usual.  If not - mark
it with new flag (DCACHE_MAY_FREE) and leave it there.

Eventually, shrink_dentry_list() will get to it, remove the sucker
from shrink list and call dentry_kill(dentry, 0).  Which is where
we'll deal with freeing.

Since now dentry_kill(dentry, 0) may happen after or during
dentry_kill(dentry, 1), we need to recognize that (by seeing
DCACHE_DENTRY_KILLED already set), unlock everything
and either free the sucker (in case DCACHE_MAY_FREE has been
set) or leave it for ongoing dentry_kill(dentry, 1) to deal with.

Signed-off-by: Al Viro v...@zeniv.linux.org.uk

diff --git a/fs/dcache.c b/fs/dcache.c
index e482775..fa40d26 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -489,6 +489,20 @@ relock:
goto relock;
}
 
+   if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+   if (parent)
+   spin_unlock(&parent->d_lock);
+   if (dentry->d_flags & DCACHE_MAY_FREE) {
+   spin_unlock(&dentry->d_lock);
+   dentry_free(dentry);
+   } else {
+   dentry->d_flags |= DCACHE_RCUACCESS;
+   rcu_read_lock();
+   spin_unlock(&dentry->d_lock);
+   rcu_read_unlock();
+   }
+   return parent;
+   }
/*
 * The dentry is now unrecoverably dead to the world.
 */
@@ -504,8 +518,6 @@ relock:
if (dentry->d_flags & DCACHE_LRU_LIST) {
if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
d_lru_del(dentry);
-   else
-   d_shrink_del(dentry);
}
/* if it was on the hash then remove it */
__d_drop(dentry);
@@ -527,7 +539,16 @@ relock:
if (dentry->d_op && dentry->d_op->d_release)
dentry->d_op->d_release(dentry);
 
-   dentry_free(dentry);
+   spin_lock(&dentry->d_lock);
+   if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+   dentry->d_flags |= DCACHE_MAY_FREE | DCACHE_RCUACCESS;
+   rcu_read_lock();
+   spin_unlock(&dentry->d_lock);
+   rcu_read_unlock();
+   } else {
+   spin_unlock(&dentry->d_lock);
+   dentry_free(dentry);
+   }
return parent;
 }
 
@@ -829,7 +850,7 @@ static void shrink_dentry_list(struct list_head *list)
 * We found an inuse dentry which was not removed from
 * the LRU because of laziness during lookup. Do not free it.
 */
-   if (dentry->d_lockref.count) {
+   if (dentry->d_lockref.count > 0) {
spin_unlock(&dentry->d_lock);
continue;
}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3b9bfdb..3c7ec32 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -221,6 +221,8 @@ struct dentry_operations {
 #define DCACHE_SYMLINK_TYPE	0x0030 /* Symlink */
 #define DCACHE_FILE_TYPE	0x0040 /* Other file type */
 
+#define DCACHE_MAY_FREE	0x0080
+
 extern seqlock_t rename_lock;
 
 static inline int dname_external(const struct dentry *dentry)

Re: dcache shrink list corruption?

2014-04-30 Thread Linus Torvalds
On Wed, Apr 30, 2014 at 1:38 PM, Al Viro v...@zeniv.linux.org.uk wrote:

 We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can
 trigger any amount of IO, for one thing.  We can take it around the
 couple of places where we do that spin_unlock(&dentry->d_lock) (along with
 setting DCACHE_RCUACCESS) - that's what I'd been referring to.

Just the last spin_unlock() would be the case that matters, if the
spin_unlock() is done on something that could be freed immediately and
the lock protects and is inside the entity that gets freed.

   BTW, is there any convenient
 way to tell git commit --amend to update the commit date?  Something
 like --date=now would be nice, but it isn't accepted...

--date="$(date)" works.

The --date thing doesn't take the nice git "approxidate" strings,
maybe it should. So you have to give it a real date.

  Linus


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 01:57:05PM -0700, Linus Torvalds wrote:
 On Wed, Apr 30, 2014 at 1:38 PM, Al Viro v...@zeniv.linux.org.uk wrote:
 
  We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can
  trigger any amount of IO, for one thing.  We can take it around the
  couple of places where we do that spin_unlock(&dentry->d_lock) (along with
  setting DCACHE_RCUACCESS) - that's what I'd been referring to.
 
 Just the last spin_unlock() would be the case that matters, if the
 spin_unlock() is done on something that could be freed immediately and
 the lock protects and is inside the entity that gets freed.

*nod*

There are two such spin_unlock (handover from shrink_dentry_list() to
dput() and the opposite one), but they are all that needs protection -
->d_flags update is outside the rcu-critical area.  I really wonder
if we *can* get there without DCACHE_RCUACCESS having been set, though;
dentry would have to be
* picked into shrink list (i.e. have had zero refcount at some point)
* never had been through __d_rehash()
shrink_dentry_list() definitely counts on that being impossible, and it
probably is, but I'm feeling seriously paranoid about the whole area.
I'll finish grepping through the tree and probably drop setting
DCACHE_RCUACCESS from the patch - either that, or set it in d_shrink_add()
if it turns out that it is possible and shrink_dentry_list() is fucked...

Tests seem to be running fine so far...

BTW, is there any convenient
  way to tell git commit --amend to update the commit date?  Something
  like --date=now would be nice, but it isn't accepted...
 
 --date=$(date) works.

Thanks...


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 10:12:06PM +0100, Al Viro wrote:
 On Wed, Apr 30, 2014 at 01:57:05PM -0700, Linus Torvalds wrote:
  On Wed, Apr 30, 2014 at 1:38 PM, Al Viro v...@zeniv.linux.org.uk wrote:
  
   We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can
   trigger any amount of IO, for one thing.  We can take it around the
   couple of places where we do that spin_unlock(&dentry->d_lock) (along with
   setting DCACHE_RCUACCESS) - that's what I'd been referring to.
  
  Just the last spin_unlock() would be the case that matters, if the
  spin_unlock() is done on something that could be freed immediately and
  the lock protects and is inside the entity that gets freed.
 
 *nod*
 
 There are two such spin_unlock (handover from shrink_dentry_list() to
 dput() and the opposite one), but they are all that needs protection -
 ->d_flags update is outside the rcu-critical area.  I really wonder
 if we *can* get there without DCACHE_RCUACCESS having been set, though;
 dentry would have to be
   * picked into shrink list (i.e. have had zero refcount at some point)
   * never had been through __d_rehash()
 shrink_dentry_list() definitely counts on that being impossible, and it
 probably is, but I'm feeling seriously paranoid about the whole area.
 I'll finish grepping through the tree and probably drop setting
 DCACHE_RCUACCESS from the patch - either that, or set it in d_shrink_add()
 if it turns out that it is possible and shrink_dentry_list() is fucked...

OK, it really can't happen.  The proof is more convoluted than I'd like it,
but it's solid enough, so setting that flag in dentry_kill() handover cases
wasn't needed.  I've just pushed the whole thing to vfs.git#for-linus;
review and testing would be very welcome.  I can repost it one more time,
but the only difference compared to the last variant in this thread is not
bothering with DCACHE_RCUACCESS.

It has survived LTP tests, going through xfstests now...


Re: dcache shrink list corruption?

2014-04-30 Thread Linus Torvalds
On Wed, Apr 30, 2014 at 3:12 PM, Al Viro v...@zeniv.linux.org.uk wrote:

 I've just pushed the whole thing to vfs.git#for-linus;
 review and testing would be very welcome.

I have no half-way relevant test-case for this, so I'm hoping people
who have good VFS stress-tests (preferably under memory pressure so
that we get that whole shrinking path) will test.

But it looks fine.

That said, I do hate that RCU read-lock around the final spin-unlock.

Let me go talk to the paravirt people. Maybe they don't need this, and
I don't know exactly *how* they use that lock pointer after the unlock
in the "kick waiters" part. Maybe it's all good.

 Linus


Re: dcache shrink list corruption?

2014-04-30 Thread Linus Torvalds
On Wed, Apr 30, 2014 at 4:04 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

 Let me go talk to the paravirt people. Maybe they don't need this, and
 I don't know exactly *how* they use that lock pointer after the unlock
 in the "kick waiters" part. Maybe it's all good.

.. looking at current users (xen and kvm) it does in fact look all
good. Yes, we kick possible waiters after the unlock, but the lock
itself is not touched by that, it just uses the pointer to the lock as
a way to figure out who to kick.

In fact, I kind of knew that, but had forgotten. We very much depend
on spin_unlock being safe wrt immediate deletion already: the
completion code very much depends on it. It does a spin_unlock()
to "release" the completion, and it can get reused immediately (it might
be on the stack, it might be in some data structure that gets freed).

So I'd suggest removing that whole RCU thing, because it should be
safe to unlock something that can go away immediately. We'd have huge
problems elsewhere if it wasn't safe.

   Linus


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 04:04:58PM -0700, Linus Torvalds wrote:
 On Wed, Apr 30, 2014 at 3:12 PM, Al Viro v...@zeniv.linux.org.uk wrote:
 
  I've just pushed the whole thing to vfs.git#for-linus;
  review and testing would be very welcome.
 
 I have no half-way relevant test-case for this, so I'm hoping people
 who have good VFS stress-tests (preferably under memory pressure so
 that we get that whole shrinking path) will test.
 
 But it looks fine.
 
 That said, I do hate that RCU read-lock around the final spin-unlock.

So do I, obviously...  After looking through the rest of arch_spin_unlock(),
it looks like the only suspicious one except the paravirt on x86 is
blackfin.  And that might be misreading - I'm not familiar enough with the
architecture to tell...


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 04:14:14PM -0700, Linus Torvalds wrote:
 On Wed, Apr 30, 2014 at 4:04 PM, Linus Torvalds
 torva...@linux-foundation.org wrote:
 
  Let me go talk to the paravirt people. Maybe they don't need this, and
  I don't know exactly *how* they use that lock pointer after the unlock
  in the kick waiters part. Maybe it's all good.
 
 .. looking at current users (xen and kvm) it does in fact look all
 good. Yes, we kick possible waiters after the unlock, but the lock
 itself is not touched by that, it just uses the pointer to the lock as
 a way to figure out who to kick.
 
 In fact, I kind of knew that, but had forgotten. We very much depend
 on spin_unlock being safe wrt immediate deleton already: the
 completion code very much depends on it. It does a spin_unlock()
 to release the completion, and it can get reused immediately (it might
 be on the stack, it might be in some data structure that gets freed).
 
 So I'd suggest removing that whole RCU thing, because it should be
 safe to unlock something that can go away immediately. We'd have huge
 problems elsewhere if it wasn't safe.

OK, done and force-pushed.  Should propagate in a few...


Re: dcache shrink list corruption?

2014-04-30 Thread Linus Torvalds
On Wed, Apr 30, 2014 at 4:43 PM, Al Viro v...@zeniv.linux.org.uk wrote:

 OK, done and force-pushed.  Should propagate in a few...

That made it more obvious how the DCACHE_MAY_FREE case ends up
working. And in particular, mind rewriting this:

if (dentry->d_flags & DCACHE_MAY_FREE) {
spin_unlock(&dentry->d_lock);
dentry_free(dentry);
} else {
spin_unlock(&dentry->d_lock);
}
return parent;

as just

bool free = dentry->d_flags & DCACHE_MAY_FREE;
spin_unlock(&dentry->d_lock);
if (free)
dentry_free(dentry);
return parent;

instead? In fact, I get the feeling that the other case later on
really fits the same model:

spin_lock(&dentry->d_lock);
if (dentry->d_flags & DCACHE_SHRINK_LIST) {
dentry->d_flags |= DCACHE_MAY_FREE;
spin_unlock(&dentry->d_lock);
} else {
spin_unlock(&dentry->d_lock);
dentry_free(dentry);
  }

ends up really being better as

spin_lock(&dentry->d_lock);
free = 1;
if (dentry->d_flags & DCACHE_SHRINK_LIST) {
dentry->d_flags |= DCACHE_MAY_FREE;
free = 0;
}
spin_unlock(&dentry->d_lock);
if (free)
dentry_free(dentry);
return parent;

and then suddenly it looks like we have a common exit sequence from
that dentry_kill() function, no?

(The earlier unlock_on_failure exit case is altogether a different case).

I dunno. Maybe not a big deal, but one reason I prefer doing that
free flag is because I really tend to prefer the simple case of
lock-unlock pairing cleanly at the same level. NOT the pattern where
you have one lock at one indentation level, paired with multiple
unlocks for all the different cases.

  Linus


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 05:18:23PM -0700, Linus Torvalds wrote:

> and then suddenly it looks like we have a common exit sequence from
> that dentry_kill() function, no?
>
> (The earlier unlock_on_failure exit case is altogether a different case).
>
> I dunno. Maybe not a big deal, but one reason I prefer doing that
> "free" flag is because I really tend to prefer the simple case of
> lock-unlock pairing cleanly at the same level. NOT the pattern where
> you have one lock at one indentation level, paired with multiple
> unlocks for all the different cases.

Yeah, but... I have such a variant, but the best I could get still generated
code that wasn't particularly nice.  Part might be gcc code generation
sucking for bool, part - extra register pressure...  It has slightly lower
i-cache footprint, though, so it might be worth doing.  Hell knows; that's a
fairly hot codepath, so let's do it that way - I've just pushed an alternative
branch with bool can_free variant; branches in question: vfs.git#for-linus and
vfs.git#dentry_kill-2. Help with profiling is needed; the loads to watch are
the ones where dentry_kill() + dentry_free() are sufficiently high in profiles
for the differences to matter.  Any takers?


Re: dcache shrink list corruption?

2014-04-30 Thread Linus Torvalds
On Wed, Apr 30, 2014 at 7:51 PM, Al Viro <v...@zeniv.linux.org.uk> wrote:
>
>   Help with profiling is needed; the loads to watch are
> the ones where dentry_kill() + dentry_free() are sufficiently high in profiles
> for the differences to matter.  Any takers?

I really hope there are no such loads, my lock/unlock pairing
suggestion was mostly so that the pairing is clearer, not necessarily
for performance.

That said, I'd assume that it might be worth testing at least the
"lots of concurrent lookups of 'simple_dentry_operations' dentries" case.
So most of /proc, most uses of simple_lookup(). That at least triggers
the dentry_kill() path in dput(), so it should be fairly easy to get
profiles.

But real loads with real filesystems? That sounds harder.

  Linus


Re: dcache shrink list corruption?

2014-04-30 Thread Al Viro
On Wed, Apr 30, 2014 at 07:59:43PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 7:51 PM, Al Viro <v...@zeniv.linux.org.uk> wrote:
> >
> >   Help with profiling is needed; the loads to watch are
> > the ones where dentry_kill() + dentry_free() are sufficiently high in profiles
> > for the differences to matter.  Any takers?
>
> I really hope there are no such loads, my lock/unlock pairing
> suggestion was mostly so that the pairing is clearer, not necessarily
> for performance.
>
> That said, I'd assume that it might be worth testing at least the
> "lots of concurrent lookups of 'simple_dentry_operations' dentries" case.
> So most of /proc, most uses of simple_lookup(). That at least triggers
> the dentry_kill() path in dput(), so it should be fairly easy to get
> profiles.
>
> But real loads with real filesystems? That sounds harder.

Well, the simplest way to do that is a bunch of open/unlink/close.  Another
one is cp -rl / rm -rf of a large tree (rmdir(2) does shrink_dcache_parent()). 
For that matter, unmapping an anon shared mapping will trigger the same path.


Re: dcache shrink list corruption?

2014-04-29 Thread Al Viro
On Tue, Apr 29, 2014 at 07:56:13PM -0700, Linus Torvalds wrote:
> On Tue, Apr 29, 2014 at 7:31 PM, Al Viro  wrote:
> >
> > OK, aggregate diff follows, more readable splitup (3 commits) attached.
> > It seems to survive beating here; testing, review and comments are
> > welcome.
> 
> Miklos, did you have some particular load that triggered this, or was
> it just some reports? It would be really good to get this patch some
> stress-testing.
> 
> I like how the patch removes more lines than it adds, but apart from
> that it's hard to read the patch (even the split-out ones) and say
> anything more about it. I think this needs a *lot* of testing.

FWIW, the first two really are straightforward - each just expands a function
into its only callsite.  The last needs more splitup.  Not sure if the
following is good enough, but it ought to be at least somewhat cleaner.
Combined change is identical to the original, so it doesn't invalidate
the testing so far...

From 895aeb48465bbf78803fd11dee2556d010ada23a Mon Sep 17 00:00:00 2001
From: Al Viro 
Date: Tue, 29 Apr 2014 15:45:28 -0400
Subject: [PATCH 1/6] fold d_kill() and d_free()

Signed-off-by: Al Viro 
---
 fs/dcache.c |   76 +++
 1 file changed, 24 insertions(+), 52 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 494a9def..9b15c5c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -246,23 +246,6 @@ static void __d_free(struct rcu_head *head)
kmem_cache_free(dentry_cache, dentry); 
 }
 
-/*
- * no locks, please.
- */
-static void d_free(struct dentry *dentry)
-{
-   BUG_ON((int)dentry->d_lockref.count > 0);
-   this_cpu_dec(nr_dentry);
-   if (dentry->d_op && dentry->d_op->d_release)
-   dentry->d_op->d_release(dentry);
-
-   /* if dentry was never visible to RCU, immediate free is OK */
-   if (!(dentry->d_flags & DCACHE_RCUACCESS))
-   __d_free(&dentry->d_u.d_rcu);
-   else
-   call_rcu(&dentry->d_u.d_rcu, __d_free);
-}
-
 /**
  * dentry_rcuwalk_barrier - invalidate in-progress rcu-walk lookups
  * @dentry: the target dentry
@@ -420,40 +403,6 @@ static void dentry_lru_del(struct dentry *dentry)
 }
 
 /**
- * d_kill - kill dentry and return parent
- * @dentry: dentry to kill
- * @parent: parent dentry
- *
- * The dentry must already be unhashed and removed from the LRU.
- *
- * If this is the root of the dentry tree, return NULL.
- *
- * dentry->d_lock and parent->d_lock must be held by caller, and are dropped by
- * d_kill.
- */
-static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
-   __releases(dentry->d_lock)
-   __releases(parent->d_lock)
-   __releases(dentry->d_inode->i_lock)
-{
-   list_del(&dentry->d_u.d_child);
-   /*
-* Inform d_walk() that we are no longer attached to the
-* dentry tree
-*/
-   dentry->d_flags |= DCACHE_DENTRY_KILLED;
-   if (parent)
-   spin_unlock(&parent->d_lock);
-   dentry_iput(dentry);
-   /*
-* dentry_iput drops the locks, at which point nobody (except
-* transient RCU lookups) can reach this dentry.
-*/
-   d_free(dentry);
-   return parent;
-}
-
-/**
  * d_drop - drop a dentry
  * @dentry: dentry to drop
  *
@@ -546,7 +495,30 @@ relock:
dentry_lru_del(dentry);
/* if it was on the hash then remove it */
__d_drop(dentry);
-   return d_kill(dentry, parent);
+   list_del(&dentry->d_u.d_child);
+   /*
+* Inform d_walk() that we are no longer attached to the
+* dentry tree
+*/
+   dentry->d_flags |= DCACHE_DENTRY_KILLED;
+   if (parent)
+   spin_unlock(&parent->d_lock);
+   dentry_iput(dentry);
+   /*
+* dentry_iput drops the locks, at which point nobody (except
+* transient RCU lookups) can reach this dentry.
+*/
+   BUG_ON((int)dentry->d_lockref.count > 0);
+   this_cpu_dec(nr_dentry);
+   if (dentry->d_op && dentry->d_op->d_release)
+   dentry->d_op->d_release(dentry);
+
+   /* if dentry was never visible to RCU, immediate free is OK */
+   if (!(dentry->d_flags & DCACHE_RCUACCESS))
+   __d_free(&dentry->d_u.d_rcu);
+   else
+   call_rcu(&dentry->d_u.d_rcu, __d_free);
+   return parent;
 }
 
 /* 
-- 
1.7.10.4


From aff934c47717be0216c9e2c10a2e8ca0f829bb54 Mon Sep 17 00:00:00 2001
From: Al Viro 
Date: Tue, 29 Apr 2014 16:13:18 -0400
Subject: [PATCH 2/6] fold try_prune_one_dentry()

Signed-off-by: Al Viro 
---
 fs/dcache.c |   75 ---
 1 file changed, 25 insertions(+), 50 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 9b15c5c..a5540d4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -787,47 +787,9 @@ restart:
 }
 EXPORT_SYMBOL(d_prune_aliases);
 
-/*
- * Try to throw away a dentry - free the inode, dput the parent.
- * Requires dentry->d_lock is held, and dentry->d_count == 0.
- * Releases dentry->d_lock.
- *

Re: dcache shrink list corruption?

2014-04-29 Thread Linus Torvalds
On Tue, Apr 29, 2014 at 7:31 PM, Al Viro  wrote:
>
> OK, aggregate diff follows, more readable splitup (3 commits) attached.
> It seems to survive beating here; testing, review and comments are
> welcome.

Miklos, did you have some particular load that triggered this, or was
it just some reports? It would be really good to get this patch some
stress-testing.

I like how the patch removes more lines than it adds, but apart from
that it's hard to read the patch (even the split-out ones) and say
anything more about it. I think this needs a *lot* of testing.

 Linus


Re: dcache shrink list corruption?

2014-04-29 Thread Al Viro
On Wed, Apr 30, 2014 at 12:20:13AM +0100, Al Viro wrote:
> On Tue, Apr 29, 2014 at 04:04:11PM -0700, Linus Torvalds wrote:
> > But at a minimum, we have "d_op->d_prune()" that would now be possibly
> > be called for the old dentry *after* a new dentry has been allocated.
> > Not to mention the inode not having been dropped. So it looks like a
> > disaster where the filesystem now ends up having concurrent "live"
> > dentries for the same entry.  Even if one of them is on its way out,
> > it's still visible to the filesystem. That makes me very
> > uncomfortable.
> > 
> > I'm starting to think that Miklos' original patch (perhaps with the
> > spinlock split to at least be one per superblock) is the most
> > straightforward approach after all. It's annoying having to add locks
> > here, but the whole pruning path should not be a hotpath, so maybe
> > it's not actually a big deal.
> 
> FWIW, my solution is more or less working; I'll give more local beating
> and post it... 

OK, aggregate diff follows, more readable splitup (3 commits) attached.
It seems to survive beating here; testing, review and comments are
welcome.

diff --git a/fs/dcache.c b/fs/dcache.c
index 494a9def..a83b933 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -246,16 +246,8 @@ static void __d_free(struct rcu_head *head)
kmem_cache_free(dentry_cache, dentry); 
 }
 
-/*
- * no locks, please.
- */
-static void d_free(struct dentry *dentry)
+static void dentry_free(struct dentry *dentry)
 {
-   BUG_ON((int)dentry->d_lockref.count > 0);
-   this_cpu_dec(nr_dentry);
-   if (dentry->d_op && dentry->d_op->d_release)
-   dentry->d_op->d_release(dentry);
-
/* if dentry was never visible to RCU, immediate free is OK */
if (!(dentry->d_flags & DCACHE_RCUACCESS))
__d_free(&dentry->d_u.d_rcu);
@@ -420,40 +412,6 @@ static void dentry_lru_del(struct dentry *dentry)
 }
 
 /**
- * d_kill - kill dentry and return parent
- * @dentry: dentry to kill
- * @parent: parent dentry
- *
- * The dentry must already be unhashed and removed from the LRU.
- *
- * If this is the root of the dentry tree, return NULL.
- *
- * dentry->d_lock and parent->d_lock must be held by caller, and are dropped by
- * d_kill.
- */
-static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
-   __releases(dentry->d_lock)
-   __releases(parent->d_lock)
-   __releases(dentry->d_inode->i_lock)
-{
-   list_del(&dentry->d_u.d_child);
-   /*
-* Inform d_walk() that we are no longer attached to the
-* dentry tree
-*/
-   dentry->d_flags |= DCACHE_DENTRY_KILLED;
-   if (parent)
-   spin_unlock(&parent->d_lock);
-   dentry_iput(dentry);
-   /*
-* dentry_iput drops the locks, at which point nobody (except
-* transient RCU lookups) can reach this dentry.
-*/
-   d_free(dentry);
-   return parent;
-}
-
-/**
  * d_drop - drop a dentry
  * @dentry: dentry to drop
  *
@@ -499,6 +457,8 @@ void d_drop(struct dentry *dentry)
 }
 EXPORT_SYMBOL(d_drop);
 
+static DECLARE_WAIT_QUEUE_HEAD(free_wq);
+
 /*
  * Finish off a dentry we've decided to kill.
  * dentry->d_lock must be held, returns with it unlocked.
@@ -506,16 +466,17 @@ EXPORT_SYMBOL(d_drop);
  * Returns dentry requiring refcount drop, or NULL if we're done.
  */
 static struct dentry *
-dentry_kill(struct dentry *dentry, int unlock_on_failure)
+dentry_kill(struct dentry *dentry, struct list_head *morgue)
__releases(dentry->d_lock)
 {
struct inode *inode;
struct dentry *parent;
+   bool can_free = true;
 
inode = dentry->d_inode;
if (inode && !spin_trylock(&inode->i_lock)) {
 relock:
-   if (unlock_on_failure) {
+   if (!morgue) {
spin_unlock(&dentry->d_lock);
cpu_relax();
}
@@ -531,6 +492,15 @@ relock:
goto relock;
}
 
+   if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+   /* morgue must be non-NULL */
list_move(&dentry->d_lru, morgue);
+   if (parent)
spin_unlock(&parent->d_lock);
+   /* inode must be NULL */
spin_unlock(&dentry->d_lock);
+   return parent;
+   }
/*
 * The dentry is now unrecoverably dead to the world.
 */
@@ -543,10 +513,43 @@ relock:
if ((dentry->d_flags & DCACHE_OP_PRUNE) && !d_unhashed(dentry))
dentry->d_op->d_prune(dentry);
 
-   dentry_lru_del(dentry);
+   if (dentry->d_flags & DCACHE_LRU_LIST) {
+   if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
+   d_lru_del(dentry);
+   else if (morgue)
+   d_shrink_del(dentry);
+   else
+   can_free = false;
+   }
/* if it was on the hash then remove it */
__d_drop(dentry);
-   return d_kill(dentry, parent);
+   

Re: dcache shrink list corruption?

2014-04-29 Thread Al Viro
On Tue, Apr 29, 2014 at 04:04:11PM -0700, Linus Torvalds wrote:
> But at a minimum, we have "d_op->d_prune()" that would now be possibly
> be called for the old dentry *after* a new dentry has been allocated.
> Not to mention the inode not having been dropped. So it looks like a
> disaster where the filesystem now ends up having concurrent "live"
> dentries for the same entry.  Even if one of them is on its way out,
> it's still visible to the filesystem. That makes me very
> uncomfortable.
> 
> I'm starting to think that Miklos' original patch (perhaps with the
> spinlock split to at least be one per superblock) is the most
> straightforward approach after all. It's annoying having to add locks
> here, but the whole pruning path should not be a hotpath, so maybe
> it's not actually a big deal.

FWIW, my solution is more or less working; I'll give more local beating
and post it... 


Re: dcache shrink list corruption?

2014-04-29 Thread Linus Torvalds
On Tue, Apr 29, 2014 at 2:48 PM, Al Viro  wrote:
>
> Ummm...  You mean, have d_lookup() et.al. fail on something that is on
> a shrink list?

So I tried to see if that would work just consider it dead by the time
it hits the shrink list, and if somebody does a lookup on the dentry,
fail on it and just allocate a new dentry and do a real lookup.

But at a minimum, we have "d_op->d_prune()" that would now be possibly
be called for the old dentry *after* a new dentry has been allocated.
Not to mention the inode not having been dropped. So it looks like a
disaster where the filesystem now ends up having concurrent "live"
dentries for the same entry.  Even if one of them is on its way out,
it's still visible to the filesystem. That makes me very
uncomfortable.

I'm starting to think that Miklos' original patch (perhaps with the
spinlock split to at least be one per superblock) is the most
straightforward approach after all. It's annoying having to add locks
here, but the whole pruning path should not be a hotpath, so maybe
it's not actually a big deal.

   Linus


Re: dcache shrink list corruption?

2014-04-29 Thread Al Viro
On Wed, Apr 30, 2014 at 07:18:51AM +1000, Dave Chinner wrote:

> Seems like it would work, but it seems fragile to me - I'm
> wondering how we can ensure that the private shrink list
> manipulations can be kept private.
> 
> We have a similar situation with the inode cache (private shrink
> list) but the I_FREEING flag is set the entire time the inode is on
> the shrink list. Any new hash lookup or attempt to grab the inode
> that occurs while I_FREEING is set fails, so perhaps dentries also
> need a well defined "being torn down and freed" state where new
> references cannot be taken even though the dentry can still be
> found...

Ummm...  You mean, have d_lookup() et.al. fail on something that is on
a shrink list?

