Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
> On Tue, 13 Nov 2007, Erez Zadok wrote:
[...]
> I'm glad to report that this unionfs, not the one in 2.6.24-rc2-mm1
> but the one including those 9 patches you posted, now gets through
> my testing with tmpfs without a problem.  I do still get occasional
>   "unionfs: new lower inode mtime (bindex=0, name=<directory>)"
> messages, but nothing worse seen yet: a big improvement.

Excellent.

> I did think you could clean up the doubled set_page_dirtys,
> but it's of no consequence.

Yes, looks good.  I'll send that as a patch.  Thanks.

> Hugh
>
> --- 2.6.24-rc2-mm1+9/fs/unionfs/mmap.c	2007-11-17 12:23:30.0 +
> +++ linux/fs/unionfs/mmap.c	2007-11-17 20:22:29.0 +
> @@ -56,6 +56,7 @@ static int unionfs_writepage(struct page
>  	copy_highpage(lower_page, page);
>  	flush_dcache_page(lower_page);
>  	SetPageUptodate(lower_page);
> +	set_page_dirty(lower_page);
>
>  	/*
>  	 * Call lower writepage (expects locked page).  However, if we are
> @@ -66,12 +67,11 @@ static int unionfs_writepage(struct page
>  	 * success.
>  	 */
>  	if (wbc->for_reclaim) {
> -		set_page_dirty(lower_page);
>  		unlock_page(lower_page);
>  		goto out_release;
>  	}
> +
>  	BUG_ON(!lower_mapping->a_ops->writepage);
> -	set_page_dirty(lower_page);
>  	clear_page_dirty_for_io(lower_page);	/* emulate VFS behavior */
>  	err = lower_mapping->a_ops->writepage(lower_page, wbc);
>  	if (err < 0)

Erez.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
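[Editorial note: the control flow of the patched unionfs_writepage above can be sketched as a small userspace simulation.  The sim_* names and struct are hypothetical stand-ins; the real kernel helpers do far more (atomic page flags, radix-tree tagging, per-BDI accounting).  This is only a sketch of the ordering the diff establishes: one unconditional set_page_dirty() after the copy, instead of two conditional ones.]

```c
#include <assert.h>

/* Minimal stand-in for the lower page's state; the real struct page
 * keeps these as atomic bit flags. */
struct sim_page {
	int uptodate;
	int dirty;
	int locked;
	int written;	/* how many times the "disk" saw this page */
};

static void sim_set_page_dirty(struct sim_page *p) { p->dirty = 1; }
static void sim_unlock_page(struct sim_page *p)    { p->locked = 0; }

/* Like clear_page_dirty_for_io(): report whether the page was dirty,
 * clearing the flag so the writeback path now owns the page. */
static int sim_clear_page_dirty_for_io(struct sim_page *p)
{
	int was_dirty = p->dirty;
	p->dirty = 0;
	return was_dirty;
}

static int sim_lower_writepage(struct sim_page *p)
{
	p->written++;
	sim_unlock_page(p);	/* ->writepage unlocks the page */
	return 0;
}

/* Mirrors the patched flow: dirty the lower page once after the copy,
 * then either leave it dirty for later flushing (for_reclaim) or
 * emulate the VFS by clearing the dirty bit and writing it out now. */
static int sim_unionfs_writepage(struct sim_page *lower_page, int for_reclaim)
{
	lower_page->locked = 1;		/* find_or_create_page returns it locked */
	lower_page->uptodate = 1;	/* after copy_highpage + flush_dcache_page */
	sim_set_page_dirty(lower_page);

	if (for_reclaim) {
		sim_unlock_page(lower_page);
		return 0;		/* still dirty: the lower fs flushes it later */
	}
	sim_clear_page_dirty_for_io(lower_page);
	return sim_lower_writepage(lower_page);
}
```

In the for_reclaim case the lower page ends up unlocked but still dirty, deferring the real write; otherwise the dirty bit is handed to the writeback path exactly once, which is why the two earlier set_page_dirty() calls could collapse into the single hoisted one.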
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Tue, 13 Nov 2007, Erez Zadok wrote:
>
> I posted all of these patches just now.  You're CC'ed.  Hopefully Andrew
> can pull from my unionfs.git branch soon.
>
> You also reported in your previous emails some hangs/oopses while doing
> make -j 20 in unionfs on top of a single tmpfs, using -mm.  After several
> days, I've not been able to reproduce this w/ my latest set of patches.
> If you can send me your .config and the specs on the h/w you're using
> (cpus, mem, etc.), I'll see if I can find something similar to it on my
> end and run the same tests.

I'm glad to report that this unionfs, not the one in 2.6.24-rc2-mm1
but the one including those 9 patches you posted, now gets through
my testing with tmpfs without a problem.  I do still get occasional
  "unionfs: new lower inode mtime (bindex=0, name=<directory>)"
messages, but nothing worse seen yet: a big improvement.

I deceived myself for a while that the danger of shmem_writepage
hitting its BUG_ON(entry->val) was dealt with too; but that's wrong,
I must go back to working out an escape from that one (despite never
seeing it).

I did think you could clean up the doubled set_page_dirtys,
but it's of no consequence.

Hugh

--- 2.6.24-rc2-mm1+9/fs/unionfs/mmap.c	2007-11-17 12:23:30.0 +
+++ linux/fs/unionfs/mmap.c	2007-11-17 20:22:29.0 +
@@ -56,6 +56,7 @@ static int unionfs_writepage(struct page
 	copy_highpage(lower_page, page);
 	flush_dcache_page(lower_page);
 	SetPageUptodate(lower_page);
+	set_page_dirty(lower_page);
 
 	/*
 	 * Call lower writepage (expects locked page).  However, if we are
@@ -66,12 +67,11 @@ static int unionfs_writepage(struct page
 	 * success.
 	 */
 	if (wbc->for_reclaim) {
-		set_page_dirty(lower_page);
 		unlock_page(lower_page);
 		goto out_release;
 	}
+
 	BUG_ON(!lower_mapping->a_ops->writepage);
-	set_page_dirty(lower_page);
 	clear_page_dirty_for_io(lower_page);	/* emulate VFS behavior */
 	err = lower_mapping->a_ops->writepage(lower_page, wbc);
 	if (err < 0)
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
> On Fri, 9 Nov 2007, Erez Zadok wrote:
> > In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
> >
> > > Three, I believe you need to add a flush_dcache_page(lower_page)
> > > after the copy_highpage(lower_page): some architectures will need
> > > that to see the new data if they have lower_page mapped (though I
> > > expect it's anyway shaky ground to be accessing through the lower
> > > mount at the same time as modifying through the upper).
> >
> > OK.
>
> While looking into something else entirely, I realize that _here_
> you are missing a SetPageUptodate(lower_page): should go in after
> the flush_dcache_page(lower_page) I'm suggesting.  (Nick would argue
> for some kind of barrier there too, but I don't think unionfs has a
> special need to be ahead of the pack on that issue.)
>
> Think about it:
> when find_or_create_page has created a fresh page in the cache,
> and you've just done copy_highpage to put the data into it, you
> now need to mark it as Uptodate: otherwise a subsequent vfs_read
> or whatever on the lower level will find that page !Uptodate and
> read stale data back from disk instead of what you just copied in,
> unless its dirtiness has got it written back to disk meanwhile.

Hehe.  Funny you mention this...  A few days ago, while I was doing your
other recommended pageuptodate cleanups, I also added the same call to
SetPageUptodate(lower_page) as you suggested.  I tested that change along
w/ the other changes you suggested, and they all seem to work great all
the way from my 2.6.9 backport to 2.6.24-rc2 and -mm (modulo the fact
that I had to work around or fix more non-unionfs bugs in -mm than
unionfs ones to get it to work :-)

I posted all of these patches just now.  You're CC'ed.  Hopefully Andrew
can pull from my unionfs.git branch soon.

You also reported in your previous emails some hangs/oopses while doing
make -j 20 in unionfs on top of a single tmpfs, using -mm.  After several
days, I've not been able to reproduce this w/ my latest set of patches.
If you can send me your .config and the specs on the h/w you're using
(cpus, mem, etc.), I'll see if I can find something similar to it on my
end and run the same tests.

Cheers,
Erez.
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Fri, 9 Nov 2007, Erez Zadok wrote:
> In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
>
> > Three, I believe you need to add a flush_dcache_page(lower_page)
> > after the copy_highpage(lower_page): some architectures will need
> > that to see the new data if they have lower_page mapped (though I
> > expect it's anyway shaky ground to be accessing through the lower
> > mount at the same time as modifying through the upper).
>
> OK.

While looking into something else entirely, I realize that _here_
you are missing a SetPageUptodate(lower_page): should go in after
the flush_dcache_page(lower_page) I'm suggesting.  (Nick would argue
for some kind of barrier there too, but I don't think unionfs has a
special need to be ahead of the pack on that issue.)

Think about it: when find_or_create_page has created a fresh page in
the cache, and you've just done copy_highpage to put the data into it,
you now need to mark it as Uptodate: otherwise a subsequent vfs_read
or whatever on the lower level will find that page !Uptodate and read
stale data back from disk instead of what you just copied in, unless
its dirtiness has got it written back to disk meanwhile.

Odd that that hasn't been noticed at all: I guess it may be hard to
get testing to reclaim lower/upper pages in such a way as to test out
such paths thoroughly.

Hugh
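[Editorial note: Hugh's stale-data scenario can be made concrete with a toy page cache in userspace C.  The toy_* names are hypothetical; the real logic lives in mm/filemap.c.  A freshly created cache page starts !Uptodate, so a later read refills it from backing store unless whoever copied data in also marked it Uptodate.]

```c
#include <assert.h>
#include <string.h>

struct toy_page {
	char data[16];
	int uptodate;
};

/* What a read path does with a cache page: if it is not Uptodate,
 * re-read it from backing store (`disk`), clobbering the cached copy. */
static const char *toy_read(struct toy_page *p, const char *disk)
{
	if (!p->uptodate) {
		strcpy(p->data, disk);	/* stale on-disk contents win */
		p->uptodate = 1;
	}
	return p->data;
}

/* Copy new data into a fresh (find_or_create_page-style) cache page.
 * `mark_uptodate` models whether SetPageUptodate() is called after the
 * copy_highpage()/flush_dcache_page() pair. */
static void toy_copy_in(struct toy_page *p, const char *src, int mark_uptodate)
{
	strcpy(p->data, src);
	if (mark_uptodate)
		p->uptodate = 1;
}
```

Without the SetPageUptodate, the copied-in data silently loses to whatever was on disk the next time anyone reads through the lower mapping, which is exactly the failure mode Hugh describes.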
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Fri, 9 Nov 2007, Erez Zadok wrote:
> In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
> >
> > One, I think you would be safer to do a set_page_dirty(lower_page)
> > before your clear_page_dirty_for_io(lower_page).  I know that sounds
> > silly, but see Linus' "Yes, Virginia" comment in clear_page_dirty_for_io:
> > there's a lot of subtlety hereabouts, and I think you'd be mimicking the
> > usual path closer if you set_page_dirty first - there's nothing else
> > doing it on that lower_page, is there?  I'm not certain that you need
> > to, but I think you'd do well to look into it and make up your own mind.
>
> Hugh, my code looks like:
>
> 	if (wbc->for_reclaim) {
> 		set_page_dirty(lower_page);
> 		unlock_page(lower_page);
> 		goto out_release;
> 	}
> 	BUG_ON(!lower_mapping->a_ops->writepage);
> 	clear_page_dirty_for_io(lower_page);	/* emulate VFS behavior */
> 	err = lower_mapping->a_ops->writepage(lower_page, wbc);
>
> Do you mean I should set_page_dirty(lower_page) unconditionally before
> clear_page_dirty_for_io?  (I already do that in the 'if' statement
> above it.)

Yes.  Whether you're wrong not to be doing that already, I've not
checked; but I think doing so will make unionfs safer against any future
changes in the relationship between set_page_dirty and
clear_page_dirty_for_io.

For example, look at clear_page_dirty_for_io: it's decrementing some
statistics which __set_page_dirty_nobuffers increments.  Does use of
unionfs (over some filesystems) make those numbers wrap increasingly
negative?  Does adding this set_page_dirty(lower_page) correct that?
I suspect so, but may be wrong.

> > Two, I'm unsure of the way you're clearing or setting PageUptodate on
> > the upper page there.  The rules for PageUptodate are fairly obvious
> > when reading, but when a write fails, it's not so obvious.  Again, I'm
> > not saying what you've got is wrong (it may be unavoidable, to keep
> > synch between lower and upper), but it deserves a second thought.
>
> I looked at all mainline filesystems' ->writepage to see what, if any,
> they do with their page's uptodate flag.  Most f/s don't touch the
> flag one way or another.

I'm not going to try and guess what assorted filesystems are up to, and
not all of them will be bugfree.  The crucial point of PageUptodate is
that we insert a filesystem page into the page cache before it's had any
data read in: it needs to be !PageUptodate until the data is there, and
then marked PageUptodate to say the data is good and others can start
using it.  See mm/filemap.c.

> And finally, unionfs clears the uptodate flag on error from the lower
> ->writepage, and otherwise sets the flag on success from the lower
> ->writepage.  My gut feeling is that unionfs shouldn't change the page
> uptodate flag at all: if the VFS passes unionfs_writepage a page which
> isn't uptodate, then the VFS has a serious problem b/c it'd be asking a
> f/s to write out a page which isn't up-to-date, right?  Otherwise,
> whether unionfs_writepage manages to write the lower page or not, why
> should that invalidate the state of the unionfs page itself?  Come to
> think of it, I think clearing pageuptodate on error from
> ->writepage(lower_page) may be bad.  Imagine if after such a failed
> unionfs_writepage, I get a unionfs_readpage: that ->readpage will get
> data from the lower f/s page and copy it *over* the unionfs page, even
> if the upper page's data was more recent prior to the failed call to
> unionfs_writepage.  IOW, we could be reverting a user-visible mmap'ed
> page to a previous on-disk version.  What do you think: could this
> happen?  Anyway, I'll run some exhaustive testing next and see what
> happens if I don't set/clear the uptodate flag in unionfs_writepage.

That was my point, and I don't really have more to add.  It's unusual
to do anything with PageUptodate when writing.  By clearing it when the
lower level has an error, you're throwing away the changes already made
at the upper level.  You might have some good reason for that, but it's
worth questioning.  If you don't know why you're Set/Clear'ing it there,
better to just take that out.

Hugh
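[Editorial note: Hugh's accounting worry can be illustrated with a toy counter.  This is a hypothetical userspace sketch; the real code adjusts NR_FILE_DIRTY and per-BDI statistics inside __set_page_dirty_nobuffers and clear_page_dirty_for_io.  The point is that clear_page_dirty_for_io decrements a statistic that only an accounted set_page_dirty increments, so a page dirtied behind the accounting's back drives the counter negative.]

```c
#include <assert.h>

/* Global dirty-page statistic, maintained by the set/clear pair below. */
static long nr_file_dirty;

struct acct_page { int dirty; };

/* Raw flag set with no accounting (a TestSetPageDirty-style shortcut). */
static void raw_set_page_dirty_flag(struct acct_page *p) { p->dirty = 1; }

/* __set_page_dirty_nobuffers-style: only a 0->1 transition bumps the stat. */
static void acct_set_page_dirty(struct acct_page *p)
{
	if (!p->dirty) {
		p->dirty = 1;
		nr_file_dirty++;
	}
}

/* clear_page_dirty_for_io-style: a 1->0 transition drops the stat.  If
 * the flag was set without the matching increment, this goes negative. */
static int acct_clear_page_dirty_for_io(struct acct_page *p)
{
	if (p->dirty) {
		p->dirty = 0;
		nr_file_dirty--;
		return 1;
	}
	return 0;
}
```

An explicit set_page_dirty(lower_page) before clear_page_dirty_for_io(lower_page) keeps the increment/decrement pair balanced regardless of how the lower page came to be dirty, which is the safety margin Hugh is recommending.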
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
> [Dave, I've Cc'ed you re handle_write_count_underflow, see below.]
>
> On Wed, 31 Oct 2007, Erez Zadok wrote:
> >
> > Hi Hugh, I've addressed all of your concerns and am happy to report
> > that the newly revised unionfs_writepage works even better, including
> > under my memory-pressure conditions.  To summarize my changes since
> > the last time:
> >
> > - I'm only masking __GFP_FS, not __GFP_IO
> > - using find_or_create_page to avoid locking issues around mapping mask
> > - handle for_reclaim case more efficiently
> > - using copy_highpage so we handle KM_USER*
> > - un/locking upper/lower page as/when needed
> > - updated comments to clarify what/why
> > - unionfs_sync_page: gone (yes, vfs.txt did confuse me, plus ecryptfs
> >   used to have it)
> >
> > Below is the newest version of unionfs_writepage.  Let me know what
> > you think.
> >
> > I have to say that with these changes, unionfs appears visibly faster
> > under memory pressure.  I suspect the for_reclaim handling is probably
> > the largest contributor to this speedup.
>
> That's good news, and that unionfs_writepage looks good to me -
> with three reservations I've not observed before.
>
> One, I think you would be safer to do a set_page_dirty(lower_page)
> before your clear_page_dirty_for_io(lower_page).  I know that sounds
> silly, but see Linus' "Yes, Virginia" comment in clear_page_dirty_for_io:
> there's a lot of subtlety hereabouts, and I think you'd be mimicking the
> usual path closer if you set_page_dirty first - there's nothing else
> doing it on that lower_page, is there?  I'm not certain that you need
> to, but I think you'd do well to look into it and make up your own mind.

Hugh, my code looks like:

	if (wbc->for_reclaim) {
		set_page_dirty(lower_page);
		unlock_page(lower_page);
		goto out_release;
	}
	BUG_ON(!lower_mapping->a_ops->writepage);
	clear_page_dirty_for_io(lower_page);	/* emulate VFS behavior */
	err = lower_mapping->a_ops->writepage(lower_page, wbc);

Do you mean I should set_page_dirty(lower_page) unconditionally before
clear_page_dirty_for_io?  (I already do that in the 'if' statement above
it.)

> Two, I'm unsure of the way you're clearing or setting PageUptodate on
> the upper page there.  The rules for PageUptodate are fairly obvious
> when reading, but when a write fails, it's not so obvious.  Again, I'm
> not saying what you've got is wrong (it may be unavoidable, to keep
> synch between lower and upper), but it deserves a second thought.

I looked at all mainline filesystems' ->writepage to see what, if any,
they do with their page's uptodate flag.  Most f/s don't touch the flag
one way or another.

cifs_writepage sets the uptodate flag unconditionally: why?

ecryptfs_writepage has a legit reason: if encrypting the page failed, it
doesn't want anyone to use it, so it clears its page's uptodate flag
(else it sets it as uptodate).

hostfs_writepage clears pageuptodate if it failed to write_file(), which
I'm not sure if it makes sense or not.

ntfs_writepage goes as far as doing BUG_ON(!PageUptodate(page)), which
indicates to me that the page passed to ->writepage should always be
uptodate.  Is that a fair statement?

smb_writepage pretty much unconditionally calls SetPageUptodate(page).
Why?  Is there a reason smbfs and cifs both do this unconditionally?  If
so, then why is ntfs calling BUG_ON if the page isn't uptodate?  Either
that BUG_ON in ntfs is redundant, or cifs/smbfs's SetPageUptodate is
redundant, but they can't both be right.

And finally, unionfs clears the uptodate flag on error from the lower
->writepage, and otherwise sets the flag on success from the lower
->writepage.  My gut feeling is that unionfs shouldn't change the page
uptodate flag at all: if the VFS passes unionfs_writepage a page which
isn't uptodate, then the VFS has a serious problem b/c it'd be asking a
f/s to write out a page which isn't up-to-date, right?  Otherwise,
whether unionfs_writepage manages to write the lower page or not, why
should that invalidate the state of the unionfs page itself?  Come to
think of it, I think clearing pageuptodate on error from
->writepage(lower_page) may be bad.  Imagine if after such a failed
unionfs_writepage, I get a unionfs_readpage: that ->readpage will get
data from the lower f/s page and copy it *over* the unionfs page, even
if the upper page's data was more recent prior to the failed call to
unionfs_writepage.  IOW, we could be reverting a user-visible mmap'ed
page to a previous on-disk version.  What do you think: could this
happen?  Anyway, I'll run some exhaustive testing next and see what
happens if I don't set/clear the uptodate flag in unionfs_writepage.

> Three, I believe you need to add a flush_dcache_page(lower_page)
> after the copy_highpage(lower_page): some architectures will need
> that to see the new data if they have lower_page mapped
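[Editorial note: Erez's reversion worry above can be played out in a toy model.  The uf_* names are hypothetical userspace stand-ins for the unionfs upper/lower pair: if a failed lower writepage clears the upper page's uptodate flag, the next readpage re-copies the older lower contents over the newer upper ones.]

```c
#include <assert.h>
#include <string.h>

struct uf_page {
	char data[16];
	int uptodate;
};

/* Toy unionfs_writepage where the lower write fails; `clear_on_error`
 * models the questionable ClearPageUptodate(upper) on that error path. */
static int uf_writepage_fails(struct uf_page *upper, int clear_on_error)
{
	if (clear_on_error)
		upper->uptodate = 0;
	return -5;	/* -EIO-style failure from the lower ->writepage */
}

/* Toy unionfs_readpage: only refills the upper page from the lower one
 * when the upper page is not Uptodate. */
static const char *uf_readpage(struct uf_page *upper, const char *lower)
{
	if (!upper->uptodate) {
		strcpy(upper->data, lower);
		upper->uptodate = 1;
	}
	return upper->data;
}
```

With the clear, a user-visible mmap'ed page silently reverts to the previous on-disk version after one failed writeback; without it, the newer upper data survives the error, which supports the instinct to leave the upper page's uptodate flag alone in ->writepage.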
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, Dave Hansen writes:
> On Mon, 2007-11-05 at 15:40 +, Hugh Dickins wrote:
[...]
> I have a decent guess what the bug is, too.  In the unionfs code:
>
> > int init_lower_nd(struct nameidata *nd, unsigned int flags)
> > {
> > ...
> > #ifdef ALLOC_LOWER_ND_FILE
> > 	file = kzalloc(sizeof(struct file), GFP_KERNEL);
> > 	if (unlikely(!file)) {
> > 		err = -ENOMEM;
> > 		break; /* exit switch statement and thus return */
> > 	}
> > 	nd->intent.open.file = file;
> > #endif /* ALLOC_LOWER_ND_FILE */
>
> The r/o bind mount code will mnt_drop_write() on that file's f_vfsmnt at
> __fput() time.  Since that code never got a write on the mount, we'll
> see an imbalance if the file was opened for a write.  I don't see this
> file's mnt set anywhere, so I'm not completely sure that this is it.  In
> any case, rolling your own 'struct file' without using alloc_file() and
> friends is a no-no.
[...]

This #ifdef'd code in unionfs is actually not enabled.  I left it there
as a reminder of possible future things to come (esp. if nameidata gets
split).  There's a related comment earlier in
fs/unionfs/lookup.c:init_lower_nd() that says:

#ifdef ALLOC_LOWER_ND_FILE
	/*
	 * XXX: one day we may need to have the lower return an open file
	 * for us.  It is not needed in 2.6.23-rc1 for nfs2/nfs3, but may
	 * very well be needed for nfs4.
	 */
	struct file *file;
#endif /* ALLOC_LOWER_ND_FILE */

In the interest of keeping unionfs as simple as I can, when I implemented
the whole "pass a lower nd" stuff, I left those bits of semi-experimental
#ifdef code for this lower file upon open-intent.  It's not enabled and
up until now, it didn't seem to be needed.  Do you think unionfs has to
start using this nd->intent.open.file stuff?

Thanks,
Erez.
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, Hugh Dickins writes: [Dave, I've Cc'ed you re handle_write_count_underflow, see below.] On Wed, 31 Oct 2007, Erez Zadok wrote: Hi Hugh, I've addressed all of your concerns and am happy to report that the newly revised unionfs_writepage works even better, including under my memory-pressure conditions. To summarize my changes since the last time: - I'm only masking __GFP_FS, not __GFP_IO - using find_or_create_page to avoid locking issues around mapping mask - handle for_reclaim case more efficiently - using copy_highpage so we handle KM_USER* - un/locking upper/lower page as/when needed - updated comments to clarify what/why - unionfs_sync_page: gone (yes, vfs.txt did confuse me, plus ecryptfs used to have it) Below is the newest version of unionfs_writepage. Let me know what you think. I have to say that with these changes, unionfs appears visibly faster under memory pressure. I suspect the for_reclaim handling is probably the largest contributor to this speedup. That's good news, and that unionfs_writepage looks good to me - with three reservations I've not observed before. One, I think you would be safer to do a set_page_dirty(lower_page) before your clear_page_dirty_for_io(lower_page). I know that sounds silly, but see Linus' "Yes, Virginia" comment in clear_page_dirty_for_io: there's a lot of subtlety hereabouts, and I think you'd be mimicking the usual path closer if you set_page_dirty first - there's nothing else doing it on that lower_page, is there? I'm not certain that you need to, but I think you'd do well to look into it and make up your own mind. Hugh, my code looks like: if (wbc->for_reclaim) { set_page_dirty(lower_page); unlock_page(lower_page); goto out_release; } BUG_ON(!lower_mapping->a_ops->writepage); clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */ err = lower_mapping->a_ops->writepage(lower_page, wbc); Do you mean I should set_page_dirty(lower_page) unconditionally before clear_page_dirty_for_io? 
(I already do that in the 'if' statement above it.) Two, I'm unsure of the way you're clearing or setting PageUptodate on the upper page there. The rules for PageUptodate are fairly obvious when reading, but when a write fails, it's not so obvious. Again, I'm not saying what you've got is wrong (it may be unavoidable, to keep synch between lower and upper), but it deserves a second thought. I looked at all mainline filesystems' ->writepage to see what, if any, they do with their page's uptodate flag. Most f/s don't touch the flag one way or another. cifs_writepage sets the uptodate flag unconditionally: why? ecryptfs_writepage has a legit reason: if encrypting the page failed, it doesn't want anyone to use it, so it clears its page's uptodate flag (else it sets it as uptodate). hostfs_writepage clears PageUptodate if it failed to write_file(), which I'm not sure makes sense or not. ntfs_writepage goes as far as doing BUG_ON(!PageUptodate(page)) which indicates to me that the page passed to ->writepage should always be uptodate. Is that a fair statement? smb_writepage pretty much unconditionally calls SetPageUptodate(page). Why? Is there a reason smbfs and cifs both do this unconditionally? If so, then why is ntfs calling BUG_ON if the page isn't uptodate? Either that BUG_ON in ntfs is redundant, or cifs/smbfs's SetPageUptodate is redundant, but they can't both be right. And finally, unionfs clears the uptodate flag on error from the lower ->writepage, and otherwise sets the flag on success from the lower ->writepage. My gut feeling is that unionfs shouldn't change the page uptodate flag at all: if the VFS passes unionfs_writepage a page which isn't uptodate, then the VFS has a serious problem b/c it'd be asking a f/s to write out a page which isn't up-to-date, right? Otherwise, whether unionfs_writepage manages to write the lower page or not, why should that invalidate the state of the unionfs page itself? 
Come to think of it, I think clearing PageUptodate on error from ->writepage(lower_page) may be bad. Imagine if after such a failed unionfs_writepage, I get a unionfs_readpage: that ->readpage will get data from the lower f/s page and copy it *over* the unionfs page, even if the upper page's data was more recent prior to the failed call to unionfs_writepage. IOW, we could be reverting a user-visible mmap'ed page to a previous on-disk version. What do you think: could this happen? Anyway, I'll run some exhaustive testing next and see what happens if I don't set/clear the uptodate flag in unionfs_writepage. Three, I believe you need to add a flush_dcache_page(lower_page) after the copy_highpage(lower_page): some architectures will need that to see the new data if they have lower_page mapped (though I expect it's anyway shaky ground to be accessing through the lower mount at the same time as modifying through the upper).
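The data-reversion worry above can be sketched as a toy model (illustrative names and an int dirty payload, not the kernel API): if a failed lower ->writepage clears the upper page's uptodate flag, a later ->readpage refills the upper page from the stale lower copy, silently reverting newer mmap'ed data.

```c
#include <assert.h>

/* Toy page: just an uptodate flag and a data payload. */
struct toy_page { int uptodate; int data; };

/* The questionable ClearPageUptodate(page) on lower-writepage failure. */
static void failed_writepage(struct toy_page *upper)
{
    upper->uptodate = 0;
}

/* The VFS only (re)reads pages that are not uptodate. */
static void readpage(struct toy_page *upper, int lower_data)
{
    if (!upper->uptodate) {
        upper->data = lower_data;   /* older on-disk version wins */
        upper->uptodate = 1;
    }
}

/* Upper page holds newer data (42); the write of it fails; a reread
 * pulls in the stale lower copy (7).  Returns the final upper data. */
static int revert_scenario(void)
{
    struct toy_page upper = { 1, 42 };
    failed_writepage(&upper);
    readpage(&upper, 7);
    return upper.data;
}
```

The scenario ends with the stale value, which is exactly the user-visible reversion Erez describes.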
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 5 Nov 2007, Dave Hansen wrote: > > Actually, I think your s/while/if/ change is probably a decent fix. Any resemblance to a decent fix is purely coincidental. > Barring any other races, that loop should always have made progress on > mnt->__mnt_writers the way it is written. If we get to: > > > lock_and_coalesce_cpu_mnt_writer_counts(); > ->HERE > > mnt_unlock_cpus(); > > and don't have a positive mnt->__mnt_writers, we know something is going > badly. We WARN_ON() there, which should at least give an earlier > warning that the system is not doing well. But it doesn't fix the > inevitable. Could you try the attached patch and see if it at least > warns you earlier? Thanks, Dave, yes, that gives me a nice warning: leak detected on mount(c25ebd80) writers count: -65537 WARNING: at fs/namespace.c:249 handle_write_count_underflow() [<c0103486>] show_trace_log_lvl+0x1b/0x2e [<c01034b6>] show_trace+0x16/0x1b [<c0103589>] dump_stack+0x19/0x1e [<c0171906>] handle_write_count_underflow+0x4c/0x60 [<c0171983>] mnt_drop_write+0x69/0x8e [<c0160211>] __fput+0xff/0x162 [<c016010d>] fput+0x2e/0x33 [<c01b8f63>] unionfs_file_release+0xc2/0x1c5 [<c01601a1>] __fput+0x8f/0x162 [<c016010d>] fput+0x2e/0x33 [<c015ec9d>] filp_close+0x50/0x5d [<c015ed1e>] sys_close+0x74/0xb4 [<c01026ce>] sysenter_past_esp+0x5f/0x85 and the test then goes quietly on its way instead of hanging. Though I imagine, with your patch or mine, that it's then making an unfortunate frequency of calls to lock_and_coalesce_longer_name_than_I_care_to_type thereafter. But it's hardly your responsibility to optimize for bugs elsewhere. The 2.6.23-mm1 tree has MNT_USER at 0x200, so I adjusted your flag to #define MNT_IMBALANCED_WRITE_COUNT 0x400 /* just for debugging */ > > I have a decent guess what the bug is, too. In the unionfs code: I'll let Erez take it from there... Hugh
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 2007-11-05 at 15:40 +0000, Hugh Dickins wrote: > The second problem was a hang: all cpus in > handle_write_count_underflow > doing lock_and_coalesce_cpu_mnt_writer_counts: new -mm stuff from Dave > Hansen. At first I thought that was a locking problem in Dave's code, > but I now suspect it's that your unionfs reference counting is wrong > somewhere, and the error accumulates until __mnt_writers drops below > MNT_WRITER_UNDERFLOW_LIMIT, but the coalescence does nothing to help > and we're stuck in that loop. I've never actually seen this happen in practice, but I do know exactly what you're talking about. > but I hope Dave can > also make handle_write_count_underflow more robust, it's unfortunate > if refcount errors elsewhere first show up as a hang there. Actually, I think your s/while/if/ change is probably a decent fix. Barring any other races, that loop should always have made progress on mnt->__mnt_writers the way it is written. If we get to: > lock_and_coalesce_cpu_mnt_writer_counts(); ->HERE > mnt_unlock_cpus(); and don't have a positive mnt->__mnt_writers, we know something is going badly. We WARN_ON() there, which should at least give an earlier warning that the system is not doing well. But it doesn't fix the inevitable. Could you try the attached patch and see if it at least warns you earlier? I have a decent guess what the bug is, too. In the unionfs code: > int init_lower_nd(struct nameidata *nd, unsigned int flags) > { > ... > #ifdef ALLOC_LOWER_ND_FILE > file = kzalloc(sizeof(struct file), GFP_KERNEL); > if (unlikely(!file)) { > err = -ENOMEM; > break; /* exit switch statement and thus return */ > } > nd->intent.open.file = file; > #endif /* ALLOC_LOWER_ND_FILE */ The r/o bind mount code will mnt_drop_write() on that file's f_vfsmnt at __fput() time. Since that code never got a write on the mount, we'll see an imbalance if the file was opened for a write. I don't see this file's mnt set anywhere, so I'm not completely sure that this is it. 
In any case, rolling your own 'struct file' without using alloc_file() and friends is a no-no. BTW, I have some "debugging" code in my latest set of patches that I think should fix this kind of imbalance with the mnt->__mnt_writers(). It ensures that before we do that mnt_drop_write() at __fput() that we absolutely did a mnt_want_write() at some point in the 'struct file's life. -- Dave linux-2.6.git-dave/fs/namespace.c| 31 ++- linux-2.6.git-dave/include/linux/mount.h |1 + 2 files changed, 23 insertions(+), 9 deletions(-) diff -puN fs/namei.c~fix-naughty-loop fs/namei.c diff -puN fs/namespace.c~fix-naughty-loop fs/namespace.c --- linux-2.6.git/fs/namespace.c~fix-naughty-loop 2007-11-05 08:03:59.0 -0800 +++ linux-2.6.git-dave/fs/namespace.c 2007-11-05 08:35:06.0 -0800 @@ -225,16 +225,29 @@ static void lock_and_coalesce_cpu_mnt_wr */ static void handle_write_count_underflow(struct vfsmount *mnt) { - while (atomic_read(&mnt->__mnt_writers) < - MNT_WRITER_UNDERFLOW_LIMIT) { - /* -* It isn't necessary to hold all of the locks -* at the same time, but doing it this way makes -* us share a lot more code. -*/ - lock_and_coalesce_cpu_mnt_writer_counts(); - mnt_unlock_cpus(); + if (atomic_read(&mnt->__mnt_writers) >= + MNT_WRITER_UNDERFLOW_LIMIT) + return; + /* +* It isn't necessary to hold all of the locks +* at the same time, but doing it this way makes +* us share a lot more code. +*/ + lock_and_coalesce_cpu_mnt_writer_counts(); + /* +* If coalescing the per-cpu writer counts did not +* get us back to a positive writer count, we have +* a bug. 
+*/ + if ((atomic_read(&mnt->__mnt_writers) < 0) && + !(mnt->mnt_flags & MNT_IMBALANCED_WRITE_COUNT)) { + printk("leak detected on mount(%p) writers count: %d\n", + mnt, atomic_read(&mnt->__mnt_writers)); + WARN_ON(1); + /* use the flag to keep the dmesg spam down */ + mnt->mnt_flags |= MNT_IMBALANCED_WRITE_COUNT; } + mnt_unlock_cpus(); } /** diff -puN include/linux/mount.h~fix-naughty-loop include/linux/mount.h --- linux-2.6.git/include/linux/mount.h~fix-naughty-loop2007-11-05 08:22:21.0 -0800 +++ linux-2.6.git-dave/include/linux/mount.h2007-11-05 08:28:20.0 -0800 @@ -32,6 +32,7 @@ struct mnt_namespace; #define MNT_READONLY 0x40/* does the user want this to be r/o? */ #define MNT_SHRINKABLE 0x100 +#define MNT_IMBALANCED_WRITE_COUNT 0x200 /* just for debugging */ #define MNT_SHARED 0x1000
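Why the s/while/if/ matters can be modeled in userspace (a sketch with hypothetical names; plain ints stand in for the atomic counter and the per-cpu deltas): coalescing folds the per-cpu writer deltas into the main count once, and if the count is still negative afterwards there is a genuine leak that re-running the same coalesce step can never repair — so warn once and return instead of spinning.

```c
#include <assert.h>

#define TOY_UNDERFLOW_LIMIT (-65536)

/* Fold per-cpu deltas into the main count, zeroing them (one pass). */
static int coalesce(int count, int *percpu, int n)
{
    for (int i = 0; i < n; i++) {
        count += percpu[i];
        percpu[i] = 0;
    }
    return count;
}

/* The if-variant: one coalesce pass; returns 1 if a leak remains
 * (where the kernel patch would printk and WARN_ON once). */
static int check_underflow(int count, int *percpu, int n)
{
    if (count >= TOY_UNDERFLOW_LIMIT)
        return 0;                       /* nothing to do */
    return coalesce(count, percpu, n) < 0;
}
```

With balanced per-cpu deltas the pass repairs the count; with a real leak it cannot, and a while-loop around the same pass would never exit.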
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
[Dave, I've Cc'ed you re handle_write_count_underflow, see below.] On Wed, 31 Oct 2007, Erez Zadok wrote: > > Hi Hugh, I've addressed all of your concerns and am happy to report that the > newly revised unionfs_writepage works even better, including under my > memory-pressure conditions. To summarize my changes since the last time: > > - I'm only masking __GFP_FS, not __GFP_IO > - using find_or_create_page to avoid locking issues around mapping mask > - handle for_reclaim case more efficiently > - using copy_highpage so we handle KM_USER* > - un/locking upper/lower page as/when needed > - updated comments to clarify what/why > - unionfs_sync_page: gone (yes, vfs.txt did confuse me, plus ecryptfs used > to have it) > > Below is the newest version of unionfs_writepage. Let me know what you > think. > > I have to say that with these changes, unionfs appears visibly faster under > memory pressure. I suspect the for_reclaim handling is probably the largest > contributor to this speedup. That's good news, and that unionfs_writepage looks good to me - with three reservations I've not observed before. One, I think you would be safer to do a set_page_dirty(lower_page) before your clear_page_dirty_for_io(lower_page). I know that sounds silly, but see Linus' "Yes, Virginia" comment in clear_page_dirty_for_io: there's a lot of subtlety hereabouts, and I think you'd be mimicking the usual path closer if you set_page_dirty first - there's nothing else doing it on that lower_page, is there? I'm not certain that you need to, but I think you'd do well to look into it and make up your own mind. Two, I'm unsure of the way you're clearing or setting PageUptodate on the upper page there. The rules for PageUptodate are fairly obvious when reading, but when a write fails, it's not so obvious. Again, I'm not saying what you've got is wrong (it may be unavoidable, to keep synch between lower and upper), but it deserves a second thought. 
Three, I believe you need to add a flush_dcache_page(lower_page) after the copy_highpage(lower_page): some architectures will need that to see the new data if they have lower_page mapped (though I expect it's anyway shaky ground to be accessing through the lower mount at the same time as modifying through the upper). I've been trying this out on 2.6.23-mm1 with your 21 Oct 1-9/9 and your 2 Nov 1-8/8 patches applied (rejects being patches which were already in 2.6.23-mm1). I was hoping to reproduce the BUG_ON(entry->val) that I fear from shmem_writepage(), before fixing it; but not seen that at all yet - that might be good news, but it's more likely I just haven't tried hard enough yet. For now I'm doing repeated make -j20 kernel builds, pushing into swap, in a unionfs mount of just a single dir on tmpfs. This has shown up several problems, two of which I've had to hack around to get further. The first: I very quickly hit "BUG: atomic counter underflow" from -mm's i386 atomic_dec_and_test: from filp_close calling unionfs_flush. I did a little test fork()ing while holding a file open on unionfs, and indeed it appears that your totalopens code is broken, being unaware of how fork() bumps up a file count without an open. That's rather basic, I'm puzzled that this has remained undiscovered until now - or perhaps it's just a recent addition. It looked to me as if the totalopens count was about avoiding some d_deleted processing in unionfs_flush, which actually should be left until unionfs_release (and that your unionfs_flush ought to be calling the lower level flush in all cases). To get going, I've been running with the quick hack patch below: but I've spent very little time thinking about it, plus it's a long time since I've done VFS stuff; so that patch may be nothing but an embarrassment that reflects neither your intentions nor the VFS rules! And it may itself be responsible for the further problems I've seen. 
The second problem was a hang: all cpus in handle_write_count_underflow doing lock_and_coalesce_cpu_mnt_writer_counts: new -mm stuff from Dave Hansen. At first I thought that was a locking problem in Dave's code, but I now suspect it's that your unionfs reference counting is wrong somewhere, and the error accumulates until __mnt_writers drops below MNT_WRITER_UNDERFLOW_LIMIT, but the coalescence does nothing to help and we're stuck in that loop. My even greater hack to solve that one was to change Dave's "while" to "if"! Then indeed tests can run for some while. As I say, my suspicion is that the actual error is within unionfs (perhaps introduced by my first hack); but I hope Dave can also make handle_write_count_underflow more robust, it's unfortunate if refcount errors elsewhere first show up as a hang there. I've had CONFIG_UNION_FS_DEBUG=y but will probably turn it off when I come back to this, since it's rather noisy at present. I've not checked whether its reports are peculiar to having tmpfs below or not. I get lots of "unionfs: new lower inode mtime"
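Hugh's first reservation — set_page_dirty before clear_page_dirty_for_io — comes down to a hand-over of the dirty bit, which a toy model makes concrete (illustrative toy_* names, not the kernel API): the normal writeback path operates on a page something already dirtied, and clear_page_dirty_for_io transfers that dirtiness to the I/O about to start. Since nothing else dirties the freshly copied lower_page, unionfs mimics the usual path only if it sets the page dirty itself first.

```c
#include <assert.h>

/* Toy page with just a dirty bit. */
struct toy_page { int dirty; };

static void toy_set_page_dirty(struct toy_page *p) { p->dirty = 1; }

/* Returns the old dirty state; the caller should submit the write
 * only if this returns 1 (the real helper has the same contract). */
static int toy_clear_page_dirty_for_io(struct toy_page *p)
{
    int was_dirty = p->dirty;
    p->dirty = 0;
    return was_dirty;
}

/* The ordering Hugh suggests for the freshly copied lower page:
 * dirty it, then hand the dirtiness to the I/O path.  Returns 1
 * when the page is correctly seen as needing the write. */
static int prepare_lower_page(struct toy_page *lower)
{
    toy_set_page_dirty(lower);
    return toy_clear_page_dirty_for_io(lower);
}

/* Run the sequence on a clean page and check both halves of the
 * contract: the write is wanted, and the page ends up clean. */
static int ordering_scenario(void)
{
    struct toy_page lower = { 0 };
    int write_wanted = prepare_lower_page(&lower);
    return write_wanted == 1 && lower.dirty == 0;
}
```

Skipping the set_page_dirty step would make the clear return 0, i.e. the usual path's "only write what was dirty" bookkeeping would not see the copy as needing writeback.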
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Hugh, I've addressed all of your concerns and am happy to report that the newly revised unionfs_writepage works even better, including under my memory-pressure conditions. To summarize my changes since the last time: - I'm only masking __GFP_FS, not __GFP_IO - using find_or_create_page to avoid locking issues around mapping mask - handle for_reclaim case more efficiently - using copy_highpage so we handle KM_USER* - un/locking upper/lower page as/when needed - updated comments to clarify what/why - unionfs_sync_page: gone (yes, vfs.txt did confuse me, plus ecryptfs used to have it) Below is the newest version of unionfs_writepage. Let me know what you think. I have to say that with these changes, unionfs appears visibly faster under memory pressure. I suspect the for_reclaim handling is probably the largest contributor to this speedup. Many thanks, Erez. // static int unionfs_writepage(struct page *page, struct writeback_control *wbc) { int err = -EIO; struct inode *inode; struct inode *lower_inode; struct page *lower_page; struct address_space *lower_mapping; /* lower inode mapping */ gfp_t mask; inode = page->mapping->host; lower_inode = unionfs_lower_inode(inode); lower_mapping = lower_inode->i_mapping; /* * find lower page (returns a locked page) * * We turn off __GFP_FS while we look for or create a new lower * page. This prevents a recursion into the file system code, which * under memory pressure conditions could lead to a deadlock. This * is similar to how the loop driver behaves (see loop_set_fd in * drivers/block/loop.c). If we can't find the lower page, we * redirty our page and return "success" so that the VM will call us * again in the (hopefully near) future. 
*/ mask = mapping_gfp_mask(lower_mapping) & ~(__GFP_FS); lower_page = find_or_create_page(lower_mapping, page->index, mask); if (!lower_page) { err = 0; set_page_dirty(page); goto out; } /* copy page data from our upper page to the lower page */ copy_highpage(lower_page, page); /* * Call lower writepage (expects locked page). However, if we are * called with wbc->for_reclaim, then the VFS/VM just wants to * reclaim our page. Therefore, we don't need to call the lower * ->writepage: just copy our data to the lower page (already done * above), then mark the lower page dirty and unlock it, and return * success. */ if (wbc->for_reclaim) { set_page_dirty(lower_page); unlock_page(lower_page); goto out_release; } BUG_ON(!lower_mapping->a_ops->writepage); clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */ err = lower_mapping->a_ops->writepage(lower_page, wbc); if (err < 0) { ClearPageUptodate(page); goto out_release; } /* * Lower file systems such as ramfs and tmpfs, may return * AOP_WRITEPAGE_ACTIVATE so that the VM won't try to (pointlessly) * write the page again for a while. But those lower file systems * also set the page dirty bit back again. Since we successfully * copied our page data to the lower page, then the VM will come * back to the lower page (directly) and try to flush it. So we can * save the VM the hassle of coming back to our page and trying to * flush too. Therefore, we don't re-dirty our own page, and we * never return AOP_WRITEPAGE_ACTIVATE back to the VM (we consider * this a success). * * We also unlock the lower page if the lower ->writepage returned * AOP_WRITEPAGE_ACTIVATE. (This "anomalous" behaviour may be * addressed in future shmem/VM code.) 
*/ if (err == AOP_WRITEPAGE_ACTIVATE) { err = 0; unlock_page(lower_page); } /* all is well */ SetPageUptodate(page); /* lower mtimes have changed: update ours */ unionfs_copy_attr_times(inode); out_release: /* b/c find_or_create_page increased refcnt */ page_cache_release(lower_page); out: /* * We unlock our page unconditionally, because we never return * AOP_WRITEPAGE_ACTIVATE. */ unlock_page(page); return err; } //
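The thread's original symptom was AOP_WRITEPAGE_ACTIVATE leaking out of msync(2) to userland; the unionfs_writepage above sidesteps it by folding that status into plain success once the data is safely copied to the (re-dirtied) lower page. A minimal sketch of just that mapping (the constant's value mirrors the kernel's definition of the era; the function name is illustrative):

```c
#include <assert.h>

/* ->writepage may return 0, a negative errno, or this special value
 * asking the VM to keep the page active rather than retry it soon. */
#define AOP_WRITEPAGE_ACTIVATE 0x80000

/* Never let ACTIVATE propagate upward: the lower fs has re-dirtied
 * its own page, so from the upper layer's view the write succeeded. */
static int fold_writepage_status(int err)
{
    if (err == AOP_WRITEPAGE_ACTIVATE)
        return 0;
    return err;     /* 0 or -errno passes through unchanged */
}
```

This is the invariant the comment block in unionfs_writepage states: callers of the upper ->writepage, including the msync path, only ever see 0 or a negative errno.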
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Sun, 28 Oct 2007, Erez Zadok wrote:
>
> I took your advice regarding ~(__GFP_FS|__GFP_IO), AOP_WRITEPAGE_ACTIVATE,
> and such.  I revised my unionfs_writepage and unionfs_sync_page, and tested
> it under memory pressure: I have a couple of live CDs that use tmpfs and can
> deterministically reproduce the conditions resulting in A_W_A.  I also went
> back to using grab_cache_page but with the gfp_mask suggestions you made.
>
> I'm happy to report that it all works great now!

That's very encouraging...

> Below is the entirety of
> the new unionfs_mmap and unionfs_sync_page code.  I'd appreciate if you and
> others can look it over and see if you find any problems.

... but still a few problems, I'm afraid.  The greatest problem is a
tmpfs one, that would be for me to solve.  But first...

> static int unionfs_writepage(struct page *page, struct writeback_control *wbc)
> {
> 	int err = -EIO;
> 	struct inode *inode;
> 	struct inode *lower_inode;
> 	struct page *lower_page;
> 	char *kaddr, *lower_kaddr;
> 	struct address_space *mapping; /* lower inode mapping */
> 	gfp_t old_gfp_mask;
>
> 	inode = page->mapping->host;
> 	lower_inode = unionfs_lower_inode(inode);
> 	mapping = lower_inode->i_mapping;
>
> 	/*
> 	 * find lower page (returns a locked page)
> 	 *
> 	 * We turn off __GFP_IO|__GFP_FS so as to prevent a deadlock under

On reflection, I think I went too far in asking you to mask off
__GFP_IO.  Loop has to do so because it's a block device, down towards
the IO layer; but unionfs is a filesystem, so masking off __GFP_FS is
enough to prevent recursion into the FS layer with danger of deadlock,
and leaving __GFP_IO on gives a better chance of success.

> 	 * memory pressure conditions.  This is similar to how the loop
> 	 * driver behaves (see loop_set_fd in drivers/block/loop.c).
> 	 * If we can't find the lower page, we redirty our page and return
> 	 * "success" so that the VM will call us again in the (hopefully
> 	 * near) future.
> 	 */
> 	old_gfp_mask = mapping_gfp_mask(mapping);
> 	mapping_set_gfp_mask(mapping, old_gfp_mask & ~(__GFP_IO|__GFP_FS));
>
> 	lower_page = grab_cache_page(mapping, page->index);
> 	mapping_set_gfp_mask(mapping, old_gfp_mask);

Hmm, several points on that.

When suggesting something like this, I did remark "what locking
needed?".  You've got none: which is problematic if two stacked mounts
are playing with the same underlying file concurrently (yes, userspace
would have a data coherency problem in such a case, but the kernel
still needs to worry about its own internal integrity) - you'd be in
danger of masking (__GFP_IO|__GFP_FS) permanently off the underlying
file; and furthermore, losing error flags (AS_EIO, AS_ENOSPC) which
share the same unsigned long.  Neither likely but both wrong.  See the
comment on mapping_set_gfp_mask() in include/linux/pagemap.h:
 * This is non-atomic.  Only to be used before the mapping is activated.

Strictly speaking, I guess loop was a little bit guilty even when just
loop_set_fd() did it: the underlying mapping might already be active.
It appears to be just as guilty as you in its do_loop_switch() case
(done at BIO completion time), but that's for a LOOP_CHANGE_FD ioctl
which would only be expected to be called once, during installation;
whereas you're using mapping_set_gfp_mask here with great frequency.

Another point on this is: loop masks __GFP_IO|__GFP_FS off the file for
the whole duration while it is looped, whereas you're flipping it just
in this preliminary section of unionfs_writepage.  I think you're
probably okay to be doing it only here within ->writepage: I think loop
covered every operation because it's at the block device level, perhaps
both reads and writes needed to serve reclaim at the higher FS level;
and also easier to do it once for all.  Are you wrong to be doing it
only around the grab_cache_page, leaving the lower level ->writepage
further down unprotected?

Certainly doing it around the grab_cache_page is likely to be way more
important than around the ->writepage (but rather depends on the
filesystem).  And on reflection, I think that the lower filesystem's
writepage should already be using GFP_NOFS to avoid deadlocks in any of
its allocations when wbc->for_reclaim, so you should be okay just
masking off around the grab_cache_page.

(Actually, in the wbc->for_reclaim case, I think you don't really need
to call the lower level writepage at all.  Just set_page_dirty on the
lower page, unlock it and return.  In due course that memory pressure
which has called unionfs_writepage, will come around to the lower level
page and do writepage upon it.  Whether that's a better strategy or
not, I do not know.)

There's an attractively simple answer to the mapping_set_gfp_mask
locking problem, if we're confident that it's only needed around the
grab_cache_page.  Look at the declaration of grab_cache_page in
linux/pagemap.h: it immediately extracts the gfp_mask from the mapping.
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hugh,

I took your advice regarding ~(__GFP_FS|__GFP_IO), AOP_WRITEPAGE_ACTIVATE,
and such.  I revised my unionfs_writepage and unionfs_sync_page, and tested
it under memory pressure: I have a couple of live CDs that use tmpfs and can
deterministically reproduce the conditions resulting in A_W_A.  I also went
back to using grab_cache_page but with the gfp_mask suggestions you made.

I'm happy to report that it all works great now!  Below is the entirety of
the new unionfs_mmap and unionfs_sync_page code.  I'd appreciate if you and
others can look it over and see if you find any problems.

Thanks,
Erez.

static int unionfs_writepage(struct page *page, struct writeback_control *wbc)
{
	int err = -EIO;
	struct inode *inode;
	struct inode *lower_inode;
	struct page *lower_page;
	char *kaddr, *lower_kaddr;
	struct address_space *mapping; /* lower inode mapping */
	gfp_t old_gfp_mask;

	inode = page->mapping->host;
	lower_inode = unionfs_lower_inode(inode);
	mapping = lower_inode->i_mapping;

	/*
	 * find lower page (returns a locked page)
	 *
	 * We turn off __GFP_IO|__GFP_FS so as to prevent a deadlock under
	 * memory pressure conditions.  This is similar to how the loop
	 * driver behaves (see loop_set_fd in drivers/block/loop.c).
	 * If we can't find the lower page, we redirty our page and return
	 * "success" so that the VM will call us again in the (hopefully
	 * near) future.
	 */
	old_gfp_mask = mapping_gfp_mask(mapping);
	mapping_set_gfp_mask(mapping, old_gfp_mask & ~(__GFP_IO|__GFP_FS));

	lower_page = grab_cache_page(mapping, page->index);

	mapping_set_gfp_mask(mapping, old_gfp_mask);

	if (!lower_page) {
		err = 0;
		set_page_dirty(page);
		goto out;
	}

	/* get page address, and encode it */
	kaddr = kmap(page);
	lower_kaddr = kmap(lower_page);

	memcpy(lower_kaddr, kaddr, PAGE_CACHE_SIZE);

	kunmap(page);
	kunmap(lower_page);

	BUG_ON(!mapping->a_ops->writepage);

	/* call lower writepage (expects locked page) */
	clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */
	err = mapping->a_ops->writepage(lower_page, wbc);

	/* b/c grab_cache_page locked it and ->writepage unlocks on success */
	if (err)
		unlock_page(lower_page);

	/* b/c grab_cache_page increased refcnt */
	page_cache_release(lower_page);

	if (err < 0) {
		ClearPageUptodate(page);
		goto out;
	}

	/*
	 * Lower file systems such as ramfs and tmpfs, may return
	 * AOP_WRITEPAGE_ACTIVATE so that the VM won't try to (pointlessly)
	 * write the page again for a while.  But those lower file systems
	 * also set the page dirty bit back again.  Since we successfully
	 * copied our page data to the lower page, then the VM will come
	 * back to the lower page (directly) and try to flush it.  So we can
	 * save the VM the hassle of coming back to our page and trying to
	 * flush too.  Therefore, we don't re-dirty our own page, and we
	 * don't return AOP_WRITEPAGE_ACTIVATE back to the VM (we consider
	 * this a success).
	 */
	if (err == AOP_WRITEPAGE_ACTIVATE)
		err = 0;

	/* all is well */
	SetPageUptodate(page);
	/* lower mtimes have changed: update ours */
	unionfs_copy_attr_times(inode);

	unlock_page(page);

out:
	return err;
}

static void unionfs_sync_page(struct page *page)
{
	struct inode *inode;
	struct inode *lower_inode;
	struct page *lower_page;
	struct address_space *mapping; /* lower inode mapping */
	gfp_t old_gfp_mask;

	inode = page->mapping->host;
	lower_inode = unionfs_lower_inode(inode);
	mapping = lower_inode->i_mapping;

	/*
	 * Find lower page (returns a locked page).
	 *
	 * We turn off __GFP_IO|__GFP_FS so as to prevent a deadlock under
	 * memory pressure conditions.  This is similar to how the loop
	 * driver behaves (see loop_set_fd in drivers/block/loop.c).
	 * If we can't find the lower page, we redirty our page and return
	 * "success" so that the VM will call us again in the (hopefully
	 * near) future.
	 */
	old_gfp_mask = mapping_gfp_mask(mapping);
	mapping_set_gfp_mask(mapping, old_gfp_mask & ~(__GFP_IO|__GFP_FS));

	lower_page = grab_cache_page(mapping, page->index);

	mapping_set_gfp_mask(mapping, old_gfp_mask);

	if (!lower_page) {
		printk(KERN_ERR "unionfs: grab_cache_page failed\n");
		goto out;
	}

	/* do the actual sync */
	/*
	 * XXX: can we optimize ala RAIF and set the lower page
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
> On Mon, 22 Oct 2007, Erez Zadok wrote:
[...]
> > If you've got suggestions how I can handle unionfs_write more cleanly, or
> > comments on the above possibilities, I'd love to hear them.
>
> For now I think you should pursue the ~(__GFP_FS|__GFP_IO) idea somehow.
>
> Hugh

Hugh, thanks for the great explanations and suggestions (in multiple
emails).  I'm going to test all of those soon.

Erez.
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Fri, 26 Oct 2007, Neil Brown wrote:
> On Thursday October 25, [EMAIL PROTECTED] wrote:
>
> The patch looks like it makes perfect sense to me.

Great, thanks a lot for looking at it, Neil and Pekka.

> Before the change, ->writepage could return AOP_WRITEPAGE_ACTIVATE
> without unlocking the page, and this has precisely the effect of:
>    ClearPageReclaim();  (if the call path was through pageout)
>    SetPageActive();     (if the call was through shrink_page_list)
>    unlock_page();
>
> With the patch, the ->writepage method does the SetPageActive and the
> unlock_page, which on the whole seems cleaner.
>
> We seem to have lost a call to ClearPageReclaim - I don't know if that
> is significant.

It doesn't show up in the diff at all, but pageout() already has

	if (!PageWriteback(page)) {
		/* synchronous write or broken a_ops? */
		ClearPageReclaim(page);
	}

which will clear it since we've never set PageWriteback.  I think no
harm would come from leaving it set there, since it only takes effect
in end_page_writeback (its effect being to let the just written page be
moved to the end of the LRU, so that it will then be soon reclaimed),
and clear_page_dirty_for_io clears it before coming down this way.  But
I'd never argue for that: I hate having leftover flags hanging around
outside the scope of their relevance.

> > Special, hidden, undocumented, secret hack!  Then in 2.6.7 Andrew
> > stole his own secret and used it when concocting ramdisk_writepage.
> > Oh, and NFS made some kind of use of it in 2.6.6 only.  Then Neil
> > revealed the secret to the uninitiated in 2.6.17: now, what's the
> > appropriate punishment for that?
>
> Surely the punishment should be for writing hidden undocumented hacks
> in the first place!  I vote we just make him maintainer for the whole
> kernel - that will keep him so busy that he will never have a chance
> to do it again :-)

That is a splendid retort, which has won you absolution.  But it makes
me a little sad: that smiley should be a weepy.

> > --- 2.6.24-rc1/Documentation/filesystems/vfs.txt	2007-10-24 07:15:11.000000000 +0100
> > +++ linux/Documentation/filesystems/vfs.txt	2007-10-24 08:42:07.000000000 +0100
> > @@ -567,9 +567,7 @@ struct address_space_operations {
> >     If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
> >     try too hard if there are problems, and may choose to write out
> >     other pages from the mapping if that is easier (e.g. due to
> > -   internal dependencies). If it chooses not to start writeout, it
> > -   should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
> > -   calling ->writepage on that page.
> > +   internal dependencies).
>
> It seems that the new requirement is that if the address_space
> chooses not to write out the page, it should now call SetPageActive().
> If that is the case, I think it should be explicit in the
> documentation - please?

No, it's not the case; but you're right that I should add something
there, to put an end to the idea.  It'll be something along the lines of
"You may notice shmem setting PageActive there, but please don't do
that; or if you insist, be sure never to do so in the !wbc->for_reclaim
case".

The PageActive thing is for when a filesystem regrets that it even had
a ->writepage (it replicates the behaviour of the writepage == NULL
case or the VM_LOCKED SWAP_FAIL case or the !add_to_swap case, delaying
the return of this page to writepage for as long as it can).  It's done
in shmem_writepage because shm_lock (equivalent to VM_LOCKED) is only
discovered within that writepage, and no-swap is discovered there too.
ramdisk does it too: I've not tried to understand ramdisk as Nick and
Eric have, but it used to have no writepage, and would prefer to have
no writepage, but appears to need one for some PageUptodate reasons.

It's fairly normal for a filesystem to find that for some reason it
cannot carry out a writepage on this page right now (in the reclaim
case: the sync case demands action, IIRC); so it then simply does
set_page_dirty and unlock_page and returns "success".

I'll try to condense this down for the Doc when finalizing the patch;
which I've still not yet tested properly - thanks for the eyes, but I
can't submit it until I've checked in detail that it really gets to do
what we think it does.

Hugh
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi,

On 10/26/07, Neil Brown <[EMAIL PROTECTED]> wrote:
> It seems that the new requirement is that if the address_space
> chooses not to write out the page, it should now call SetPageActive().
> If that is the case, I think it should be explicit in the
> documentation - please?

Agreed.
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Hugh,

On 10/25/07, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> @@ -349,10 +349,6 @@ static pageout_t pageout(struct page *pa
> 		res = mapping->a_ops->writepage(page, &wbc);
> 		if (res < 0)
> 			handle_write_error(mapping, page, res);
> -		if (res == AOP_WRITEPAGE_ACTIVATE) {
> -			ClearPageReclaim(page);
> -			return PAGE_ACTIVATE;
> -		}

I don't see ClearPageReclaim added anywhere.  Is that done on purpose?

Other than that, the patch looks good to me and I think we should stick
it into -mm to punish Andrew for his secret hack ;-).

Pekka
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi, On 10/26/07, Neil Brown [EMAIL PROTECTED] wrote: It seems that the new requirement is that if the address_space chooses not to write out the page, it should now call SetPageActive(). If that is the case, I think it should be explicit in the documentation - please? Agreed. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Hugh, On 10/25/07, Hugh Dickins [EMAIL PROTECTED] wrote: @@ -349,10 +349,6 @@ static pageout_t pageout(struct page *pa res = mapping-a_ops-writepage(page, wbc); if (res 0) handle_write_error(mapping, page, res); - if (res == AOP_WRITEPAGE_ACTIVATE) { - ClearPageReclaim(page); - return PAGE_ACTIVATE; - } I don't see ClearPageReclaim added anywhere. Is that done on purpose? Other than that, the patch looks good to me and I think we should stick it into -mm to punish Andrew for his secret hack ;-). Pekka - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Fri, 26 Oct 2007, Neil Brown wrote: On Thursday October 25, [EMAIL PROTECTED] wrote: The patch looks like it makes perfect sense to me. Great, thanks a lot for looking at it, Neil and Pekka. Before the change, -writepage could return AOP_WRITEPAGE_ACTIVATE without unlocking the page, and this has precisely the effect of: ClearPageReclaim(); (if the call path was through pageout) SetPageActive(); (if the call was through shrink_page_list) unlock_page(); With the patch, the -writepage method does the SetPageActive and the unlock_page, which on the whole seems cleaner. We seem to have lost a call to ClearPageReclaim - I don't know if that is significant. It doesn't show up in the diff at all, but pageout() already has if (!PageWriteback(page)) { /* synchronous write or broken a_ops? */ ClearPageReclaim(page); } which will clear it since we've never set PageWriteback. I think no harm would come from leaving it set there, since it only takes effect in end_page_writeback (its effect being to let the just written page be moved to the end of the LRU, so that it will then be soon reclaimed), and clear_page_dirty_for_io clears it before coming down this way. But I'd never argue for that: I hate having leftover flags hanging around outside the scope of their relevance. Special, hidden, undocumented, secret hack! Then in 2.6.7 Andrew stole his own secret and used it when concocting ramdisk_writepage. Oh, and NFS made some kind of use of it in 2.6.6 only. Then Neil revealed the secret to the uninitiated in 2.6.17: now, what's the appropriate punishment for that? Surely the punishment should be for writing hidden undocumented hacks in the first place! I vote we just make him maintainer for the whole kernel - that will keep him so busy that he will never have a chance to do it again :-) That is a splendid retort, which has won you absolution. But it makes me a little sad: that smiley should be a weepy. 
--- 2.6.24-rc1/Documentation/filesystems/vfs.txt2007-10-24 07:15:11.0 +0100 +++ linux/Documentation/filesystems/vfs.txt 2007-10-24 08:42:07.0 +0100 @@ -567,9 +567,7 @@ struct address_space_operations { If wbc-sync_mode is WB_SYNC_NONE, -writepage doesn't have to try too hard if there are problems, and may choose to write out other pages from the mapping if that is easier (e.g. due to - internal dependencies). If it chooses not to start writeout, it - should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep - calling -writepage on that page. + internal dependencies). It seems that the new requirement is that if the address_space chooses not to write out the page, it should now call SetPageActive(). If that is the case, I think it should be explicit in the documentation - please? No, it's not the case; but you're right that I should add something there, to put an end to the idea. It'll be something along the lines of You may notice shmem setting PageActive there, but please don't do that; or if you insist, be sure never to do so in the !wbc-for_reclaim case. The PageActive thing is for when a filesystem regrets that it even had a -writepage (it replicates the behaviour of the writepage == NULL case or the VM_LOCKED SWAP_FAIL case or the !add_to_swap case, delaying the return of this page to writepage for as long as it can). It's done in shmem_writepage because shm_lock (equivalent to VM_LOCKED) is only discovered within that writepage, and no-swap is discovered there too. ramdisk does it too: I've not tried to understand ramdisk as Nick and Eric have, but it used to have no writepage, and would prefer to have no writepage, but appears to need one for some PageUptodate reasons. It's fairly normal for a filesystem to find that for some reason it cannot carry out a writepage on this page right now (in the reclaim case: the sync case demands action, IIRC); so it then simply does set_page_dirty and unlock_page and returns success. 
I'll try to condense this down for the Doc when finalizing the patch; which I've still not yet tested properly - thanks for the eyes, but I can't submit it until I've checked in detail that it really gets to do what we think it does.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Thursday October 25, [EMAIL PROTECTED] wrote:
> On Mon, 22 Oct 2007, Pekka Enberg wrote:
> > On 10/22/07, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> > > Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE.
> > > Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it
> > > if !wbc->for_reclaim. I contend that shmem shouldn't either: it's
> > > a special code to get the LRU rotation right, not useful elsewhere.
> > > Though Documentation/filesystems/vfs.txt does imply wider use.
> > >
> > > I think this is where people use the phrase "go figure" ;)
> >
> > Heh. As far as I can tell, the implication of "wider use" was added by
> > Neil in commit "341546f5ad6fce584531f744853a5807a140f2a9 Update some
> > VFS documentation", so perhaps he might know? Neil?

I just read the code, tried to understand it, translated that understanding into English, and put that in vfs.txt. It is very possible that what I wrote didn't match the intention of the author, but it seemed to match the behaviour of the code.

The patch looks like it makes perfect sense to me.

Before the change, ->writepage could return AOP_WRITEPAGE_ACTIVATE without unlocking the page, and this has precisely the effect of:

   ClearPageReclaim();  (if the call path was through pageout)
   SetPageActive();     (if the call was through shrink_page_list)
   unlock_page();

With the patch, the ->writepage method does the SetPageActive and the unlock_page, which on the whole seems cleaner.

We seem to have lost a call to ClearPageReclaim - I don't know if that is significant.

> Special, hidden, undocumented, secret hack! Then in 2.6.7 Andrew
> stole his own secret and used it when concocting ramdisk_writepage.
> Oh, and NFS made some kind of use of it in 2.6.6 only. Then Neil
> revealed the secret to the uninitiated in 2.6.17: now, what's the
> appropriate punishment for that?

Surely the punishment should be for writing hidden undocumented hacks in the first place!
I vote we just make him maintainer for the whole kernel - that will keep him so busy that he will never have a chance to do it again :-)

> --- 2.6.24-rc1/Documentation/filesystems/vfs.txt	2007-10-24 07:15:11.0 +0100
> +++ linux/Documentation/filesystems/vfs.txt	2007-10-24 08:42:07.0 +0100
> @@ -567,9 +567,7 @@ struct address_space_operations {
>        If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
>        try too hard if there are problems, and may choose to write out
>        other pages from the mapping if that is easier (e.g. due to
> -      internal dependencies). If it chooses not to start writeout, it
> -      should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
> -      calling ->writepage on that page.
> +      internal dependencies).

It seems that the new requirement is that if the address_space chooses not to write out the page, it should now call SetPageActive(). If that is the case, I think it should be explicit in the documentation - please?

NeilBrown
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Thu, 25 Oct 2007, Erez Zadok wrote:
>
> On a related note, I would just love to get rid of calling the lower
> ->writepage in unionfs b/c I can't even tell if I have a lower page to use
> all the time. I'd prefer to call vfs_write() if I can, but I'll need a
> struct file, or at least a dentry.

Why do you want to do that? You gave a good reason why it's easier for ecryptfs, but I doubt it's robust. The higher the level you choose to use, the harder to guarantee it won't deadlock. Or that's my gut feeling anyway.

It's several years since I've thought about such issues: just because I came into this knowing about shmem_writepage, is perhaps not a good reason to choose me as advisor!

Hugh
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 22 Oct 2007, Erez Zadok wrote:
>
> What's the precise semantics of AOP_WRITEPAGE_ACTIVATE?

Sigh - not at you, at it! It's a secret that couldn't be kept secret, a hack for tmpfs reclaim, let's just look forward to it going away.

> Is it considered an error or not?

No, it's definitely not an error. It's a private note from tmpfs (or ramdisk) to vmscan, saying "don't waste your time coming back to me with this page until you have to, please move on to another more likely to be freeable".

> If it's an error, then I usually feel that it's important for
> a stacked f/s to return that error indication upwards.

Indeed, but this is not an error. Remember, neither ramdisk nor tmpfs is stable storage: okay, tmpfs can go out to disk by using swap, but that's not stable storage - it's not reconstituted after reboot. (If there's an error in writing to swap, well, that's a different issue; and there's few filesystems where such an I/O error would be reported from ->writepage.)

> The unionfs page and the lower page are somewhat tied together, at least
> logically. For unionfs's page to be considered to have been written
> successfully, the lower page has to be written successfully. So again, if
> the lower f/s returns AOP_WRITEPAGE_ACTIVATE, should I consider my unionfs
> page to have been written successfully or not?

Consider it written successfully. (What does written mean with tmpfs? it means a page can be freed, it doesn't mean the data is forever safe.)

> If I don't return
> AOP_WRITEPAGE_ACTIVATE up, can there be any chance that some vital data may
> never get flushed out?

Things should work better if you don't return AOP_WRITEPAGE_ACTIVATE. If you mark your page as clean and successfully written, vmscan will be able to free it. If needed, we can get the data back from the lower page on demand, but meanwhile a page has been freed, which is what vmscan reclaim is all about.
(But of course, in the case where you couldn't get hold of a page for the lower, you must redirty yours before returning.)

> > unionfs_writepage also sets AOP_WRITEPAGE_ACTIVATE when it cannot
> > find_lock_page: that case may be appropriate. Though I don't really
> > understand it: seems dangerous to be relying upon the lower level page
> > just happening to be there already. Isn't memory pressure then likely
> > to clog up with lots of upper level dirty pages which cannot get
> > written out to the lower level?
>
> Based on vfs.txt (which perhaps should be revised :-), I was trying to do
> the best I can to ensure that no data is lost if the current page cannot be
> written out to the lower f/s.
>
> I used to do grab_cache_page() before, but that caused problems: writepage
> is not the right place to _increase_ memory pressure by allocating a new
> page...

Yes, but just hoping the lower page will be there, and doing nothing to encourage it to become there, sounds an even poorer strategy to me. It's not easy, I know.

Your position reminds me of the loop driver (drivers/block/loop.c), which has long handled this situation (with great success, though I doubt an absolute guarantee) by taking __GFP_IO|__GFP_FS off the mapping_gfp_mask of the underlying file: look for gfp_mask in loop_set_fd() (and I think ignore do_loop_switch(), that's new to me and seems to be for a very special case).

I grepped for gfp in unionfs, and there seems to be nothing: I doubt you can be robust under memory pressure without doing something about that. If you can take __GFP_IO|__GFP_FS off the lower mapping (just while in unionfs_writepage, or longer term? what locking needed?), then you should be able to go back to using grab_cache_page().

> One solution I thought of is do what ecryptfs does: keep an open struct file
> in my inode and call vfs_write(), but I don't see that as a significantly
> cleaner/better solution.

I agree with you.
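The loop-driver trick Hugh points at amounts to saving the lower mapping's gfp mask and clearing __GFP_IO|__GFP_FS in it while bound, then restoring the saved mask on unbind. Below is a hedged user-space model of that idea, with made-up flag values and invented `model_*` names standing in for the real kernel structures:

```c
#include <assert.h>

/* Stand-in gfp bits: hypothetical values, not the kernel's. */
#define MODEL_GFP_IO  (1u << 0)
#define MODEL_GFP_FS  (1u << 1)

struct model_mapping {
	unsigned int gfp_mask;		/* plays the role of mapping_gfp_mask() */
};

struct model_stack {
	struct model_mapping *lower;
	unsigned int old_gfp_mask;	/* saved mask, restored on unbind */
};

/*
 * Bind to a lower mapping: strip __GFP_IO|__GFP_FS so allocations
 * against it can never recurse into I/O or filesystem reclaim while
 * the stacked layer is writing through it.
 */
static void model_bind(struct model_stack *s, struct model_mapping *m)
{
	s->lower = m;
	s->old_gfp_mask = m->gfp_mask;
	m->gfp_mask = s->old_gfp_mask & ~(MODEL_GFP_IO | MODEL_GFP_FS);
}

/* Unbind: restore the original mask. */
static void model_unbind(struct model_stack *s)
{
	s->lower->gfp_mask = s->old_gfp_mask;
}
```

This is a sketch of the technique, not loop.c itself; whether unionfs would hold the mask stripped only inside unionfs_writepage or for the lifetime of the branch is exactly the open question Hugh raises.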
> (BTW, ecrypfts kinda had to go for vfs_write b/c
> it changes the data size and content of what it writes below; unionfs is
> simpler in that manner b/c it needs to write the same data to the lower file
> at the same offset.)

Ah, yes.

> Another idea we've experimented with before is "page pointer flipping." In
> writepage, we temporarily set the page->mapping->host to the lower_inode;
> then we call the lower writepage with OUR page; then fix back the
> page->mapping->host to the upper inode. This had two benefits: first we can
> guarantee that we always have a page to write below, and second we don't
> need to keep both upper and lower pages (reduces memory pressure). Before
> we did this page pointer flipping, we verified that the page is locked so no
> other user could be written the page->mapping->host in this transient state,
> and we ensured that no lower f/s was somehow caching the temporarily changed
> value of page->mapping->host for later use. But, mucking with the pointers
> in this manner is kinda ugly, to say the least. Still, I'd love to find a
> clean and simple way that two layers can share the same struct page and
> cleanly pass the
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 22 Oct 2007, Erez Zadok wrote:
> In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
> >
> > Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE.
> > Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it
> > if !wbc->for_reclaim. I contend that shmem shouldn't either: it's
> > a special code to get the LRU rotation right, not useful elsewhere.
> > Though Documentation/filesystems/vfs.txt does imply wider use.
>
> Yes, based on vfs.txt I figured unionfs should return
> AOP_WRITEPAGE_ACTIVATE.

unionfs_writepage returns it in two different cases: when it can't find the underlying page; and when the underlying writepage returns it. I'd say it's wrong to return it in both cases.

In the first case, you don't really want your page put back to the head of the active list, you want to come back to try it again quite soon (I think): so you should just redirty and unlock and pretend success.

ramdisk uses A_W_A because none of its pages will ever become freeable (and comment points out it'd be better if they weren't even on the LRUs - I think several people have recently been putting forward patches to keep such timewasters off the LRUs).

shmem uses A_W_A when there's no swap (left), or when the underlying shm is marked as locked in memory: in each case, best to move on to look for other pages to swap out. (But I'm not quite convincing myself that the temporarily out-of-swap case is different from yours above.) It's about fixing some horrid busy loops where vmscan kept going over the same hopeless pages repeatedly, instead of moving on to better candidates. Oh, there's a third case, when move_to_swap_cache fails: that's rare, and I think I was just too lazy to separate them.
In your second case, I fail to see why the unionfs level should mimic the lower level: you've successfully copied data and marked the lower level pages as dirty, vmscan will come back to those in due course, but it's just a waste of time for it to come back to the unionfs pages again - isn't it?

> But, now that unionfs has ->writepages which won't
> even call the lower writepage if BDI_CAP_NO_WRITEBACK is on, then perhaps I
> no longer need unionfs_writepage to bother checking for
> AOP_WRITEPAGE_ACTIVATE, or even return it up?

unionfs_writepages handles the sync/msync/fsync leaking of A_W_A to userspace issue, as does Pekka & Andrew's patch to write_cache_pages, as does my patch to shmem_writepage. And I'm contending that unionfs_writepage should in no case return A_W_A up.

But so long as A_W_A is still defined, unionfs_writepage does still need to check for it after calling the lower level ->writepage (because it needs to do the missing unlock_page): unionfs_writepages prevents unionfs_writepage being called on the normal writeout path, but it's still getting called by vmscan under memory pressure.

(I'm in the habit of saying "vmscan" rather than naming the functions in question, because every few months someone restructures that file and changes their names. I exaggerate, but it's happened often enough.)

> But, a future file system _could_ return AOP_WRITEPAGE_ACTIVATE w/o setting
> BDI_CAP_NO_WRITEBACK, right? In that case, unionfs will still need to
> handle AOP_WRITEPAGE_ACTIVATE in ->writepage, right?

For so long as AOP_WRITEPAGE_ACTIVATE exists, unionfs_writepage needs to check for it coming from the lower level ->writepage, as I said above. But your/Pekka's unionfs_writepages doesn't need to worry about it at all, because Andrew/Pekka's write_cache_pages fix prevents it leaking up in the !reclaim case (as does my shmem_writepage fix): please remove that AOP_WRITEPAGE_ACTIVATE comment from unionfs_writepages.
Hugh
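Hugh's advice to the stacked level above condenses into a small decision table. The sketch below is a hedged model, not unionfs code: the names are invented, and AOP_WRITEPAGE_ACTIVATE is given a stand-in value rather than the one in that era's include/linux/fs.h.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_AOP_WRITEPAGE_ACTIVATE 0x80000	/* stand-in value */

struct wp_decision {
	int ret;		/* what the stacked ->writepage returns */
	bool redirty_upper;	/* no lower page: redirty, unlock, retry later */
	bool unlock_lower;	/* lower returned A_W_A with its page still locked */
};

/*
 * Models the rules argued above: whatever the lower ->writepage says,
 * the stacked level never lets AOP_WRITEPAGE_ACTIVATE escape upward.
 */
static struct wp_decision stacked_writepage(bool have_lower_page, int lower_ret)
{
	struct wp_decision d = { 0, false, false };

	if (!have_lower_page) {
		/* first case: redirty and unlock, pretend success */
		d.redirty_upper = true;
		return d;
	}
	if (lower_ret == MODEL_AOP_WRITEPAGE_ACTIVATE) {
		/* A_W_A skipped the unlock, so the stacked level must do it */
		d.unlock_lower = true;
	} else if (lower_ret < 0) {
		/* genuine I/O errors do propagate */
		d.ret = lower_ret;
	}
	return d;
}
```

Either way the stacked ->writepage reports success for the A_W_A case, leaving vmscan free to reclaim the upper page while the lower page stays dirty and retried in due course.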
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
That's a nice historical review, Hugh, of how we got into this mess we're in now -- it all starts with good intentions. :-)

On a related note, I would just love to get rid of calling the lower ->writepage in unionfs b/c I can't even tell if I have a lower page to use all the time. I'd prefer to call vfs_write() if I can, but I'll need a struct file, or at least a dentry. What ecryptfs does is store a struct file inside its inode, so it can use it later in ->writepage to call vfs_write on the lower f/s. And Unionfs may have to go in that direction too, but this trick is not terribly clean -- storing a file inside an inode.

I realize that the calling path to ->writepage doesn't have a file/dentry any more, but if we're considering larger changes to the writepage related code, can we perhaps consider passing a file or dentry to ->writepage (same as commit_write, perhaps).

Thanks,
Erez.
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 22 Oct 2007, Pekka Enberg wrote:
> On 10/22/07, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> > Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE.
> > Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it
> > if !wbc->for_reclaim. I contend that shmem shouldn't either: it's
> > a special code to get the LRU rotation right, not useful elsewhere.
> > Though Documentation/filesystems/vfs.txt does imply wider use.
> >
> > I think this is where people use the phrase "go figure" ;)
>
> Heh. As far as I can tell, the implication of "wider use" was added by
> Neil in commit "341546f5ad6fce584531f744853a5807a140f2a9 Update some
> VFS documentation", so perhaps he might know? Neil?

I take as gospel this extract from Andrew's original 2.5.52 comment:

    So the locking rules for writepage() are unchanged. They are:
    - Called with the page locked
    - Returns with the page unlocked
    - Must redirty the page itself if it wasn't all written.

    But there is a new, special, hidden, undocumented, secret hack
    for tmpfs: writepage may return WRITEPAGE_ACTIVATE to tell the VM
    to move the page to the active list. The page must be kept locked
    in this one case.

Special, hidden, undocumented, secret hack! Then in 2.6.7 Andrew stole his own secret and used it when concocting ramdisk_writepage. Oh, and NFS made some kind of use of it in 2.6.6 only. Then Neil revealed the secret to the uninitiated in 2.6.17: now, what's the appropriate punishment for that?

In the full 2.5.52 comment, Andrew explains how prior to this secret code, we used fail_writepage, which in the memory pressure case did an activate_page, with the intention of moving the page to the active list - but that didn't actually work, because the page is off the LRUs at this point, being passed around between pagevecs. I've always preferred the way it was originally trying to do it, which seems clearer and less error-prone than having a special code which people then have to worry about.
Here's the patch I'd like to present in due course (against 2.6.24-rc1, so unionfs absent): tmpfs and ramdisk simply SetPageActive for this case (and go back to obeying the usual unlocking rule for writepage), vmscan.c observe and act accordingly. But I've not tested it at all (well, I've run with it in, but not actually going down the paths in question): it may suffer from something silly like the original fail_writepage.

Plus I might be persuaded into making inc_zone_page_state(page, NR_VMSCAN_WRITE) conditional on !PageActive(page), just to produce the same stats as before (though they don't make a lot of sense, counting other non-writes as writes). And would it need a deprecation phase?

Hugh

 Documentation/filesystems/Locking |    6 +-
 Documentation/filesystems/vfs.txt |    4 +---
 drivers/block/rd.c                |    5 ++---
 include/linux/fs.h                |   10 --
 mm/migrate.c                      |    5 ++---
 mm/page-writeback.c               |    4
 mm/shmem.c                        |   11 ---
 mm/vmscan.c                       |   17 ++---
 8 files changed, 20 insertions(+), 42 deletions(-)

--- 2.6.24-rc1/Documentation/filesystems/Locking	2007-10-24 07:15:11.0 +0100
+++ linux/Documentation/filesystems/Locking	2007-10-24 08:42:07.0 +0100
@@ -228,11 +228,7 @@ If the filesystem is called for sync the
 in-progress I/O and then start new I/O.
 
 The filesystem should unlock the page synchronously, before returning to the
-caller, unless ->writepage() returns special WRITEPAGE_ACTIVATE
-value. WRITEPAGE_ACTIVATE means that page cannot really be written out
-currently, and VM should stop calling ->writepage() on this page for some
-time. VM does this by moving page to the head of the active list, hence the
-name.
+caller.
 
 Unless the filesystem is going to redirty_page_for_writepage(), unlock
 the page and return zero, writepage *must* run set_page_writeback()
 against the page,

--- 2.6.24-rc1/Documentation/filesystems/vfs.txt	2007-10-24 07:15:11.0 +0100
+++ linux/Documentation/filesystems/vfs.txt	2007-10-24 08:42:07.0 +0100
@@ -567,9 +567,7 @@ struct address_space_operations {
        If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
        try too hard if there are problems, and may choose to write out
        other pages from the mapping if that is easier (e.g. due to
-       internal dependencies). If it chooses not to start writeout, it
-       should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
-       calling ->writepage on that page.
+       internal dependencies).
 
        See the file "Locking" for more details.

--- 2.6.24-rc1/drivers/block/rd.c	2007-10-24 07:15:23.0 +0100
+++ linux/drivers/block/rd.c	2007-10-24 08:42:07.0 +0100
@@ -152,8 +152,7 @@ static int ramdisk_commit_write(struct f
 
 /*
  * ->writepage to the blockdev's mapping has to redirty the page so that the
- * VM doesn't go and steal
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 22 Oct 2007, Pekka Enberg wrote: On 10/22/07, Hugh Dickins [EMAIL PROTECTED] wrote: Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE. Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it if !wbc-for_reclaim. I contend that shmem shouldn't either: it's a special code to get the LRU rotation right, not useful elsewhere. Though Documentation/filesystems/vfs.txt does imply wider use. I think this is where people use the phrase go figure ;) Heh. As far as I can tell, the implication of wider use was added by Neil in commit 341546f5ad6fce584531f744853a5807a140f2a9 Update some VFS documentation, so perhaps he might know? Neil? I take as gospel this extract from Andrew's original 2.5.52 comment: So the locking rules for writepage() are unchanged. They are: - Called with the page locked - Returns with the page unlocked - Must redirty the page itself if it wasn't all written. But there is a new, special, hidden, undocumented, secret hack for tmpfs: writepage may return WRITEPAGE_ACTIVATE to tell the VM to move the page to the active list. The page must be kept locked in this one case. Special, hidden, undocumented, secret hack! Then in 2.6.7 Andrew stole his own secret and used it when concocting ramdisk_writepage. Oh, and NFS made some kind of use of it in 2.6.6 only. Then Neil revealed the secret to the uninitiated in 2.6.17: now, what's the appropriate punishment for that? In the full 2.5.52 comment, Andrew explains how prior to this secret code, we used fail_writepage, which in the memory pressure case did an activate_page, with the intention of moving the page to the active list - but that didn't actually work, because the page is off the LRUs at this point, being passed around between pagevecs. I've always preferred the way it was originally trying to do it, which seems clearer and less error-prone than having a special code which people then have to worry about. 
Here's the patch I'd like to present in due course (against 2.6.24-rc1, so unionfs absent): tmpfs and ramdisk simply SetPageActive for this case (and go back to obeying the usual unlocking rule for writepage), vmscan.c observe and act accordingly. But I've not tested it at all (well, I've run with it in, but not actually going down the paths in question): it may suffer from something silly like the original fail_writepage. Plus I might be persuaded into making inc_zone_page_state(page, NR_VMSCAN_WRITE) conditional on !PageActive(page), just to produce the same stats as before (though they don't make a lot of sense, counting other non-writes as writes). And would it need a deprecation phase? Hugh Documentation/filesystems/Locking |6 +- Documentation/filesystems/vfs.txt |4 +--- drivers/block/rd.c|5 ++--- include/linux/fs.h| 10 -- mm/migrate.c |5 ++--- mm/page-writeback.c |4 mm/shmem.c| 11 --- mm/vmscan.c | 17 ++--- 8 files changed, 20 insertions(+), 42 deletions(-) --- 2.6.24-rc1/Documentation/filesystems/Locking2007-10-24 07:15:11.0 +0100 +++ linux/Documentation/filesystems/Locking 2007-10-24 08:42:07.0 +0100 @@ -228,11 +228,7 @@ If the filesystem is called for sync the in-progress I/O and then start new I/O. The filesystem should unlock the page synchronously, before returning to the -caller, unless -writepage() returns special WRITEPAGE_ACTIVATE -value. WRITEPAGE_ACTIVATE means that page cannot really be written out -currently, and VM should stop calling -writepage() on this page for some -time. VM does this by moving page to the head of the active list, hence the -name. +caller. 
Unless the filesystem is going to redirty_page_for_writepage(), unlock the page and return zero, writepage *must* run set_page_writeback() against the page, --- 2.6.24-rc1/Documentation/filesystems/vfs.txt2007-10-24 07:15:11.0 +0100 +++ linux/Documentation/filesystems/vfs.txt 2007-10-24 08:42:07.0 +0100 @@ -567,9 +567,7 @@ struct address_space_operations { If wbc-sync_mode is WB_SYNC_NONE, -writepage doesn't have to try too hard if there are problems, and may choose to write out other pages from the mapping if that is easier (e.g. due to - internal dependencies). If it chooses not to start writeout, it - should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep - calling -writepage on that page. + internal dependencies). See the file Locking for more details. --- 2.6.24-rc1/drivers/block/rd.c 2007-10-24 07:15:23.0 +0100 +++ linux/drivers/block/rd.c2007-10-24 08:42:07.0 +0100 @@ -152,8 +152,7 @@ static int ramdisk_commit_write(struct f /* * -writepage to the blockdev's mapping has to redirty the page so that the - * VM doesn't go and steal
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 22 Oct 2007, Erez Zadok wrote: In message [EMAIL PROTECTED], Hugh Dickins writes: Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE. Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it if !wbc-for_reclaim. I contend that shmem shouldn't either: it's a special code to get the LRU rotation right, not useful elsewhere. Though Documentation/filesystems/vfs.txt does imply wider use. Yes, based on vfs.txt I figured unionfs should return AOP_WRITEPAGE_ACTIVATE. unionfs_writepage returns it in two different cases: when it can't find the underlying page; and when the underlying writepage returns it. I'd say it's wrong to return it in both cases. In the first case, you don't really want your page put back to the head of the active list, you want to come back to try it again quite soon (I think): so you should just redirty and unlock and pretend success. ramdisk uses A_W_A because none of its pages will ever become freeable (and comment points out it'd be better if they weren't even on the LRUs - I think several people have recently been putting forward patches to keep such timewasters off the LRUs). shmem uses A_W_A when there's no swap (left), or when the underlying shm is marked as locked in memory: in each case, best to move on to look for other pages to swap out. (But I'm not quite convincing myself that the temporarily out-of-swap case is different from yours above.) It's about fixing some horrid busy loops where vmscan kept going over the same hopeless pages repeatedly, instead of moving on to better candidates. Oh, there's a third case, when move_to_swap_cache fails: that's rare, and I think I was just too lazy to separate them. In your second case, I fail to see why the unionfs level should mimic the lower level: you've successfully copied data and marked the lower level pages as dirty, vmscan will come back to those in due course, but it's just a waste of time for it to come back to the unionfs pages again - isn't it? 
But, now that unionfs has -writepages which won't even call the lower writepage if BDI_CAP_NO_WRITEBACK is on, then perhaps I no longer need unionfs_writepage to bother checking for AOP_WRITEPAGE_ACTIVATE, or even return it up? unionfs_writepages handles the sync/msync/fsync leaking of A_W_A to userspace issue, as does Pekka Andrew's patch to write_cache_pages, as does my patch to shmem_writepage. And I'm contending that unionfs_writepage should in no case return A_W_A up. But so long as A_W_A is still defined, unionfs_writepage does still need to check for it after calling the lower level -writepage (because it needs to do the missing unlock_page): unionfs_writepages prevents unionfs_writepage being called on the normal writeout path, but it's still getting called by vmscan under memory pressure. (I'm in the habit of saying vmscan rather than naming the functions in question, because every few months someone restructures that file and changes their names. I exaggerate, but it's happened often enough.) But, a future file system _could_ return AOP_WRITEPAGE_ACTIVATE w/o setting BDI_CAP_NO_WRITEBACK, right? In that case, unionfs will still need to handle AOP_WRITEPAGE_ACTIVATE in -writepage, right? For so long as AOP_WRITEPAGE_ACTIVATE exists, unionfs_writepage needs to check for it coming from the lower level -writepage, as I said above. But your/Pekka's unionfs_writepages doesn't need to worry about it at all, because Andrew/Pekka's write_cache_pages fix prevents it leaking up in the !reclaim case (as does my shmem_writepage fix): please remove that AOP_WRITEPAGE_ACTIVATE comment from unionfs_writepages. Hugh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
That's a nice historical review, Huge, of how got into these mess we're in now -- it all starts with good intentions. :-) On a related note, I would just love to get rid of calling the lower -writepage in unionfs b/c I can't even tell if I have a lower page to use all the time. I'd prefer to call vfs_write() if I can, but I'll need a struct file, or at least a dentry. What ecryptfs does is store a struct file inside it's inode, so it can use it later in -writepage to call vfs_write on the lower f/s. And Unionfs may have to go in that direction too, but this trick is not terribly clean -- storing a file inside an inode. I realize that the calling path to -writepage doesn't have a file/dentry any more, but if we're considering larger changes to the writepage related code, can we perhaps consider passing a file or dentry to writepage (same as commit_write, perhaps). Thanks, Erez. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 22 Oct 2007, Erez Zadok wrote:
> What's the precise semantics of AOP_WRITEPAGE_ACTIVATE?

Sigh - not at you, at it! It's a secret that couldn't be kept secret, a hack for tmpfs reclaim, let's just look forward to it going away.

> Is it considered an error or not?

No, it's definitely not an error. It's a private note from tmpfs (or ramdisk) to vmscan, saying "don't waste your time coming back to me with this page until you have to, please move on to another more likely to be freeable".

> If it's an error, then I usually feel that it's important for a stacked
> f/s to return that error indication upwards.

Indeed, but this is not an error. Remember, neither ramdisk nor tmpfs is stable storage: okay, tmpfs can go out to disk by using swap, but that's not stable storage - it's not reconstituted after reboot. (If there's an error in writing to swap, well, that's a different issue; and there are few filesystems where such an I/O error would be reported from ->writepage.)

> The unionfs page and the lower page are somewhat tied together, at least
> logically. For unionfs's page to be considered to have been written
> successfully, the lower page has to be written successfully. So again,
> if the lower f/s returns AOP_WRITEPAGE_ACTIVATE, should I consider my
> unionfs page to have been written successfully or not?

Consider it written successfully. (What does "written" mean with tmpfs? it means a page can be freed, it doesn't mean the data is forever safe.)

> If I don't return AOP_WRITEPAGE_ACTIVATE up, can there be any chance
> that some vital data may never get flushed out?

Things should work better if you don't return AOP_WRITEPAGE_ACTIVATE. If you mark your page as clean and successfully written, vmscan will be able to free it. If needed, we can get the data back from the lower page on demand, but meanwhile a page has been freed, which is what vmscan reclaim is all about. (But of course, in the case where you couldn't get hold of a page for the lower, you must redirty yours before returning.)

> > unionfs_writepage also sets AOP_WRITEPAGE_ACTIVATE when it cannot
> > find_lock_page: that case may be appropriate. Though I don't really
> > understand it: seems dangerous to be relying upon the lower level page
> > just happening to be there already. Isn't memory pressure then likely
> > to clog up with lots of upper level dirty pages which cannot get
> > written out to the lower level?
>
> Based on vfs.txt (which perhaps should be revised :-), I was trying to
> do the best I can to ensure that no data is lost if the current page
> cannot be written out to the lower f/s. I used to do grab_cache_page()
> before, but that caused problems: writepage is not the right place to
> _increase_ memory pressure by allocating a new page...

Yes, but just hoping the lower page will be there, and doing nothing to encourage it to become there, sounds an even poorer strategy to me. It's not easy, I know. Your position reminds me of the loop driver (drivers/block/loop.c), which has long handled this situation (with great success, though I doubt an absolute guarantee) by taking __GFP_IO|__GFP_FS off the mapping_gfp_mask of the underlying file: look for gfp_mask in loop_set_fd() (and I think ignore do_loop_switch(), that's new to me and seems to be for a very special case).

I grepped for gfp in unionfs, and there seems to be nothing: I doubt you can be robust under memory pressure without doing something about that. If you can take __GFP_IO|__GFP_FS off the lower mapping (just while in unionfs_writepage, or longer term? what locking needed?), then you should be able to go back to using grab_cache_page().

> One solution I thought of is to do what ecryptfs does: keep an open
> struct file in my inode and call vfs_write(), but I don't see that as a
> significantly cleaner/better solution.

I agree with you.

> (BTW, ecryptfs kinda had to go for vfs_write b/c it changes the data
> size and content of what it writes below; unionfs is simpler in that
> manner b/c it needs to write the same data to the lower file at the
> same offset.)

Ah, yes.

> Another idea we've experimented with before is "page pointer flipping."
> In writepage, we temporarily set the page->mapping->host to the
> lower_inode; then we call the lower writepage with OUR page; then fix
> back the page->mapping->host to the upper inode. This had two benefits:
> first we can guarantee that we always have a page to write below, and
> second we don't need to keep both upper and lower pages (reduces memory
> pressure). Before we did this page pointer flipping, we verified that
> the page is locked so no other user could have written the
> page->mapping->host in this transient state, and we ensured that no
> lower f/s was somehow caching the temporarily changed value of
> page->mapping->host for later use. But, mucking with the pointers in
> this manner is kinda ugly, to say the least. Still, I'd love to find a
> clean and simple way that two layers can share the same struct page and
> cleanly pass the upper page to a lower f/s.
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Thu, 25 Oct 2007, Erez Zadok wrote:
> On a related note, I would just love to get rid of calling the lower
> ->writepage in unionfs b/c I can't even tell if I have a lower page to
> use all the time. I'd prefer to call vfs_write() if I can, but I'll
> need a struct file, or at least a dentry.

Why do you want to do that? You gave a good reason why it's easier for ecryptfs, but I doubt it's robust. The higher the level you choose to use, the harder to guarantee it won't deadlock. Or that's my gut feeling anyway. It's several years since I've thought about such issues: just because I came into this knowing about shmem_writepage, is perhaps not a good reason to choose me as advisor!

Hugh
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Thursday October 25, [EMAIL PROTECTED] wrote:
> On Mon, 22 Oct 2007, Pekka Enberg wrote:
> > On 10/22/07, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> > > Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE.
> > > Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it
> > > if !wbc->for_reclaim. I contend that shmem shouldn't either: it's
> > > a special code to get the LRU rotation right, not useful elsewhere.
> > > Though Documentation/filesystems/vfs.txt does imply wider use.
> > >
> > > I think this is where people use the phrase "go figure" ;)
> >
> > Heh. As far as I can tell, the implication of "wider use" was added by
> > Neil in commit "341546f5ad6fce584531f744853a5807a140f2a9 Update some
> > VFS documentation", so perhaps he might know? Neil?

I just read the code, tried to understand it, translated that understanding into English, and put that in vfs.txt. It is very possible that what I wrote didn't match the intention of the author, but it seemed to match the behaviour of the code.

The patch looks like it makes perfect sense to me. Before the change, ->writepage could return AOP_WRITEPAGE_ACTIVATE without unlocking the page, and this has precisely the effect of:

    ClearPageReclaim();  (if the call path was through pageout)
    SetPageActive();     (if the call was through shrink_page_list)
    unlock_page();

With the patch, the ->writepage method does the SetPageActive and the unlock_page, which on the whole seems cleaner. We seem to have lost a call to ClearPageReclaim - I don't know if that is significant.

> Special, hidden, undocumented, secret hack! Then in 2.6.7 Andrew stole
> his own secret and used it when concocting ramdisk_writepage. Oh, and
> NFS made some kind of use of it in 2.6.6 only. Then Neil revealed the
> secret to the uninitiated in 2.6.17: now, what's the appropriate
> punishment for that?

Surely the punishment should be for writing hidden undocumented hacks in the first place! I vote we just make him maintainer for the whole kernel - that will keep him so busy that he will never have a chance to do it again :-)

> --- 2.6.24-rc1/Documentation/filesystems/vfs.txt	2007-10-24 07:15:11.0 +0100
> +++ linux/Documentation/filesystems/vfs.txt	2007-10-24 08:42:07.0 +0100
> @@ -567,9 +567,7 @@ struct address_space_operations {
>        If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
>        try too hard if there are problems, and may choose to write out
>        other pages from the mapping if that is easier (e.g. due to
> -      internal dependencies). If it chooses not to start writeout, it
> -      should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
> -      calling ->writepage on that page.
> +      internal dependencies).

It seems that the new requirement is that if the address_space chooses not to write out the page, it should now call SetPageActive(). If that is the case, I think it should be explicit in the documentation - please?

NeilBrown
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, Hugh Dickins writes:

> Sorry for my delay, here are a few replies.
>
> > In unionfs_writepage() I tried to emulate as best possible what the lower
> > f/s will have returned to the VFS. Since tmpfs's ->writepage can return
> > AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in
> > unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE.
>
> I think that's inappropriate. Why should unionfs_writepage re-mark its
> page as dirty when the lower level does so? Unionfs has successfully
> done its write to the lower level, what the lower level then gets up to
> (writing then or not) is its own business: needn't be propagated upwards.

What's the precise semantics of AOP_WRITEPAGE_ACTIVATE? Is it considered an error or not? If it's an error, then I usually feel that it's important for a stacked f/s to return that error indication upwards.

The unionfs page and the lower page are somewhat tied together, at least logically. For unionfs's page to be considered to have been written successfully, the lower page has to be written successfully. So again, if the lower f/s returns AOP_WRITEPAGE_ACTIVATE, should I consider my unionfs page to have been written successfully or not? If I don't return AOP_WRITEPAGE_ACTIVATE up, can there be any chance that some vital data may never get flushed out?

Anyway, now that unionfs has ->writepages that won't bother calling the lower ->writepage for file systems with BDI_CAP_NO_WRITEBACK, the issue of AOP_WRITEPAGE_ACTIVATE in ->writepage may be less important.

> unionfs_writepage also sets AOP_WRITEPAGE_ACTIVATE when it cannot
> find_lock_page: that case may be appropriate. Though I don't really
> understand it: seems dangerous to be relying upon the lower level page
> just happening to be there already. Isn't memory pressure then likely
> to clog up with lots of upper level dirty pages which cannot get
> written out to the lower level?
Based on vfs.txt (which perhaps should be revised :-), I was trying to do the best I can to ensure that no data is lost if the current page cannot be written out to the lower f/s. I used to do grab_cache_page() before, but that caused problems: writepage is not the right place to _increase_ memory pressure by allocating a new page...

One solution I thought of is to do what ecryptfs does: keep an open struct file in my inode and call vfs_write(), but I don't see that as a significantly cleaner/better solution. (BTW, ecryptfs kinda had to go for vfs_write b/c it changes the data size and content of what it writes below; unionfs is simpler in that manner b/c it needs to write the same data to the lower file at the same offset.)

Another idea we've experimented with before is "page pointer flipping." In writepage, we temporarily set the page->mapping->host to the lower_inode; then we call the lower writepage with OUR page; then fix back the page->mapping->host to the upper inode. This had two benefits: first we can guarantee that we always have a page to write below, and second we don't need to keep both upper and lower pages (reduces memory pressure). Before we did this page pointer flipping, we verified that the page is locked so no other user could have written the page->mapping->host in this transient state, and we ensured that no lower f/s was somehow caching the temporarily changed value of page->mapping->host for later use. But, mucking with the pointers in this manner is kinda ugly, to say the least. Still, I'd love to find a clean and simple way that two layers can share the same struct page and cleanly pass the upper page to a lower f/s.

If you've got suggestions on how I can handle unionfs_write more cleanly, or comments on the above possibilities, I'd love to hear them.

> > Should I be doing something different when unionfs stacks on top of tmpfs?
>
> I think not.
>
> > (BTW, this is probably also relevant to ecryptfs.)
> You're both agreed on that, but I don't see how: ecryptfs writes the
> lower level via vfs_write, it's not using the lower level's writepage,
> is it?

Yup. ecryptfs no longer does that: it recently changed things and now it stores an open struct file in its inode, so it can always pass the file to vfs_write. This nicely avoids calling the lower writepage, but one has to keep an open file for every inode. Neither of the solutions currently employed by unionfs and ecryptfs seems really satisfactory (clean and efficient).

> Hugh

Thanks,
Erez.
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, Hugh Dickins writes: > On Mon, 15 Oct 2007, Pekka Enberg wrote: > > > > I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that > > ->writepage() will never return AOP_WRITEPAGE_ACTIVATE for > > !wbc->for_reclaim case which would explain why we haven't hit this bug > > before. Hugh, Andrew? > > Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE. > Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it > if !wbc->for_reclaim. I contend that shmem shouldn't either: it's > a special code to get the LRU rotation right, not useful elsewhere. > Though Documentation/filesystems/vfs.txt does imply wider use. Yes, based on vfs.txt I figured unionfs should return AOP_WRITEPAGE_ACTIVATE. But, now that unionfs has ->writepages which won't even call the lower writepage if BDI_CAP_NO_WRITEBACK is on, then perhaps I no longer need unionfs_writepage to bother checking for AOP_WRITEPAGE_ACTIVATE, or even return it up? But, a future file system _could_ return AOP_WRITEPAGE_ACTIVATE w/o setting BDI_CAP_NO_WRITEBACK, right? In that case, unionfs will still need to handle AOP_WRITEPAGE_ACTIVATE in ->writepage, right? > I think this is where people use the phrase "go figure" ;) > > Hugh Erez. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Hugh, On Mon, 15 Oct 2007, Pekka Enberg wrote: > > I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that > > ->writepage() will never return AOP_WRITEPAGE_ACTIVATE for > > !wbc->for_reclaim case which would explain why we haven't hit this bug > > before. Hugh, Andrew? On 10/22/07, Hugh Dickins <[EMAIL PROTECTED]> wrote: > Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE. > Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it > if !wbc->for_reclaim. I contend that shmem shouldn't either: it's > a special code to get the LRU rotation right, not useful elsewhere. > Though Documentation/filesystems/vfs.txt does imply wider use. > > I think this is where people use the phrase "go figure" ;) Heh. As far as I can tell, the implication of "wider use" was added by Neil in commit "341546f5ad6fce584531f744853a5807a140f2a9 Update some VFS documentation", so perhaps he might know? Neil? Pekka - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Hugh, On 10/22/07, Hugh Dickins <[EMAIL PROTECTED]> wrote: > I don't disagree with your unionfs_writepages patch, Pekka, but I think > it should be viewed as an optimization (don't waste time trying to write > a group of pages when we know that nothing will be done) rather than as > essential. Ok, so tmpfs needs your fix still. On 10/22/07, Hugh Dickins <[EMAIL PROTECTED]> wrote: > > So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE > > from being returned to userland. I guess we still need it, b/c even with > > your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to > > the VFS and we need to ensure that doesn't "leak" outside the kernel. > > Can it now? Current git has a patch from Andrew which bears a striking > resemblance to that from Pekka, stopping the leak from write_cache_pages. I don't think it can, it looks ok now. Pekka - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 15 Oct 2007, Pekka Enberg wrote: > > I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that > ->writepage() will never return AOP_WRITEPAGE_ACTIVATE for > !wbc->for_reclaim case which would explain why we haven't hit this bug > before. Hugh, Andrew? Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE. Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it if !wbc->for_reclaim. I contend that shmem shouldn't either: it's a special code to get the LRU rotation right, not useful elsewhere. Though Documentation/filesystems/vfs.txt does imply wider use. I think this is where people use the phrase "go figure" ;) Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Sun, 14 Oct 2007, Erez Zadok wrote: > In message <[EMAIL PROTECTED]>, Pekka J Enberg writes: > > > > Look at mm/filemap.c:__filemap_fdatawrite_range(). You shouldn't be > > calling unionfs_writepage() _at all_ if the lower mapping has > > BDI_CAP_NO_WRITEBACK capability set. Perhaps something like the totally > > untested patch below? ... I don't disagree with your unionfs_writepages patch, Pekka, but I think it should be viewed as an optimization (don't waste time trying to write a group of pages when we know that nothing will be done) rather than as essential. Prior to unionfs's own use of AOP_WRITEPAGE_ACTIVATE, there have only been ramdisk and shmem generating it. ramdisk is careful only to return it in the wbc->for_reclaim case: I think (as in the patch I sent out before) shmem now ought to do so too for safety. Back in 2.4 days it was reasonable to assume that ->writepage would only get called from certain places, but things move faster nowadays, and the unionfs example shows others are liable to start ab/using it. I'll send Andrew that patch tomorrow (it's simple enough, but I'd like at least to try to reproduce the page_mapped bug first). > > Pekka, with a small change to your patch (to handle time-based cache > coherency), your patch worked well and passed all my tests. Thanks. > > So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE > from being returned to userland. I guess we still need it, b/c even with > your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to > the VFS and we need to ensure that doesn't "leak" outside the kernel. Can it now? Current git has a patch from Andrew which bears a striking resemblance to that from Pekka, stopping the leak from write_cache_pages. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Sorry for my delay, here are a few replies. On Sun, 14 Oct 2007, Erez Zadok wrote: > In message <[EMAIL PROTECTED]>, "Pekka Enberg" writes: > > > > However, I don't think the mapping_cap_writeback_dirty() check in > > __filemap_fdatawrite_range() works as expected when tmpfs is a lower > > mount for an unionfs mount. There's no BDI_CAP_NO_WRITEBACK capability > > for unionfs mappings so do_fsync() will call write_cache_pages() that > > unconditionally invokes shmem_writepage() via unionfs_writepage(). > > Unless, of course, there's some other unionfs magic I am missing. Thanks, Pekka, yes that made a lot of sense. > > In unionfs_writepage() I tried to emulate as best possible what the lower > f/s will have returned to the VFS. Since tmpfs's ->writepage can return > AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in > unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE. I think that's inappropriate. Why should unionfs_writepage re-mark its page as dirty when the lower level does so? Unionfs has successfully done its write to the lower level, what the lower level then gets up to (writing then or not) is its own business: needn't be propagated upwards. The fewer places that supply AOP_WRITEPAGE_ACTIVATE the better. What I'd like most of all is to eliminate it, in favour of vmscan.c working out the condition for itself: but I've given that no thought, it may not be reasonable. unionfs_writepage also sets AOP_WRITEPAGE_ACTIVATE when it cannot find_lock_page: that case may be appropriate. Though I don't really understand it: seems dangerous to be relying upon the lower level page just happening to be there already. Isn't memory pressure then likely to clog up with lots of upper level dirty pages which cannot get written out to the lower level? > > Should I be doing something different when unionfs stacks on top of tmpfs? I think not. > (BTW, this is probably also relevant to ecryptfs.) 
You're both agreed on that, but I don't see how: ecryptfs writes the lower level via vfs_write, it's not using the lower level's writepage, is it? Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Sorry for my delay, here are a few replies. On Sun, 14 Oct 2007, Erez Zadok wrote: In message [EMAIL PROTECTED], Pekka Enberg writes: However, I don't think the mapping_cap_writeback_dirty() check in __filemap_fdatawrite_range() works as expected when tmpfs is a lower mount for an unionfs mount. There's no BDI_CAP_NO_WRITEBACK capability for unionfs mappings so do_fsync() will call write_cache_pages() that unconditionally invokes shmem_writepage() via unionfs_writepage(). Unless, of course, there's some other unionfs magic I am missing. Thanks, Pekka, yes that made a lot of sense. In unionfs_writepage() I tried to emulate as best possible what the lower f/s will have returned to the VFS. Since tmpfs's -writepage can return AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE. I think that's inappropriate. Why should unionfs_writepage re-mark its page as dirty when the lower level does so? Unionfs has successfully done its write to the lower level, what the lower level then gets up to (writing then or not) is its own business: needn't be propagated upwards. The fewer places that supply AOP_WRITEPAGE_ACTIVATE the better. What I'd like most of all is to eliminate it, in favour of vmscan.c working out the condition for itself: but I've given that no thought, it may not be reasonable. unionfs_writepage also sets AOP_WRITEPAGE_ACTIVATE when it cannot find_lock_page: that case may be appropriate. Though I don't really understand it: seems dangerous to be relying upon the lower level page just happening to be there already. Isn't memory pressure then likely to clog up with lots of upper level dirty pages which cannot get written out to the lower level? Should I be doing something different when unionfs stacks on top of tmpfs? I think not. (BTW, this is probably also relevant to ecryptfs.) 
You're both agreed on that, but I don't see how: ecryptfs writes the lower level via vfs_write, it's not using the lower level's writepage, is it? Hugh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Sun, 14 Oct 2007, Erez Zadok wrote: In message [EMAIL PROTECTED], Pekka J Enberg writes: Look at mm/filemap.c:__filemap_fdatawrite_range(). You shouldn't be calling unionfs_writepage() _at all_ if the lower mapping has BDI_CAP_NO_WRITEBACK capability set. Perhaps something like the totally untested patch below? ... I don't disagree with your unionfs_writepages patch, Pekka, but I think it should be viewed as an optimization (don't waste time trying to write a group of pages when we know that nothing will be done) rather than as essential. Prior to unionfs's own use of AOP_WRITEPAGE_ACTIVATE, there have only been ramdisk and shmem generating it. ramdisk is careful only to return it in the wbc-for_reclaim case: I think (as in the patch I sent out before) shmem now ought to do so too for safety. Back in 2.4 days it was reasonable to assume that -writepage would only get called from certain places, but things move faster nowadays, and the unionfs example shows others are liable to start ab/using it. I'll send Andrew that patch tomorrow (it's simple enough, but I'd like at least to try to reproduce the page_mapped bug first). Pekka, with a small change to your patch (to handle time-based cache coherency), your patch worked well and passed all my tests. Thanks. So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE from being returned to userland. I guess we still need it, b/c even with your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to the VFS and we need to ensure that doesn't leak outside the kernel. Can it now? Current git has a patch from Andrew which bears a striking resemblance to that from Pekka, stopping the leak from write_cache_pages. Hugh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 15 Oct 2007, Pekka Enberg wrote: I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that -writepage() will never return AOP_WRITEPAGE_ACTIVATE for !wbc-for_reclaim case which would explain why we haven't hit this bug before. Hugh, Andrew? Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE. Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it if !wbc-for_reclaim. I contend that shmem shouldn't either: it's a special code to get the LRU rotation right, not useful elsewhere. Though Documentation/filesystems/vfs.txt does imply wider use. I think this is where people use the phrase go figure ;) Hugh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Hugh, On 10/22/07, Hugh Dickins [EMAIL PROTECTED] wrote: I don't disagree with your unionfs_writepages patch, Pekka, but I think it should be viewed as an optimization (don't waste time trying to write a group of pages when we know that nothing will be done) rather than as essential. Ok, so tmpfs needs your fix still. On 10/22/07, Hugh Dickins [EMAIL PROTECTED] wrote: So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE from being returned to userland. I guess we still need it, b/c even with your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to the VFS and we need to ensure that doesn't leak outside the kernel. Can it now? Current git has a patch from Andrew which bears a striking resemblance to that from Pekka, stopping the leak from write_cache_pages. I don't think it can, it looks ok now. Pekka - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Hugh, On Mon, 15 Oct 2007, Pekka Enberg wrote: I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that -writepage() will never return AOP_WRITEPAGE_ACTIVATE for !wbc-for_reclaim case which would explain why we haven't hit this bug before. Hugh, Andrew? On 10/22/07, Hugh Dickins [EMAIL PROTECTED] wrote: Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE. Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it if !wbc-for_reclaim. I contend that shmem shouldn't either: it's a special code to get the LRU rotation right, not useful elsewhere. Though Documentation/filesystems/vfs.txt does imply wider use. I think this is where people use the phrase go figure ;) Heh. As far as I can tell, the implication of wider use was added by Neil in commit 341546f5ad6fce584531f744853a5807a140f2a9 Update some VFS documentation, so perhaps he might know? Neil? Pekka - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message [EMAIL PROTECTED], Hugh Dickins writes: On Mon, 15 Oct 2007, Pekka Enberg wrote: I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that -writepage() will never return AOP_WRITEPAGE_ACTIVATE for !wbc-for_reclaim case which would explain why we haven't hit this bug before. Hugh, Andrew? Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE. Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it if !wbc-for_reclaim. I contend that shmem shouldn't either: it's a special code to get the LRU rotation right, not useful elsewhere. Though Documentation/filesystems/vfs.txt does imply wider use. Yes, based on vfs.txt I figured unionfs should return AOP_WRITEPAGE_ACTIVATE. But, now that unionfs has -writepages which won't even call the lower writepage if BDI_CAP_NO_WRITEBACK is on, then perhaps I no longer need unionfs_writepage to bother checking for AOP_WRITEPAGE_ACTIVATE, or even return it up? But, a future file system _could_ return AOP_WRITEPAGE_ACTIVATE w/o setting BDI_CAP_NO_WRITEBACK, right? In that case, unionfs will still need to handle AOP_WRITEPAGE_ACTIVATE in -writepage, right? I think this is where people use the phrase go figure ;) Hugh Erez. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message [EMAIL PROTECTED], Hugh Dickins writes: Sorry for my delay, here are a few replies. In unionfs_writepage() I tried to emulate as best possible what the lower f/s will have returned to the VFS. Since tmpfs's -writepage can return AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE. I think that's inappropriate. Why should unionfs_writepage re-mark its page as dirty when the lower level does so? Unionfs has successfully done its write to the lower level, what the lower level then gets up to (writing then or not) is its own business: needn't be propagated upwards. What's the precise semantics of AOP_WRITEPAGE_ACTIVATE? Is it considered an error or not? If it's an error, then I usually feel that it's important for a stacked f/s to return that error indication upwards. The unionfs page and the lower page are somewhat tied together, at least logically. For unionfs's page to be considered to have been written successfully, the lower page has to be written successfully. So again, if the lower f/s returns AOP_WRITEPAGE_ACTIVATE, should I consider my unionfs page to have been written successfully or not? If I don't return AOP_WRITEPAGE_ACTIVATE up, can there be any chance that some vital data may never get flushed out? Anyway, now that unionfs has -writepages that won't bother calling -write for file systems with BDI_CAP_NO_WRITEBACK, the issue of AOP_WRITEPAGE_ACTIVATE in -writepage may be less important. unionfs_writepage also sets AOP_WRITEPAGE_ACTIVATE when it cannot find_lock_page: that case may be appropriate. Though I don't really understand it: seems dangerous to be relying upon the lower level page just happening to be there already. Isn't memory pressure then likely to clog up with lots of upper level dirty pages which cannot get written out to the lower level? 
Based on vfs.txt (which perhaps should be revised :-), I was trying to do the best I can to ensure that no data is lost if the current page cannot be written out to the lower f/s. I used to do grab_cache_page() before, but that caused problems: writepage is not the right place to _increase_ memory pressure by allocating a new page... One solution I thought of is do what ecryptfs does: keep an open struct file in my inode and call vfs_write(), but I don't see that as a significantly cleaner/better solution. (BTW, ecrypfts kinda had to go for vfs_write b/c it changes the data size and content of what it writes below; unionfs is simpler in that manner b/c it needs to write the same data to the lower file at the same offset.) Another idea we've experimented with before is page pointer flipping. In writepage, we temporarily set the page-mapping-host to the lower_inode; then we call the lower writepage with OUR page; then fix back the page-mapping-host to the upper inode. This had two benefits: first we can guarantee that we always have a page to write below, and second we don't need to keep both upper and lower pages (reduces memory pressure). Before we did this page pointer flipping, we verified that the page is locked so no other user could be written the page-mapping-host in this transient state, and we ensured that no lower f/s was somehow caching the temporarily changed value of page-mapping-host for later use. But, mucking with the pointers in this manner is kinda ugly, to say the least. Still, I'd love to find a clean and simple way that two layers can share the same struct page and cleanly pass the upper page to a lower f/s. If you've got suggestions how I can handle unionfs_write more cleanly, or comments on the above possibilities, I'd love to hear them. Should I be doing something different when unionfs stacks on top of tmpfs? I think not. (BTW, this is probably also relevant to ecryptfs.) 
> You're both agreed on that, but I don't see how: ecryptfs writes the lower > level via vfs_write, it's not using the lower level's writepage, is it? Yup. ecryptfs no longer does that: it recently changed things and now it stores an open struct file in its inode, so it can always pass the file to vfs_write. This nicely avoids calling the lower writepage, but one has to keep an open file for every inode. Neither of the solutions currently employed by unionfs and ecryptfs seems really satisfactory (clean and efficient). > Hugh Thanks, Erez. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, "Pekka Enberg" writes: > Hi, > > On 10/15/07, Erez Zadok <[EMAIL PROTECTED]> wrote: > > Pekka, with a small change to your patch (to handle time-based cache > > coherency), your patch worked well and passed all my tests. Thanks. > > > > So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE > > from being returned to userland. I guess we still need it, b/c even with > > your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to > > the VFS and we need to ensure that doesn't "leak" outside the kernel. > > I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that > ->writepage() will never return AOP_WRITEPAGE_ACTIVATE for > !wbc->for_reclaim case which would explain why we haven't hit this bug > before. Hugh, Andrew? > > And btw, I think we need to fix ecryptfs too. Yes, ecryptfs needs this fix too (and probably a couple of other mmap fixes I've made to unionfs recently -- Mike Halcrow already knows :-) Of course, running ecryptfs on top of tmpfs is somewhat odd and uncommon; but with unionfs, users use tmpfs as the copyup branch very often. >Pekka Erez. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi, On 10/15/07, Erez Zadok <[EMAIL PROTECTED]> wrote: > Pekka, with a small change to your patch (to handle time-based cache > coherency), your patch worked well and passed all my tests. Thanks. > > So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE > from being returned to userland. I guess we still need it, b/c even with > your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to > the VFS and we need to ensure that doesn't "leak" outside the kernel. I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that ->writepage() will never return AOP_WRITEPAGE_ACTIVATE for !wbc->for_reclaim case which would explain why we haven't hit this bug before. Hugh, Andrew? And btw, I think we need to fix ecryptfs too. Pekka - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, Pekka J Enberg writes: > Hi Erez, > > On Sun, 14 Oct 2007, Erez Zadok wrote: > > In unionfs_writepage() I tried to emulate as best possible what the lower > > f/s will have returned to the VFS. Since tmpfs's ->writepage can return > > AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in > > unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE. > > > > Should I be doing something different when unionfs stacks on top of tmpfs? > > (BTW, this is probably also relevant to ecryptfs.) > > Look at mm/filemap.c:__filemap_fdatawrite_range(). You shouldn't be > calling unionfs_writepage() _at all_ if the lower mapping has > BDI_CAP_NO_WRITEBACK capability set. Perhaps something like the totally > untested patch below? > > Pekka [...] Pekka, with a small change to your patch (to handle time-based cache coherency), your patch worked well and passed all my tests. Thanks. So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE from being returned to userland. I guess we still need it, b/c even with your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to the VFS and we need to ensure that doesn't "leak" outside the kernel. Erez. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Erez, On Sun, 14 Oct 2007, Erez Zadok wrote: > In unionfs_writepage() I tried to emulate as best possible what the lower > f/s will have returned to the VFS. Since tmpfs's ->writepage can return > AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in > unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE. > > Should I be doing something different when unionfs stacks on top of tmpfs? > (BTW, this is probably also relevant to ecryptfs.) Look at mm/filemap.c:__filemap_fdatawrite_range(). You shouldn't be calling unionfs_writepage() _at all_ if the lower mapping has BDI_CAP_NO_WRITEBACK capability set. Perhaps something like the totally untested patch below? Pekka --- fs/unionfs/mmap.c | 17 + 1 file changed, 17 insertions(+) Index: linux-2.6.23-rc8/fs/unionfs/mmap.c === --- linux-2.6.23-rc8.orig/fs/unionfs/mmap.c +++ linux-2.6.23-rc8/fs/unionfs/mmap.c @@ -17,6 +17,7 @@ * published by the Free Software Foundation. */ +#include #include "union.h" /* @@ -144,6 +145,21 @@ out: return err; } +static int unionfs_writepages(struct address_space *mapping, + struct writeback_control *wbc) +{ + struct inode *lower_inode; + struct inode *inode; + + inode = mapping->host; + lower_inode = unionfs_lower_inode(inode); + + if (!mapping_cap_writeback_dirty(lower_inode->i_mapping)) + return 0; + + return generic_writepages(mapping, wbc); +} + /* * readpage is called from generic_page_read and the fault handler. 
* If your file system uses generic_page_read for the read op, it @@ -371,6 +387,7 @@ out: struct address_space_operations unionfs_aops = { .writepage = unionfs_writepage, + .writepages = unionfs_writepages, .readpage = unionfs_readpage, .prepare_write = unionfs_prepare_write, .commit_write = unionfs_commit_write, - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, "Pekka Enberg" writes: > Hi Hugh, > > On Sat, 13 Oct 2007, Pekka Enberg wrote: > > Doesn't msync(2) get to it via mm/page-writeback.c:write_cache_pages() > > without unionfs even? > > On 10/14/07, Hugh Dickins <[EMAIL PROTECTED]> wrote: > > I believe not. Please do double-check my assertions, I've always found > > the _writepages paths rather twisty; but my belief (supported by the > > fact that we've not hit shmem_writepage's BUG_ON(page_mapped(page)) > > in five years) is that tmpfs/shmem opts out of all of that with its > > .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK, > > in shmem_backing_dev_info, which avoids all those _writepages avenues > > (e.g. through bdi_cap_writeback_dirty tests), and write_cache_pages is > > just a subfunction of the _writepages. > > Thanks for the explanation, you're obviously correct. > > However, I don't think the mapping_cap_writeback_dirty() check in > __filemap_fdatawrite_range() works as expected when tmpfs is a lower > mount for an unionfs mount. There's no BDI_CAP_NO_WRITEBACK capability > for unionfs mappings so do_fsync() will call write_cache_pages() that > unconditionally invokes shmem_writepage() via unionfs_writepage(). > Unless, of course, there's some other unionfs magic I am missing. > >Pekka In unionfs_writepage() I tried to emulate as best possible what the lower f/s will have returned to the VFS. Since tmpfs's ->writepage can return AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE. Should I be doing something different when unionfs stacks on top of tmpfs? (BTW, this is probably also relevant to ecryptfs.) Thanks, Erez. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Hugh, On Sat, 13 Oct 2007, Pekka Enberg wrote: > Doesn't msync(2) get to it via mm/page-writeback.c:write_cache_pages() > without unionfs even? On 10/14/07, Hugh Dickins <[EMAIL PROTECTED]> wrote: > I believe not. Please do double-check my assertions, I've always found > the _writepages paths rather twisty; but my belief (supported by the > fact that we've not hit shmem_writepage's BUG_ON(page_mapped(page)) > in five years) is that tmpfs/shmem opts out of all of that with its > .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK, > in shmem_backing_dev_info, which avoids all those _writepages avenues > (e.g. through bdi_cap_writeback_dirty tests), and write_cache_pages is > just a subfunction of the _writepages. Thanks for the explanation, you're obviously correct. However, I don't think the mapping_cap_writeback_dirty() check in __filemap_fdatawrite_range() works as expected when tmpfs is a lower mount for an unionfs mount. There's no BDI_CAP_NO_WRITEBACK capability for unionfs mappings so do_fsync() will call write_cache_pages() that unconditionally invokes shmem_writepage() via unionfs_writepage(). Unless, of course, there's some other unionfs magic I am missing. Pekka - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Sat, 13 Oct 2007, Pekka Enberg wrote: > On 10/12/07, Hugh Dickins <[EMAIL PROTECTED]> wrote: > > But I keep suspecting that the answer might be the patch below (which > > rather follows what drivers/block/rd.c is doing). I'm especially > > worried that, rather than just AOP_WRITEPAGE_ACTIVATE being returned > > to userspace, bad enough in itself, you might be liable to hit that > > BUG_ON(page_mapped(page)). shmem_writepage does not expect to be > > called by anyone outside mm/vmscan.c, but unionfs can now get to it? > > Doesn't msync(2) get to it via mm/page-writeback.c:write_cache_pages() > without unionfs even? I believe not. Please do double-check my assertions, I've always found the _writepages paths rather twisty; but my belief (supported by the fact that we've not hit shmem_writepage's BUG_ON(page_mapped(page)) in five years) is that tmpfs/shmem opts out of all of that with its .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK, in shmem_backing_dev_info, which avoids all those _writepages avenues (e.g. through bdi_cap_writeback_dirty tests), and write_cache_pages is just a subfunction of the _writepages. So, while I don't disagree with your patch to write_cache_pages (though it wasn't clear to me whether it should break from or continue the loop if it ever does meet an AOP_WRITEPAGE_ACTIVATE), I don't think that's really the root of the problem. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Hugh, On 10/12/07, Hugh Dickins <[EMAIL PROTECTED]> wrote: > But I keep suspecting that the answer might be the patch below (which > rather follows what drivers/block/rd.c is doing). I'm especially > worried that, rather than just AOP_WRITEPAGE_ACTIVATE being returned > to userspace, bad enough in itself, you might be liable to hit that > BUG_ON(page_mapped(page)). shmem_writepage does not expect to be > called by anyone outside mm/vmscan.c, but unionfs can now get to it? Doesn't msync(2) get to it via mm/page-writeback.c:write_cache_pages() without unionfs even? Pekka - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Thu, 11 Oct 2007, Ryan Finnie wrote: > On 10/11/07, Andrew Morton <[EMAIL PROTECTED]> wrote: > > shit. That's a nasty bug. Really userspace should be testing for -1, but > > the msync() library function should only ever return 0 or -1. > > > > Does this fix it? > > > > --- a/mm/page-writeback.c~a > > +++ a/mm/page-writeback.c > > @@ -850,8 +850,10 @@ retry: > > > > ret = (*writepage)(page, wbc, data); > > > > - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) > > + if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) { > > unlock_page(page); > > + ret = 0; > > + } > > if (ret || (--(wbc->nr_to_write) <= 0)) > > done = 1; > > if (wbc->nonblocking && bdi_write_congested(bdi)) { > > _ > > > > Pekka Enberg replied with an identical patch a few days ago, but for > some reason the same condition flows up to msync as -1 EIO instead of > AOP_WRITEPAGE_ACTIVATE with that patch applied. The last part of the > thread is below. Thanks. Each time I sit down to follow what's going on with writepage and unionfs and msync, I get distracted: I really haven't researched this properly. But I keep suspecting that the answer might be the patch below (which rather follows what drivers/block/rd.c is doing). I'm especially worried that, rather than just AOP_WRITEPAGE_ACTIVATE being returned to userspace, bad enough in itself, you might be liable to hit that BUG_ON(page_mapped(page)). shmem_writepage does not expect to be called by anyone outside mm/vmscan.c, but unionfs can now get to it? Please let us know if this patch does fix it: then I'll try harder to work out what goes on. 
Thanks, Hugh --- 2.6.23/mm/shmem.c 2007-10-09 21:31:38.0 +0100 +++ linux/mm/shmem.c2007-10-12 01:25:46.0 +0100 @@ -916,6 +916,11 @@ static int shmem_writepage(struct page * struct inode *inode; BUG_ON(!PageLocked(page)); + if (!wbc->for_reclaim) { + set_page_dirty(page); + unlock_page(page); + return 0; + } BUG_ON(page_mapped(page)); mapping = page->mapping; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On 10/11/07, Andrew Morton <[EMAIL PROTECTED]> wrote: > shit. That's a nasty bug. Really userspace should be testing for -1, but > the msync() library function should only ever return 0 or -1. > > Does this fix it? > > --- a/mm/page-writeback.c~a > +++ a/mm/page-writeback.c > @@ -850,8 +850,10 @@ retry: > > ret = (*writepage)(page, wbc, data); > > - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) > + if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) { > unlock_page(page); > + ret = 0; > + } > if (ret || (--(wbc->nr_to_write) <= 0)) > done = 1; > if (wbc->nonblocking && bdi_write_congested(bdi)) { > _ > Pekka Enberg replied with an identical patch a few days ago, but for some reason the same condition flows up to msync as -1 EIO instead of AOP_WRITEPAGE_ACTIVATE with that patch applied. The last part of the thread is below. Thanks. Ryan On 10/7/07, Ryan Finnie <[EMAIL PROTECTED]> wrote: > On 10/7/07, Pekka J Enberg <[EMAIL PROTECTED]> wrote: > > On 10/7/07, Erez Zadok <[EMAIL PROTECTED]> wrote: > > > Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes > > > returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland. > > > Therefore, some user programs fail, esp. if they're written such as > > > this: > > > ... > > It's a kernel bug. AOP_WRITEPAGE_ACTIVATE is a hint to the VM to avoid > > writeback of the page in the near future. I wonder if it's enough that we > > change the return value to zero from > > mm/page-writeback.c:write_cache_pages() in case we hit > > AOP_WRITEPAGE_ACTIVE... > > Doesn't appear to be enough. I can't figure out why (since it appears > write_cache_pages bubbles up directly to sys_msync), but with that > patch applied, in my test case[1], msync returns -1 EIO. However, > with the exact same kernel without that patch applied, msync returns > 524288 (AOP_WRITEPAGE_ACTIVATE). But as your patch specifically flips > 524288 to 0, I can't figure out how it eventually returns -1 EIO. 
> > Ryan > > [1] "apt-get check" on a unionfs2 mount backed by tmpfs over cdrom, > standard livecd setup > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Sun, 7 Oct 2007 15:20:19 -0400 Erez Zadok <[EMAIL PROTECTED]> wrote: > According to vfs.txt, ->writepage() may return AOP_WRITEPAGE_ACTIVATE back > to the VFS/VM. Indeed some filesystems such as tmpfs can return > AOP_WRITEPAGE_ACTIVATE; and stackable file systems (e.g., Unionfs) also > return AOP_WRITEPAGE_ACTIVATE if the lower f/s returned it. > > Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes > returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland. > Therefore, some user programs fail, esp. if they're written such as this: > > err = msync(...); > if (err != 0) > // fail > > They temporarily fixed the specific program in question (apt-get) to check > > if (err < 0) > // fail > > Is this a bug indeed, or are user programs supposed to handle > AOP_WRITEPAGE_ACTIVATE (I hope not the latter). If it's a kernel bug, what > should the kernel return: a zero, or an -errno (and which one)? > shit. That's a nasty bug. Really userspace should be testing for -1, but the msync() library function should only ever return 0 or -1. Does this fix it? --- a/mm/page-writeback.c~a +++ a/mm/page-writeback.c @@ -850,8 +850,10 @@ retry: ret = (*writepage)(page, wbc, data); - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) + if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) { unlock_page(page); + ret = 0; + } if (ret || (--(wbc->nr_to_write) <= 0)) done = 1; if (wbc->nonblocking && bdi_write_congested(bdi)) { _ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Sun, 7 Oct 2007 15:20:19 -0400 Erez Zadok [EMAIL PROTECTED] wrote: According to vfs.txt, -writepage() may return AOP_WRITEPAGE_ACTIVATE back to the VFS/VM. Indeed some filesystems such as tmpfs can return AOP_WRITEPAGE_ACTIVATE; and stackable file systems (e.g., Unionfs) also return AOP_WRITEPAGE_ACTIVATE if the lower f/s returned it. Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland. Therefore, some user programs fail, esp. if they're written such as this: err = msync(...); if (err != 0) // fail They temporarily fixed the specific program in question (apt-get) to check if (err 0) // fail Is this a bug indeed, or are user programs supposed to handle AOP_WRITEPAGE_ACTIVATE (I hope not the latter). If it's a kernel bug, what should the kernel return: a zero, or an -errno (and which one)? shit. That's a nasty bug. Really userspace should be testing for -1, but the msync() library function should only ever return 0 or -1. Does this fix it? --- a/mm/page-writeback.c~a +++ a/mm/page-writeback.c @@ -850,8 +850,10 @@ retry: ret = (*writepage)(page, wbc, data); - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) + if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) { unlock_page(page); + ret = 0; + } if (ret || (--(wbc-nr_to_write) = 0)) done = 1; if (wbc-nonblocking bdi_write_congested(bdi)) { _ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On 10/11/07, Andrew Morton <[EMAIL PROTECTED]> wrote: > shit. That's a nasty bug. > > Really userspace should be testing for -1, but the msync() library function > should only ever return 0 or -1. > > Does this fix it? > > --- a/mm/page-writeback.c~a > +++ a/mm/page-writeback.c > @@ -850,8 +850,10 @@ retry: > ret = (*writepage)(page, wbc, data); > - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) > + if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) { > unlock_page(page); > + ret = 0; > + } > if (ret || (--(wbc->nr_to_write) <= 0)) > done = 1; > if (wbc->nonblocking && bdi_write_congested(bdi)) { > _ Pekka Enberg replied with an identical patch a few days ago, but for some reason the same condition flows up to msync as -1 EIO instead of AOP_WRITEPAGE_ACTIVATE with that patch applied. The last part of the thread is below. Thanks. Ryan On 10/7/07, Ryan Finnie <[EMAIL PROTECTED]> wrote: > On 10/7/07, Pekka J Enberg <[EMAIL PROTECTED]> wrote: > > On 10/7/07, Erez Zadok <[EMAIL PROTECTED]> wrote: > > > Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes > > > returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland. > > > Therefore, some user programs fail, esp. if they're written such as this: > > ... > > It's a kernel bug. AOP_WRITEPAGE_ACTIVATE is a hint to the VM to avoid > > writeback of the page in the near future. I wonder if it's enough that we > > change the return value to zero from > > mm/page-writeback.c:write_cache_pages() in case we hit AOP_WRITEPAGE_ACTIVE... > Doesn't appear to be enough. I can't figure out why (since it appears > write_cache_pages bubbles up directly to sys_msync), but with that > patch applied, in my test case[1], msync returns -1 EIO. However, > with the exact same kernel without that patch applied, msync returns > 524288 (AOP_WRITEPAGE_ACTIVATE). But as your patch specifically flips > 524288 to 0, I can't figure out how it eventually returns -1 EIO.
> Ryan > [1] "apt-get check" on a unionfs2 mount backed by tmpfs over cdrom, > standard livecd setup - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Thu, 11 Oct 2007, Ryan Finnie wrote: > On 10/11/07, Andrew Morton <[EMAIL PROTECTED]> wrote: > > shit. That's a nasty bug. > > > > Really userspace should be testing for -1, but the msync() library function > > should only ever return 0 or -1. > > > > Does this fix it? > > > > --- a/mm/page-writeback.c~a > > +++ a/mm/page-writeback.c > > @@ -850,8 +850,10 @@ retry: > > ret = (*writepage)(page, wbc, data); > > - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) > > + if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) { > > unlock_page(page); > > + ret = 0; > > + } > > if (ret || (--(wbc->nr_to_write) <= 0)) > > done = 1; > > if (wbc->nonblocking && bdi_write_congested(bdi)) { > > _ > Pekka Enberg replied with an identical patch a few days ago, but for > some reason the same condition flows up to msync as -1 EIO instead of > AOP_WRITEPAGE_ACTIVATE with that patch applied. The last part of the > thread is below. Thanks. Each time I sit down to follow what's going on with writepage and unionfs and msync, I get distracted: I really haven't researched this properly. But I keep suspecting that the answer might be the patch below (which rather follows what drivers/block/rd.c is doing). I'm especially worried that, rather than just AOP_WRITEPAGE_ACTIVATE being returned to userspace, bad enough in itself, you might be liable to hit that BUG_ON(page_mapped(page)). shmem_writepage does not expect to be called by anyone outside mm/vmscan.c, but unionfs can now get to it? Please let us know if this patch does fix it: then I'll try harder to work out what goes on.
Thanks, Hugh --- 2.6.23/mm/shmem.c 2007-10-09 21:31:38.0 +0100 +++ linux/mm/shmem.c2007-10-12 01:25:46.0 +0100 @@ -916,6 +916,11 @@ static int shmem_writepage(struct page * struct inode *inode; BUG_ON(!PageLocked(page)); + if (!wbc->for_reclaim) { + set_page_dirty(page); + unlock_page(page); + return 0; + } BUG_ON(page_mapped(page)); mapping = page->mapping; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Ryan, On 10/8/07, Ryan Finnie <[EMAIL PROTECTED]> wrote: > Doesn't appear to be enough. I can't figure out why (since it appears > write_cache_pages bubbles up directly to sys_msync), but with that > patch applied, in my test case[1], msync returns -1 EIO. However, > with the exact same kernel without that patch applied, msync returns > 524288 (AOP_WRITEPAGE_ACTIVATE). But as your patch specifically flips > 524288 to 0, I can't figure out how it eventually returns -1 EIO. > > [1] "apt-get check" on a unionfs2 mount backed by tmpfs over cdrom, > standard livecd setup You have swap device disabled, right? If so, I can't see any reason why msync(2) on tmpfs would return -EIO. Can you please send a strace log for your test case? Pekka - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On 10/7/07, Pekka J Enberg <[EMAIL PROTECTED]> wrote: > On 10/7/07, Erez Zadok <[EMAIL PROTECTED]> wrote: > > Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes > > returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland. > > Therefore, some user programs fail, esp. if they're written such as > > this: > ... > It's a kernel bug. AOP_WRITEPAGE_ACTIVATE is a hint to the VM to avoid > writeback of the page in the near future. I wonder if it's enough that we > change the return value to zero from > mm/page-writeback.c:write_cache_pages() in case we hit AOP_WRITEPAGE_ACTIVE... Doesn't appear to be enough. I can't figure out why (since it appears write_cache_pages bubbles up directly to sys_msync), but with that patch applied, in my test case[1], msync returns -1 EIO. However, with the exact same kernel without that patch applied, msync returns 524288 (AOP_WRITEPAGE_ACTIVATE). But as your patch specifically flips 524288 to 0, I can't figure out how it eventually returns -1 EIO. Ryan [1] "apt-get check" on a unionfs2 mount backed by tmpfs over cdrom, standard livecd setup - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Erez, On 10/7/07, Erez Zadok <[EMAIL PROTECTED]> wrote: > Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes > returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland. > Therefore, some user programs fail, esp. if they're written such as > this: [snip] On 10/7/07, Erez Zadok <[EMAIL PROTECTED]> wrote: > Is this a bug indeed, or are user programs supposed to handle > AOP_WRITEPAGE_ACTIVATE (I hope not the latter). If it's a kernel bug, > what should the kernel return: a zero, or an -errno (and which one)? It's a kernel bug. AOP_WRITEPAGE_ACTIVATE is a hint to the VM to avoid writeback of the page in the near future. I wonder if it's enough that we change the return value to zero from mm/page-writeback.c:write_cache_pages() in case we hit AOP_WRITEPAGE_ACTIVE... Pekka diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 63512a9..717f341 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -672,8 +672,10 @@ retry: ret = (*writepage)(page, wbc, data); - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) + if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) { unlock_page(page); + ret = 0; + } if (ret || (--(wbc->nr_to_write) <= 0)) done = 1; if (wbc->nonblocking && bdi_write_congested(bdi)) {
msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
According to vfs.txt, ->writepage() may return AOP_WRITEPAGE_ACTIVATE back to the VFS/VM. Indeed some filesystems such as tmpfs can return AOP_WRITEPAGE_ACTIVATE; and stackable file systems (e.g., Unionfs) also return AOP_WRITEPAGE_ACTIVATE if the lower f/s returned it. Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland. Therefore, some user programs fail, esp. if they're written such as this: err = msync(...); if (err != 0) // fail They temporarily fixed the specific program in question (apt-get) to check if (err < 0) // fail Is this a bug indeed, or are user programs supposed to handle AOP_WRITEPAGE_ACTIVATE (I hope not the latter). If it's a kernel bug, what should the kernel return: a zero, or an -errno (and which one)? Thanks, Erez. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/