Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-06 Thread Anton Altaparmakov
On Tue, 2007-02-06 at 03:09 +0100, Nick Piggin wrote:
> On Sun, Feb 04, 2007 at 05:40:35PM +0000, Anton Altaparmakov wrote:
> > On Sun, 4 Feb 2007, Andrew Morton wrote:
> > > truncate's OK: we're holding i_mutex.
> > 
> > How about excluding readpage() (in addition to truncate if Nick is right  
> > and some cases of truncate do not hold i_mutex) with an extra page flag as
> > I proposed for truncate exclusion?  Then it would not matter that
> > prepare_write might have allocated blocks and might expose stale data.
> > It would go to sleep and wait on the bit to be cleared instead of trying  
> > to bring the page uptodate.  It can then lock the page and either find it 
> > uptodate (because commit_write did it) or not and then bring it uptodate.
> > 
> > Then we could safely fault in the page, copy from it into a temporary 
> > page, then lock the destination page again and copy into it.
> > 
> > This is getting more involved as a patch again...  )-:  But at least it   
> > does not affect the common case except for having to check the new page 
> > flag in every readpage() and truncate() call.  But at least the checks 
> > could be with an "if (unlikely(newpageflag()))" so should not be too bad.
> > 
> > Have I missed anything this time?
> 
> Yes. If you have a flag to exclude readpage(), then you must also
> exclude filemap_nopage, in which case it is still deadlocky.

Ouch, you are of course right.  )-:

Best regards,

Anton
-- 
Anton Altaparmakov  (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/



Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-05 Thread Nick Piggin
On Tue, Feb 06, 2007 at 06:49:05AM +0100, Nick Piggin wrote:
> > - If the get_user() doesn't fault, and if we're copying from and to the
> >   same page, we know that we've locked it, so nobody will be able to unmap
> >   it while we're copying from it.
> > 
> > Close, but no cigar!  This is still vulnerable to Hugh's ab/ba deadlock
> > scenario.
> 
> Yes I considered this too. Hard isn't it?

BTW, there are two different abba deadlocks. It's all documented in the
patch 9/9 description.


Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-05 Thread Nick Piggin
On Mon, Feb 05, 2007 at 09:30:06PM -0800, Andrew Morton wrote:
> On Tue, 6 Feb 2007 05:41:46 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > > 
> > > Not necessarily -- they could read from one part of a page and write to
> > > another. I see this as the biggest data corruption problem.
> 
> The kernel gets that sort of thing wrong anyway, and always has, because
> it uses memcpy()-style copying and not memmove()-style.
> 
> I can't imagine what sort of application you're envisaging here.  The
> problem was only ever observed from userspace by an artificial stress-test
> thing.

No, I'm not talking about writing into a page with memory from the same
page. I'm talking about one process writing to part of a file, and another
reading from that same file (different offset).

If they happen to be covered by the same page, then the reader can see
zeroes.

I'm not envisaging any sort of application, all I know is that there are
several (related) data corruption bugs and I'm trying to fix them (and
fix these deadlock problems without introducing more).

> > And in fact, it is not just transient errors either. This problem can
> > add permanent corruption into the pagecache and onto disk, and it doesn't
> > even require two processes to race.
> > 
> > After zeroing out the uncopied part of the page, and attempting to loop
> > again, we might bail out of the loop for any reason before completing the
> > rest of the copy, leaving the pagecache corrupted, which will soon go out
> > to disk.
> > 
> 
> Only because ->commit_write() went and incorrectly marked parts of the page
> as up-to-date.
> 
> Zeroing out the fag end of the copy_from_user() on fault is actually 
> incorrect. 

Yes, I know.

> What we _should_ do is to bring those uncopyable, non-uptodate parts of the
> page uptodate rather than zeroing them.  ->readpage() does that.
> 
> So...  what happens if we do
> 
>   lock_page()
>   prepare_write()
>   if (copy_from_user_atomic()) {
>   readpage()
>   wait_on_page()
>   lock_page()
>   }
>   commit_write()
>   unlock_page()
> 
> - If the page has no buffers then it is either fully uptodate or fully
>   not uptodate.  In the former case, don't call readpage at all.  In the
>   latter case, readpage() is the correct thing to do.
> 
> - If the page had buffers, then readpage() won't touch the uptodate ones
>   and will bring the non-uptodate ones up to date from disk.
> 
>   Some of the data which we copied from userspace may get overwritten
>   from backing store, but that's OK.
> 
> seems crazy, but it's close.  We do have the minor problem that readpage
> went and unlocked the page so we need to relock it.  I bet there are holes
> in there.

Yes, I tried doing this as well and there are holes in it. Even supposing
that we add a readpage_dontunlock, there is still the issue of breaking
the filesystem API by nesting readpage inside prepare_write. You also
do need to zero newly allocated blocks, for example.

> Idea #42: after we've locked the pagecache page, do an atomic get_user()
> against the source page(s) before attempting the copy_from_user().  If that
> faults, don't run prepare_write or anything else: drop the page lock and
> try again.
> 
> Because
> 
> - If the get_user() faults, it might be because the page we're copying
>   from and to is the same page, and someone went and unmapped it: deadlock.
> 
> - If the get_user() doesn't fault, and if we're copying from and to the
>   same page, we know that we've locked it, so nobody will be able to unmap
>   it while we're copying from it.
> 
> Close, but no cigar!  This is still vulnerable to Hugh's ab/ba deadlock
> scenario.

Yes I considered this too. Hard isn't it?

> btw, to fix the writev() performance problem we may need to go off and run
> get_user() against up to 1024 separate user pages before locking the
> pagecache page, which sounds fairly idiotic.  Are you doing that in the
> implementations which you've been working on?  I forget...

No, because in my fix it can do non-atomic usercopies for !uptodate pages.

For uptodate pages, yes there is a possibility that it may do a short copy
and have to retry, but it is probably safe to bet that the source data is
fairly recently accessed in most cases, so a short copy will be unlikely.
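
For illustration, that uptodate fast path might be sketched roughly as below
(copy_to_uptodate_page() is an invented name, and the prepare_write()/
commit_write() bracketing is left out for brevity):

#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/uaccess.h>

/*
 * Illustrative sketch only: copy into an already-uptodate pagecache page
 * without taking a pagefault while the page lock is held.  A short copy
 * is tolerable here because the page is uptodate, so the caller can
 * still commit the full range and simply retry the uncopied tail.
 */
static size_t copy_to_uptodate_page(struct page *page, unsigned long offset,
				    const char __user *buf, size_t bytes)
{
	char *kaddr;
	size_t left;

	/* Best effort: touch the source before taking the page lock. */
	fault_in_pages_readable(buf, bytes);

	lock_page(page);
	kaddr = kmap_atomic(page, KM_USER0);
	/* Atomic copy: must not sleep or fault under the page lock. */
	left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
	kunmap_atomic(kaddr, KM_USER0);
	unlock_page(page);

	return bytes - left;		/* bytes actually copied */
}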


Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-05 Thread Andrew Morton
On Tue, 6 Feb 2007 05:41:46 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:

> On Tue, Feb 06, 2007 at 03:25:49AM +0100, Nick Piggin wrote:
> > On Sun, Feb 04, 2007 at 10:36:20AM -0800, Andrew Morton wrote:
> > > On Sun, 4 Feb 2007 16:10:51 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > > 
> > > > They're not likely to hit the deadlocks, either. Probability gets more
> > > > likely after my patch to lock the page in the fault path. But 
> > > > practically,
> > > > we could live without that too, because the data corruption it fixes is
> > > > very rare as well. Which is exactly what we've been doing quite happily
> > > > for most of 2.6, including all distro kernels (I think).
> > > 
> > > Thing is, an application which is relying on the contents of that page is
> > > already unreliable (or really peculiar), because it can get indeterminate
> > > results anyway.
> > 
> > Not necessarily -- they could read from one part of a page and write to
> > another. I see this as the biggest data corruption problem.

The kernel gets that sort of thing wrong anyway, and always has, because
it uses memcpy()-style copying and not memmove()-style.

I can't imagine what sort of application you're envisaging here.  The
problem was only ever observed from userspace by an artificial stress-test
thing.

> And in fact, it is not just transient errors either. This problem can
> add permanent corruption into the pagecache and onto disk, and it doesn't
> even require two processes to race.
> 
> After zeroing out the uncopied part of the page, and attempting to loop
> again, we might bail out of the loop for any reason before completing the
> rest of the copy, leaving the pagecache corrupted, which will soon go out
> to disk.
> 

Only because ->commit_write() went and incorrectly marked parts of the page
as up-to-date.

Zeroing out the fag end of the copy_from_user() on fault is actually incorrect. 
What we _should_ do is to bring those uncopyable, non-uptodate parts of the
page uptodate rather than zeroing them.  ->readpage() does that.

So...  what happens if we do

lock_page()
prepare_write()
if (copy_from_user_atomic()) {
readpage()
wait_on_page()
lock_page()
}
commit_write()
unlock_page()

- If the page has no buffers then it is either fully uptodate or fully
  not uptodate.  In the former case, don't call readpage at all.  In the
  latter case, readpage() is the correct thing to do.

- If the page had buffers, then readpage() won't touch the uptodate ones
  and will bring the non-uptodate ones up to date from disk.

  Some of the data which we copied from userspace may get overwritten
  from backing store, but that's OK.

seems crazy, but it's close.  We do have the minor problem that readpage
went and unlocked the page so we need to relock it.  I bet there are holes
in there.
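
Spelled out slightly more concretely, that sequence might look like the sketch
below (purely illustrative, assuming an ->a_ops with prepare_write, commit_write
and readpage; as noted above it still has holes, for example around readpage()
dropping the page lock):

#include <linux/fs.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/uaccess.h>

static int write_chunk_with_readpage_fallback(struct file *file,
					      struct page *page,
					      unsigned offset, unsigned bytes,
					      const char __user *buf)
{
	const struct address_space_operations *aops = page->mapping->a_ops;
	char *kaddr;
	size_t left;
	int ret;

	lock_page(page);
	ret = aops->prepare_write(file, page, offset, offset + bytes);
	if (ret)
		goto out_unlock;

	kaddr = kmap_atomic(page, KM_USER0);
	left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
	kunmap_atomic(kaddr, KM_USER0);

	if (unlikely(left)) {
		/*
		 * Source got unmapped under us: bring the non-uptodate
		 * parts of the page uptodate from disk instead of zeroing
		 * them.  readpage() unlocks the page, hence the relock.
		 */
		ret = aops->readpage(file, page);
		if (ret)
			return ret;
		wait_on_page_locked(page);
		lock_page(page);
	}

	ret = aops->commit_write(file, page, offset, offset + bytes);
	/* The caller re-faults the source and retries any uncopied tail. */
out_unlock:
	unlock_page(page);
	return ret;
}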




Idea #42: after we've locked the pagecache page, do an atomic get_user()
against the source page(s) before attempting the copy_from_user().  If that
faults, don't run prepare_write or anything else: drop the page lock and
try again.

Because

- If the get_user() faults, it might be because the page we're copying
  from and to is the same page, and someone went and unmapped it: deadlock.

- If the get_user() doesn't fault, and if we're copying from and to the
  same page, we know that we've locked it, so nobody will be able to unmap
  it while we're copying from it.

Close, but no cigar!  This is still vulnerable to Hugh's ab/ba deadlock
scenario.
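
As a sketch, the probe could be as small as a single-byte read with pagefaults
disabled, done right after taking the page lock and before prepare_write()
(illustrative only, and it does nothing about Hugh's ab/ba case):

#include <linux/uaccess.h>

/*
 * Probe the source with pagefaults disabled: returns 0 if the byte is
 * currently readable without faulting, -EFAULT otherwise.  On -EFAULT
 * the caller drops the page lock, faults the source back in, and
 * retries from the top.
 */
static int probe_source_mapped(const char __user *buf)
{
	char c;
	int ret;

	pagefault_disable();
	ret = __get_user(c, buf);	/* must not sleep: page lock is held */
	pagefault_enable();

	return ret;
}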


btw, to fix the writev() performance problem we may need to go off and run
get_user() against up to 1024 separate user pages before locking the
pagecache page, which sounds fairly idiotic.  Are you doing that in the
implementations which you've been working on?  I forget...


Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-05 Thread Nick Piggin
On Tue, Feb 06, 2007 at 03:25:49AM +0100, Nick Piggin wrote:
> On Sun, Feb 04, 2007 at 10:36:20AM -0800, Andrew Morton wrote:
> > On Sun, 4 Feb 2007 16:10:51 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > 
> > > They're not likely to hit the deadlocks, either. Probability gets more
> > > likely after my patch to lock the page in the fault path. But practically,
> > > we could live without that too, because the data corruption it fixes is
> > > very rare as well. Which is exactly what we've been doing quite happily
> > > for most of 2.6, including all distro kernels (I think).
> > 
> > Thing is, an application which is relying on the contents of that page is
> > already unreliable (or really peculiar), because it can get indeterminate
> > results anyway.
> 
> Not necessarily -- they could read from one part of a page and write to
> another. I see this as the biggest data corruption problem.

And in fact, it is not just transient errors either. This problem can
add permanent corruption into the pagecache and onto disk, and it doesn't
even require two processes to race.

After zeroing out the uncopied part of the page, and attempting to loop
again, we might bail out of the loop for any reason before completing the
rest of the copy, leaving the pagecache corrupted, which will soon go out
to disk.

Nick


Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-05 Thread Nick Piggin
On Sun, Feb 04, 2007 at 10:36:20AM -0800, Andrew Morton wrote:
> On Sun, 4 Feb 2007 16:10:51 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > They're not likely to hit the deadlocks, either. Probability gets more
> > > likely after my patch to lock the page in the fault path. But practically,
> > we could live without that too, because the data corruption it fixes is
> > very rare as well. Which is exactly what we've been doing quite happily
> > for most of 2.6, including all distro kernels (I think).
> 
> Thing is, an application which is relying on the contents of that page is
> already unreliable (or really peculiar), because it can get indeterminate
> results anyway.

Not necessarily -- they could read from one part of a page and write to
another. I see this as the biggest data corruption problem.

But even in the case where they can get indeterminate results, they can
still determine what the results *won't* be. Eg. they might use a single
byte for a flag or something.

> > ...
> > 
> > On a P4 Xeon, SMP kernel, on a tmpfs filesystem, a 1GB dd if=/dev/zero write
> > had the following performance (higher is worse):
> > 
> > Orig kernel New kernel
> > new file (no pagecache)
> > 4K  blocks 1.280s   1.287s (+0.5%)
> > 64K blocks 1.090s   1.105s (+1.4%)
> > notrunc (uptodate pagecache)
> > 4K  blocks 0.976s   1.001s (+0.5%)
> > 64K blocks 0.780s   0.792s (+1.5%)
> > 
> > [numbers are better than +/- 0.005]
> > 
> > So we lose somewhere between half and one and a half of one percent
> > performance in a pagecache write intensive workload.
> 
> That's not too bad - caches are fast.  Did you look at optimising the
> handling of that temp page, ensure that we always use the same page?  I
> guess the page allocator per-cpu-pages thing is being good here.

Yeah it should be doing a reasonable job.

> I'm not sure how, though.  Park a copy in the task_struct, just as an
> experiment.  But that'd de-optimise multiple-tasks-writing-on-the-same-cpu.
> Maybe a per-cpu thing?  Largely duplicates the page allocator's per-cpu-pages.

Putting a copy in the task_struct won't do much I figure, except saving
a couple of interrupt enable/disables, and being more wasteful of memory
and cache-hotness.

Per-cpu doesn't work because we can't hold preempt off over the usercopy
(well, we *could* do it in a loop together with fault_in_pages, but that
just adds to the icache bloat).

> 
> Of course, we're also increasing cache footprint, which this test won't
> show.

We are indeed. At least we release the hot page back to the allocator
very quickly so that it can be reused.

The upshot is that your writev performance will be improved :)


Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-05 Thread Nick Piggin
On Sun, Feb 04, 2007 at 05:40:35PM +0000, Anton Altaparmakov wrote:
> On Sun, 4 Feb 2007, Andrew Morton wrote:
> > truncate's OK: we're holding i_mutex.
> 
> How about excluding readpage() (in addition to truncate if Nick is right  
> and some cases of truncate do not hold i_mutex) with an extra page flag as
> I proposed for truncate exclusion?  Then it would not matter that
> prepare_write might have allocated blocks and might expose stale data.
> It would go to sleep and wait on the bit to be cleared instead of trying  
> to bring the page uptodate.  It can then lock the page and either find it 
> uptodate (because commit_write did it) or not and then bring it uptodate.
> 
> Then we could safely fault in the page, copy from it into a temporary 
> page, then lock the destination page again and copy into it.
> 
> This is getting more involved as a patch again...  )-:  But at least it   
> does not affect the common case except for having to check the new page 
> flag in every readpage() and truncate() call.  But at least the checks 
> could be with an "if (unlikely(newpageflag()))" so should not be too bad.
> 
> Have I missed anything this time?

Yes. If you have a flag to exclude readpage(), then you must also
exclude filemap_nopage, in which case it is still deadlocky.



Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Andrew Morton
On Sun, 4 Feb 2007 16:10:51 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:

> On Sun, Feb 04, 2007 at 03:15:49AM -0800, Andrew Morton wrote:
> > On Sun, 4 Feb 2007 12:03:17 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > 
> > > On Sun, Feb 04, 2007 at 02:56:02AM -0800, Andrew Morton wrote:
> > > > On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > > > 
> > > > If that recollection is right, I think we could afford to reintroduce 
> > > > that
> > > > problem, frankly.  Especially as it only happens in the incredibly rare
> > > > case of that get_user()ed page getting unmapped under our feet.
> > > 
> > > Dang. I was hoping to fix it without introducing data corruption.
> > 
> > Well.  It's a compromise.  Being practical about it, I really doubt that
> > anyone will hit this combination of circumstances.
> 
> They're not likely to hit the deadlocks, either. Probability gets more
> likely after my patch to lock the page in the fault path. But practically,
> we could live without that too, because the data corruption it fixes is
> very rare as well. Which is exactly what we've been doing quite happily
> for most of 2.6, including all distro kernels (I think).

Thing is, an application which is relying on the contents of that page is
already unreliable (or really peculiar), because it can get indeterminate
results anyway.

> ...
> 
> On a P4 Xeon, SMP kernel, on a tmpfs filesystem, a 1GB dd if=/dev/zero write
> had the following performance (higher is worse):
> 
> Orig kernel   New kernel
> new file (no pagecache)
> 4K  blocks 1.280s 1.287s (+0.5%)
> 64K blocks 1.090s 1.105s (+1.4%)
> notrunc (uptodate pagecache)
> 4K  blocks 0.976s 1.001s (+0.5%)
> 64K blocks 0.780s 0.792s (+1.5%)
> 
> [numbers are better than +/- 0.005]
> 
> So we lose somewhere between half and one and a half of one percent
> performance in a pagecache write intensive workload.

That's not too bad - caches are fast.  Did you look at optimising the
handling of that temp page, ensure that we always use the same page?  I
guess the page allocator per-cpu-pages thing is being good here.

I'm not sure how, though.  Park a copy in the task_struct, just as an
experiment.  But that'd de-optimise multiple-tasks-writing-on-the-same-cpu.
Maybe a per-cpu thing?  Largely duplicates the page allocator's per-cpu-pages.

Of course, we're also increasing cache footprint, which this test won't
show.



Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Anton Altaparmakov
On Sun, 4 Feb 2007, Andrew Morton wrote:
> On Sun, 4 Feb 2007 10:59:58 +0000 (GMT) Anton Altaparmakov <[EMAIL 
> PROTECTED]> wrote:
> > On Sun, 4 Feb 2007, Andrew Morton wrote:
> > > On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <[EMAIL PROTECTED]> 
> > > wrote:
> > > > 2.  If we find the destination page is non uptodate, unlock it (this 
> > > > could be
> > > > made slightly more optimal), then find and pin the source page with
> > > > get_user_pages. Relock the destination page and continue with the 
> > > > copy.
> > > > However, instead of a usercopy (which might take a fault), copy the 
> > > > data
> > > > via the kernel address space.
> > > 
> > > argh.  We just can't go adding all this gunk into the write() path. 
> > > 
> > > mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every 
> > > page. 
> > > Even single-process write() will suffer, let alone multithreaded stuff,
> > > where mmap_sem contention may be the bigger problem.
> > > 
> > > I was going to do some quick measurements of this, but the code oopses
> > > on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)
> > > 
> > > We need to think different.
> > 
> > How about leaving the existing code with the following minor 
> > modifications:
> > 
> > Instead of calling filemap_copy_from_user{,_iovec}() do only the atomic 
> > bit with pagefaults disabled, i.e. instead of filemap_copy_from_user() we 
> > would do (could of course move into a helper function):
> > 
> > pagefault_disable()
> > kaddr = kmap_atomic(page, KM_USER0);
> > left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
> > kunmap_atomic(kaddr, KM_USER0);
> > pagefault_enable()
> > 
> > if (unlikely(left)) {
> > /* The user space page got unmapped before we got to copy it. */
> > ...
> > }
> > 
> > Thus 99.999% (or more!) of the time the code would just work as it 
> > always has and there is no bug and no speed impact.  Only in the very rare 
> > and hard to trigger race condition that the user space page after being 
> > faulted in got thrown out again before we did the atomic memory copy do we 
> > run into the above "..." code path.
> 
> Right.  And what I wanted to do here is to zero out the uncopied part of
> the page (if it wasn't uptodate), then run commit_write(), then retry the
> whole thing.
> 
> iirc, we ruled that out because those temporary zeroes are exposed to
> userspace.  But the kernel used to do that anyway for a long time (years)
> until someone noticed, and we'll only do it in your 0.0001% case anyway.
> 
> (Actually, perhaps we can prevent it by not marking the page uptodate in
> this case.  But that'll cause a read()er to try to bring it uptodate...)

My thinking was not marking the page uptodate.  But yes that causes the 
problem of a concurrent readpage reading uninitialized disk blocks that 
prepare_write allocated.

> > I would propose to call out a different function altogether which could do 
> > a multitude of things including drop the lock on the destination page 
> > (maintaining a reference on the page!), allocate a temporary page, copy 
> > from the user space page into the temporary page, then lock the 
> > destination page again, and copy from the temporary page into the 
> > destination page.
> 
> The problem with all these things is that as soon as we unlock the page,
> it's visible to read().  And in fact, as soon as we mark it uptodate it's
> visible to mmap, even if it's still locked.

Yes we definitely cannot mark it uptodate before it really is uptodate.
Either we have to not mark it uptodate or we have to zero it or we have to
think of some other magic...

> > This would be slow but who cares given it would only happen incredibly 
> > rarely and on majority of machines it would never happen at all.
> > 
> > The only potential problem I can see is that the destination page could be 
> > truncated whilst it is unlocked.  I can see two possible solutions to 
> > this:
> 
> truncate's OK: we're holding i_mutex.

How about excluding readpage() (in addition to truncate if Nick is right  
and some cases of truncate do not hold i_mutex) with an extra page flag as
I proposed for truncate exclusion?  Then it would not matter that
prepare_write might have allocated blocks and might expose stale data.
It would go to sleep and wait on the bit to be cleared instead of trying  
to bring the page uptodate.  It can then lock the page and either find it 
uptodate (because commit_write did it) or not and then bring it uptodate.

Then we could safely fault in the page, copy from it into a temporary 
page, then lock the destination page again and copy into it.

This is getting more involved as a patch again...  )-:  But at least it   
does not affect the common case except for having to check the new page 
flag in every readpage() and truncate() call.  But at least the checks 
could be with an "if (unlikely(newpageflag()))" so should not be too bad.

Have I missed anything this time?
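
Roughly, the readpage() side of that exclusion might look like the sketch
below.  Everything with "write_pending" in its name is an invented stand-in
for the proposed new page flag and its wait primitive, and example_get_block()
stands in for the filesystem's get_block; this only illustrates the idea, it
is not existing kernel API:

#include <linux/buffer_head.h>
#include <linux/pagemap.h>

/* Hypothetical stand-ins for the proposed flag and its wait primitive. */
extern int PageWritePending(struct page *page);
extern void wait_on_write_pending(struct page *page);
extern int example_get_block(struct inode *inode, sector_t iblock,
			     struct buffer_head *bh, int create);

static int example_readpage(struct file *file, struct page *page)
{
	if (unlikely(PageWritePending(page))) {
		/*
		 * prepare_write() may have allocated blocks but the data
		 * copy has not completed: wait instead of exposing stale
		 * on-disk contents.
		 */
		unlock_page(page);
		wait_on_write_pending(page);
		lock_page(page);
		if (PageUptodate(page)) {
			/* commit_write() finished the job for us. */
			unlock_page(page);
			return 0;
		}
	}
	return block_read_full_page(page, example_get_block);
}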

Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Nick Piggin
On Sun, Feb 04, 2007 at 03:15:49AM -0800, Andrew Morton wrote:
> On Sun, 4 Feb 2007 12:03:17 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > On Sun, Feb 04, 2007 at 02:56:02AM -0800, Andrew Morton wrote:
> > > On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > > 
> > > If that recollection is right, I think we could afford to reintroduce that
> > > problem, frankly.  Especially as it only happens in the incredibly rare
> > > case of that get_user()ed page getting unmapped under our feet.
> > 
> > Dang. I was hoping to fix it without introducing data corruption.
> 
> Well.  It's a compromise.  Being practical about it, I really doubt that
> anyone will hit this combination of circumstances.

They're not likely to hit the deadlocks, either. Probability gets more
likely after my patch to lock the page in the fault path. But practically,
we could live without that too, because the data corruption it fixes is
very rare as well. Which is exactly what we've been doing quite happily
for most of 2.6, including all distro kernels (I think).

But (sadly) for me, there is no compromise. I may be known as the clown
who tries outlandish things to shave a few atomic ops and interrupt flag
changes in the page allocator, or make the pagecache lockless. However
I can't be happy even making something faster if correctness is < 100.0%,
even if less likely than hardware failure.

> > > > > > but you introduce the theoretical memory deadlock
> > > > > > where a task cannot reclaim its own memory.
> > > > > 
> > > > > Nah, that'll never happen - both pages are already allocated.
> > > > 
> > > > Both pages? I don't get it.
> > > > 
> > > > You set the don't-reclaim vma flag, then run get_user, which takes a
> > > > page fault and potentially has to allocate N pages for pagetables,
> > > > pagecache readahead, buffers and fs private data and pagecache radix
> > > > tree nodes for all of the pages read in.
> > > 
> > > Oh, OK.  Need to do the get_user() twice then.  Once before taking that 
> > > new
> > > rwsem.
> > 
> > Race condition remains.
> 
> No, not in a million years.

There is a huge window. Think about what other activity will be holding
that very rwsem for write, that you'll have to wait for in the race window.
But you could say that's also a question of practicality, because it is
pretty unlikely to do anything bad even if you do hit the race.

Well if fixing this is just going to be flat-out vetoed on performance
grounds then I can't argue, because it does impact performance.

Thinking about the numbers, if your kernel's reliability is already the
same order of magnitude as reliability of commodity hardware, then
trading a bit of performance for a bit of reliability is a BAD tradeoff
if you are at all interested in performance on commodity hardware. That
is especially true if you have a massive redundant cluster of commodity
systems, which I understand is a fairly big market for Linux. The 
X-nines guys who would disagree are probably a tiny niche for Linux.

So I do understand your argument for practicality, even if I don't agree.

For the record, here is the "temporary page" fix that should fix it
properly. And some numbers.

Nick

--

Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.   lock_page
3.    prepare_write
4.     unlock_page+vmtruncate
5.     copy_from_user
6.      mmap_sem(r)
7.       handle_mm_fault
8.        lock_page (filemap_nopage)
9.    commit_write
10.  unlock_page

a. sys_munmap / sys_mlock / others
b.  mmap_sem(w)
c.   make_pages_present
d.    get_user_pages
e.     handle_mm_fault
f.      lock_page (filemap_nopage)

2,8     - recursive deadlock if page is same
2,8;2,8 - ABBA deadlock if page is different
2,6;b,f - ABBA deadlock if page is same

The solution is as follows:
1.  If we find the destination page is uptodate, continue as normal, but use
atomic usercopies which do not take pagefaults and do not zero the uncopied
tail of the destination. The destination is already uptodate, so we can
commit_write the full length even if there was a partial copy: it does not
matter that the tail was not modified, because if it is dirtied and written
back to disk it will not cause any problems (uptodate *means* that the
destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
information, because atomic usercopies cannot distinguish non-present
pages in a readable mapping from the lack of a readable mapping.

2.  If we find the destination page is non uptodate, unlock it (this could be
made slightly more optimal), then allocate a temporary page to copy the
source data into. Relock the destination page and continue with the copy.
However, instead of a usercopy (which might take a fault), copy the data
via the kernel address space.
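
A rough illustration of that non-uptodate path is sketched below.  It is not
the actual patch: the function name is invented, and the iovec handling, error
handling and the recheck for truncation that the real patch needs are omitted.

#include <linux/fs.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/string.h>
#include <linux/uaccess.h>

/*
 * The destination page is locked on entry and unlocked on return; a
 * -EFAULT return means the caller should re-fault the source and
 * restart this iteration.
 */
static int stage_copy_through_temp_page(struct file *file, struct page *page,
					unsigned offset, unsigned bytes,
					const char __user *buf)
{
	const struct address_space_operations *aops = page->mapping->a_ops;
	struct page *tmp;
	char *src, *dst;
	int ret;

	unlock_page(page);		/* drop the page lock before faulting */

	tmp = alloc_page(GFP_KERNEL);
	if (!tmp)
		return -ENOMEM;

	/* This usercopy may fault and sleep: no pagecache page is locked. */
	src = kmap(tmp);
	ret = copy_from_user(src, buf, bytes) ? -EFAULT : 0;
	if (ret)
		goto out;

	/*
	 * Relock and push the staged data in via the kernel mapping only,
	 * so no pagefault can occur under the page lock.  (The real patch
	 * must also recheck that the page was not truncated meanwhile.)
	 */
	lock_page(page);
	ret = aops->prepare_write(file, page, offset, offset + bytes);
	if (ret == 0) {
		dst = kmap_atomic(page, KM_USER0);
		memcpy(dst + offset, src, bytes);
		kunmap_atomic(dst, KM_USER0);
		ret = aops->commit_write(file, page, offset, offset + bytes);
	}
	unlock_page(page);
out:
	kunmap(tmp);
	__free_page(tmp);
	return ret;
}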

Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Nick Piggin
On Sun, Feb 04, 2007 at 03:10:39AM -0800, Andrew Morton wrote:
> On Sun, 4 Feb 2007 10:59:58 +0000 (GMT) Anton Altaparmakov <[EMAIL 
> PROTECTED]> wrote:
> > 
> > How about leaving the existing code with the following minor 
> > modifications:
> > 
> > Instead of calling filemap_copy_from_user{,_iovec}() do only the atomic 
> > bit with pagefaults disabled, i.e. instead of filemap_copy_from_user() we 
> > would do (could of course move into a helper function):
> > 
> > pagefault_disable()
> > kaddr = kmap_atomic(page, KM_USER0);
> > left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
> > kunmap_atomic(kaddr, KM_USER0);
> > pagefault_enable()
> > 
> > if (unlikely(left)) {
> > /* The user space page got unmapped before we got to copy it. */
> > ...
> > }
> > 
> > Thus 99.999% (or more!) of the time the code would just work as it 
> > always has and there is no bug and no speed impact.  Only in the very rare 
> > and hard to trigger race condition that the user space page after being 
> > faulted in got thrown out again before we did the atomic memory copy do we 
> > run into the above "..." code path.
> 
> Right.  And what I wanted to do here is to zero out the uncopied part of
> the page (if it wasn't uptodate), then run commit_write(), then retry the
> whole thing.
> 
> iirc, we ruled that out because those temporary zeroes are exposed to
> userspace.  But the kernel used to do that anyway for a long time (years)
> until someone noticed, and we'll only do it in your 0.0001% case anyway.

Serious? I'd rather leave the deadlock in there than introduce a
very hard to reproduce data corruption bug to fix it. At least the
deadlock is fail-stop and you can tell exactly what happened when
you hit it (assuming you can get a trace).

Then again, we've got lots more similar little correctness corner
cases like this that most people don't notice most of the time. Am
I aiming too high?

> (Actually, perhaps we can prevent it by not marking the page uptodate in
> this case.  But that'll cause a read()er to try to bring it uptodate...)

We have to write something back to the filesystem because it may have
allocated blocks at this point.

> > I would propose to call out a different function altogether which could do 
> > a multitude of things including drop the lock on the destination page 
> > (maintaining a reference on the page!), allocate a temporary page, copy 
> > from the user space page into the temporary page, then lock the 
> > destination page again, and copy from the temporary page into the 
> > destination page.
> 
> The problem with all these things is that as soon as we unlock the page,
> it's visible to read().  And in fact, as soon as we mark it uptodate it's
> visible to mmap, even if it's still locked.
> 
> > This would be slow but who cares given it would only happen incredibly 
> > rarely and on majority of machines it would never happen at all.
> > 
> > The only potential problem I can see is that the destination page could be 
> > truncated whilst it is unlocked.  I can see two possible solutions to 
> > this:
> 
> truncate's OK: we're holding i_mutex.

Not all truncates hold i_mutex. Neither do all invalidates, for that
matter.


Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Andrew Morton
On Sun, 4 Feb 2007 12:03:17 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:

> On Sun, Feb 04, 2007 at 02:56:02AM -0800, Andrew Morton wrote:
> > On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > 
> > > On Sun, Feb 04, 2007 at 02:30:55AM -0800, Andrew Morton wrote:
> > > > On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > > > 
> > > > > The write path is broken. I prefer my kernels slow, than buggy.
> > > > 
> > > > That won't fly.
> > > 
> > > What won't fly?
> > 
> > I suspect the performance cost of this approach would force us to redo it
> > all.
> 
> That's the idea. But at least in the meantime we're correct.

There's no way I'd support merging a change which we know we'll have to
redo, only we have no clue how.

> > If that recollection is right, I think we could afford to reintroduce that
> > problem, frankly.  Especially as it only happens in the incredibly rare
> > case of that get_user()ed page getting unmapped under our feet.
> 
> Dang. I was hoping to fix it without introducing data corruption.

Well.  It's a compromise.  Being practical about it, I really doubt that
anyone will hit this combination of circumstances.

> > > > > but you introduce the theoretical memory deadlock
> > > > > where a task cannot reclaim its own memory.
> > > > 
> > > > Nah, that'll never happen - both pages are already allocated.
> > > 
> > > Both pages? I don't get it.
> > > 
> > > You set the don't-reclaim vma flag, then run get_user, which takes a
> > > page fault and potentially has to allocate N pages for pagetables,
> > > pagecache readahead, buffers and fs private data and pagecache radix
> > > tree nodes for all of the pages read in.
> > 
> > Oh, OK.  Need to do the get_user() twice then.  Once before taking that new
> > rwsem.
> 
> Race condition remains.

No, not in a million years.


Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Andrew Morton
On Sun, 4 Feb 2007 10:59:58 +0000 (GMT) Anton Altaparmakov <[EMAIL PROTECTED]> 
wrote:

> On Sun, 4 Feb 2007, Andrew Morton wrote:
> > On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <[EMAIL PROTECTED]> 
> > wrote:
> > > 2.  If we find the destination page is non uptodate, unlock it (this 
> > > could be
> > > made slightly more optimal), then find and pin the source page with
> > > get_user_pages. Relock the destination page and continue with the 
> > > copy.
> > > However, instead of a usercopy (which might take a fault), copy the 
> > > data
> > > via the kernel address space.
> > 
> > argh.  We just can't go adding all this gunk into the write() path. 
> > 
> > mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
> > Even single-process write() will suffer, let alone multithreaded stuff,
> > where mmap_sem contention may be the bigger problem.
> > 
> > I was going to do some quick measurements of this, but the code oopses
> > on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)
> > 
> > We need to think different.
> 
> How about leaving the existing code with the following minor 
> modifications:
> 
> Instead of calling filemap_copy_from_user{,_iovec}() do only the atomic 
> bit with pagefaults disabled, i.e. instead of filemap_copy_from_user() we 
> would do (could of course move into a helper function):
> 
> pagefault_disable()
> kaddr = kmap_atomic(page, KM_USER0);
> left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
> kunmap_atomic(kaddr, KM_USER0);
> pagefault_enable()
> 
> if (unlikely(left)) {
>   /* The user space page got unmapped before we got to copy it. */
>   ...
> }
> 
> Thus 99.999% (or more!) of the time the code would just work as it 
> always has and there is no bug and no speed impact.  Only in the very rare 
> and hard to trigger race condition that the user space page after being 
> faulted in got thrown out again before we did the atomic memory copy do we 
> run into the above "..." code path.

Right.  And what I wanted to do here is to zero out the uncopied part of
the page (if it wasn't uptodate), then run commit_write(), then retry the
whole thing.

iirc, we ruled that out because those temporary zeroes are exposed to
userspace.  But the kernel used to do that anyway for a long time (years)
until someone noticed, and we'll only do it in your 0.0001% case anyway.

(Actually, perhaps we can prevent it by not marking the page uptodate in
this case.  But that'll cause a read()er to try to bring it uptodate...)

> I would propose to call out a different function altogether which could do 
> a multitude of things including drop the lock on the destination page 
> (maintaining a reference on the page!), allocate a temporary page, copy 
> from the user space page into the temporary page, then lock the 
> destination page again, and copy from the temporary page into the 
> destination page.

The problem with all these things is that as soon as we unlock the page,
it's visible to read().  And in fact, as soon as we mark it uptodate it's
visible to mmap, even if it's still locked.

> This would be slow but who cares given it would only happen incredibly 
> rarely and on majority of machines it would never happen at all.
> 
> The only potential problem I can see is that the destination page could be 
> truncated whilst it is unlocked.  I can see two possible solutions to 
> this:

truncate's OK: we're holding i_mutex.



Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Nick Piggin
On Sun, Feb 04, 2007 at 02:56:02AM -0800, Andrew Morton wrote:
> On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > On Sun, Feb 04, 2007 at 02:30:55AM -0800, Andrew Morton wrote:
> > > On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > > 
> > > > The write path is broken. I prefer my kernels slow, than buggy.
> > > 
> > > That won't fly.
> > 
> > What won't fly?
> 
> I suspect the performance cost of this approach would force us to redo it
> all.

That's the idea. But at least in the meantime we're correct.

> > > > That was my second idea.
> > > 
> > > Coulda sworn it was mine ;) I thought you ended up deciding it wasn't
> > > practical because of the games we needed to play with ->commit_write.
> > 
> > Maybe I misunderstood what you meant, above.
> 
> The original set of half-written patches I sent you.  Do an atomic 
> copy_from_user()
> inside the page lock and if that fails, zero out the remainder of the page, 
> run
> commit_write() and then redo the whole thing.

Oh that. Data corruption, transient zeroes.

> > I have an alternative fix
> > where a temporary page is allocated if the write encounters a non
> > uptodate page. The usercopy then goes into that page, and from there
> > into the target page after we have opened the prepare_write().
> 
> Remember that a non-uptodate page is the common case.

Yes.

> 
> > My *first* idea to fix this was to do the atomic copy into a non-uptodate
> > page and then calling a zero-length commit_write if it failed. I pretty
> > carefully constructed all these good arguments as to why each case works
> > properly, but in the end it just didn't fly because it broke lots of
> > filesystems.
> 
> I forget the details now.  I think we did have a workable-looking solution
> based on the atomic copy_from_user() but it would have re-exposed the old
> problem wherein a page would fleetingly have a bunch of zeroes in the
> middle of it, if someone looked at it during the write.
> 
> If that recollection is right, I think we could afford to reintroduce that
> problem, frankly.  Especially as it only happens in the incredibly rare
> case of that get_user()ed page getting unmapped under our feet.

Dang. I was hoping to fix it without introducing data corruption.

> > > > but you introduce the theoretical memory deadlock
> > > > where a task cannot reclaim its own memory.
> > > 
> > > Nah, that'll never happen - both pages are already allocated.
> > 
> > Both pages? I don't get it.
> > 
> > You set the don't-reclaim vma flag, then run get_user, which takes a
> > page fault and potentially has to allocate N pages for pagetables,
> > pagecache readahead, buffers and fs private data and pagecache radix
> > tree nodes for all of the pages read in.
> 
> Oh, OK.  Need to do the get_user() twice then.  Once before taking that new
> rwsem.

Race condition remains.


Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Anton Altaparmakov
On Sun, 4 Feb 2007, Andrew Morton wrote:
> On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <[EMAIL PROTECTED]> 
> wrote:
> > 2.  If we find the destination page is non uptodate, unlock it (this could 
> > be
> > made slightly more optimal), then find and pin the source page with
> > get_user_pages. Relock the destination page and continue with the copy.
> > However, instead of a usercopy (which might take a fault), copy the data
> > via the kernel address space.
> 
> argh.  We just can't go adding all this gunk into the write() path. 
> 
> mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
> Even single-process write() will suffer, let alone multithreaded stuff,
> where mmap_sem contention may be the bigger problem.
> 
> I was going to do some quick measurements of this, but the code oopses
> on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)
> 
> We need to think different.

How about leaving the existing code with the following minor 
modifications:

Instead of calling filemap_copy_from_user{,_iovec}() do only the atomic 
bit with pagefaults disabled, i.e. instead of filemap_copy_from_user() we 
would do (could of course move into a helper function):

pagefault_disable()
kaddr = kmap_atomic(page, KM_USER0);
left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
kunmap_atomic(kaddr, KM_USER0);
pagefault_enable()

if (unlikely(left)) {
	/* The user space page got unmapped before we got to copy it. */
	...
}

Thus 99.999% (or more!) of the time the code would just work as it 
always has and there is no bug and no speed impact.  Only in the very rare 
and hard to trigger race condition that the user space page after being 
faulted in got thrown out again before we did the atomic memory copy do we 
run into the above "..." code path.
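
Folded into the helper function mentioned above, the fast path might look
roughly like this (filemap_copy_from_user_atomic() is just an invented name
for that helper; the rare fallback stays in the "..." branch):

#include <linux/highmem.h>
#include <linux/uaccess.h>

/*
 * Hypothetical helper wrapping the atomic fast path above.  Returns the
 * number of bytes NOT copied; nonzero means the source page went away
 * and the caller must take the (rare) slow path.
 */
static size_t filemap_copy_from_user_atomic(struct page *page,
					    unsigned long offset,
					    const char __user *buf,
					    unsigned bytes)
{
	char *kaddr;
	size_t left;

	pagefault_disable();
	kaddr = kmap_atomic(page, KM_USER0);
	left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
	kunmap_atomic(kaddr, KM_USER0);
	pagefault_enable();

	return left;
}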

I would propose to call out a different function altogether which could do 
a multitude of things including drop the lock on the destination page 
(maintaining a reference on the page!), allocate a temporary page, copy 
from the user space page into the temporary page, then lock the 
destination page again, and copy from the temporary page into the 
destination page.

This would be slow but who cares given it would only happen incredibly 
rarely and on majority of machines it would never happen at all.

The only potential problem I can see is that the destination page could be 
truncated whilst it is unlocked.  I can see two possible solutions to 
this:

1) Invent a new page flag and run "SetDoNotTruncatePage()" before dropping 
the page lock, then modify truncate code paths to check for page locked 
and the new flag and not truncate a page if the new flag is set, just like 
they would if it was locked.  Basically treat the two bits as equivalent 
for truncate purposes.

I think 1) would be best but a possible alternative would be:

2) Make commit write check whether a page has been truncated and if so 
make it abort the operation/transaction (or non-journalling file systems 
can probably just ignore the call completely).  Then simply restart the 
generic_file_buffered_write() call by faulting the user space page in 
again, etc.  And this time round things will just work again.  I guess it 
would be possible that we hit the same race condition again and again and 
again so we would loop for ever but the chances for that are even less 
than the original race condition.  I suppose to guard against it we could 
have a counter that, say, limited us to X retries and if that failed the write 
would be aborted with -EDEADLOCK or something.  As I said I think 1) would 
be nicer...
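
A sketch of the bounded retry in 2), with an arbitrary limit, might look as
follows (write_one_page_locked() and PAGE_WAS_TRUNCATED are invented stand-ins
for the per-page iteration of generic_file_buffered_write() and its "page went
away" result):

#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/pagemap.h>

#define PAGE_WAS_TRUNCATED	1	/* invented status code */
#define MAX_TRUNCATE_RETRIES	8	/* arbitrary bound */

/* Invented stand-in for the per-page write iteration. */
ssize_t write_one_page_locked(struct file *file, const char __user *buf,
			      size_t bytes, loff_t pos);

static ssize_t write_with_bounded_retries(struct file *file,
					  const char __user *buf,
					  size_t bytes, loff_t pos)
{
	ssize_t ret;
	int tries;

	for (tries = 0; tries < MAX_TRUNCATE_RETRIES; tries++) {
		fault_in_pages_readable(buf, bytes);	/* fault the source back in */
		ret = write_one_page_locked(file, buf, bytes, pos);
		if (ret != PAGE_WAS_TRUNCATED)
			return ret;
	}
	return -EDEADLOCK;	/* give up rather than looping forever */
}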

What do you think?

Best regards,

Anton
-- 
Anton Altaparmakov  (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Andrew Morton
On Sun, 4 Feb 2007 11:46:09 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:

> On Sun, Feb 04, 2007 at 02:30:55AM -0800, Andrew Morton wrote:
> > On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > 
> > > The write path is broken. I prefer my kernels slow, than buggy.
> > 
> > That won't fly.
> 
> What won't fly?

I suspect the performance cost of this approach would force us to redo it
all.

> > > > What happened to the idea of doing an atomic copy into the non-uptodate
> > > > page and handling it somehow?
> > > 
> > > That was my second idea.
> > 
> > Coulda sworn it was mine ;) I thought you ended up deciding it wasn't
> > practical because of the games we needed to play with ->commit_write.
> 
> Maybe I misunderstood what you meant, above.

The original set of half-written patches I sent you.  Do an atomic 
copy_from_user()
inside the page lock and if that fails, zero out the remainder of the page, run
commit_write() and then redo the whole thing.

> I have an alternative fix
> where a temporary page is allocated if the write encounters a non
> uptodate page. The usercopy then goes into that page, and from there
> into the target page after we have opened the prepare_write().

Remember that a non-uptodate page is the common case.

> My *first* idea to fix this was to do the atomic copy into a non-uptodate
> page and then calling a zero-length commit_write if it failed. I pretty
> carefully constructed all these good arguments as to why each case works
> properly, but in the end it just didn't fly because it broke lots of
> filesystems.

I forget the details now.  I think we did have a workable-looking solution
based on the atomic copy_from_user() but it would have re-exposed the old
problem wherein a page would fleetingly have a bunch of zeroes in the
middle of it, if someone looked at it during the write.

If that recollection is right, I think we could afford to reintroduce that
problem, frankly.  Especially as it only happens in the incredibly rare
case of that get_user()ed page getting unmapped under our feet.

> > > but you introduce the theoretical memory deadlock
> > > where a task cannot reclaim its own memory.
> > 
> > Nah, that'll never happen - both pages are already allocated.
> 
> Both pages? I don't get it.
> 
> You set the don't-reclaim vma flag, then run get_user, which takes a
> page fault and potentially has to allocate N pages for pagetables,
> pagecache readahead, buffers and fs private data and pagecache radix
> tree nodes for all of the pages read in.

Oh, OK.  Need to do the get_user() twice then.  Once before taking that new
rwsem.



Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Nick Piggin
On Sun, Feb 04, 2007 at 11:46:09AM +0100, Nick Piggin wrote:
> 
> > It's better than taking mmap_sem and walking pagetables...
> 
> I'm not convinced.

Though I am more convinced that looking at mm *at all* (either to
take the mmap_sem and find the vma, or to take the mmap_sem and
run get_user_pages) is going to hurt.

We'd have to special case kernel threads, which don't even have an
mm, let alone the vmas... too ugly.

I'll revert to my temporary-page approach: at least that will
fix the problem.



Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Nick Piggin
On Sun, Feb 04, 2007 at 02:30:55AM -0800, Andrew Morton wrote:
> On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > The write path is broken. I prefer my kernels slow, than buggy.
> 
> That won't fly.

What won't fly?

> 
> > > There's a build error in filemap_xip.c btw.
> 
> ?

Thanks?

> > > What happened to the idea of doing an atomic copy into the non-uptodate
> > > page and handling it somehow?
> > 
> > That was my second idea.
> 
> Coulda sworn it was mine ;) I thought you ended up deciding it wasn't
> practical because of the games we needed to play with ->commit_write.

Maybe I misunderstood what you meant, above. I have an alternative fix
where a temporary page is allocated if the write encounters a non
uptodate page. The usercopy then goes into that page, and from there
into the target page after we have opened the prepare_write().

My *first* idea to fix this was to do the atomic copy into a non-uptodate
page and then calling a zero-length commit_write if it failed. I pretty
carefully constructed all these good arguments as to why each case works
properly, but in the end it just didn't fly because it broke lots of
filesystems.

> > > Another option might be to effectively pin the whole mm during the copy:
> > > 
> > >   down_read(¤t->mm->unpaging_lock);
> > >   get_user(addr); /* Fault the page in */
> > >   ...
> > >   copy_from_user()
> > >   up_read(¤t->mm->unpaging_lock);
> > > 
> > > then, anyone who wants to unmap pages from this mm requires
> > > write_lock(unpaging_lock).  So we know the results of that get_user()
> > > cannot be undone.
> > 
> > Fugly.
> 
> I invited you to think different - don't just fixate on one random
> tossed-out-there suggestion.

I've thought. Quite a lot. I have 2 other approaches that don't require
mmap_sem, and 1 which is actually possible to implement without breaking
filesystems.

> > but you introduce the theoretical memory deadlock
> > where a task cannot reclaim its own memory.
> 
> Nah, that'll never happen - both pages are already allocated.

Both pages? I don't get it.

You set the don't-reclaim vma flag, then run get_user, which takes a
page fault and potentially has to allocate N pages for pagetables,
pagecache readahead, buffers and fs private data and pagecache radix
tree nodes for all of the pages read in.

> It's better than taking mmap_sem and walking pagetables...

I'm not convinced.


Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Andrew Morton
On Sun, 4 Feb 2007 11:15:29 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:

> On Sun, Feb 04, 2007 at 01:44:45AM -0800, Andrew Morton wrote:
> > On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <[EMAIL PROTECTED]> 
> > wrote:
> > 
> > > 2.  If we find the destination page is non uptodate, unlock it (this 
> > > could be
> > > made slightly more optimal), then find and pin the source page with
> > > get_user_pages. Relock the destination page and continue with the 
> > > copy.
> > > However, instead of a usercopy (which might take a fault), copy the 
> > > data
> > > via the kernel address space.
> > 
> > argh.  We just can't go adding all this gunk into the write() path. 
> > 
> > mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
> > Even single-process write() will suffer, let alone multithreaded stuff,
> > where mmap_sem contention may be the bigger problem.
> 
> The write path is broken. I prefer my kernels slow, than buggy.

That won't fly.

> > There's a build error in filemap_xip.c btw.

?

> > 
> > We need to think different.
> > 
> > What happened to the idea of doing an atomic copy into the non-uptodate
> > page and handling it somehow?
> 
> That was my second idea.

Coulda sworn it was mine ;) I thought you ended up deciding it wasn't
practical because of the games we needed to play with ->commit_write.

> I didn't get any feedback on that patchset
> except to try this method, so I assume everyone hated it.
> 
> I actually liked it, because it didn't have to do the writev
> segment-at-a-time for !uptodate pages like this one does. Considering
> this code gets called from mm-less contexts, maybe I'll have to go back
> to this approach.

OK.

> > Another option might be to effectively pin the whole mm during the copy:
> > 
> > down_read(&current->mm->unpaging_lock);
> > get_user(addr); /* Fault the page in */
> > ...
> > copy_from_user()
> > up_read(&current->mm->unpaging_lock);
> > 
> > then, anyone who wants to unmap pages from this mm requires
> > write_lock(unpaging_lock).  So we know the results of that get_user()
> > cannot be undone.
> 
> Fugly.

I invited you to think different - don't just fixate on one random
tossed-out-there suggestion.

> but you introduce the theoretical memory deadlock
> where a task cannot reclaim its own memory.

Nah, that'll never happen - both pages are already allocated.

It's better than taking mmap_sem and walking pagetables...


Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Christoph Hellwig
On Sun, Feb 04, 2007 at 11:15:29AM +0100, Nick Piggin wrote:
> Cool, a kernel thread is calling sys_write. Fun.

There are tons of places where we possibly call into ->write from
either kernel threads or at least with a kernel pointer  and set_fs/set_ds
magic.  Anything in the buffer write path that tries to touch page tables
can't work.
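
For illustration, a minimal sketch (not from the patch) of the set_fs/set_ds
pattern Christoph is referring to; kernel_write_example is a made-up helper
name.  The "user" pointer handed down to ->write here is really a kernel
address, so walking current->mm's page tables for it cannot find anything
useful, and a pure kernel thread may not even have an mm at all:

static ssize_t kernel_write_example(struct file *file, const char *kbuf,
				    size_t count, loff_t *pos)
{
	mm_segment_t old_fs = get_fs();
	ssize_t ret;

	set_fs(KERNEL_DS);	/* let the usercopy checks accept kbuf */
	ret = vfs_write(file, (const char __user *)kbuf, count, pos);
	set_fs(old_fs);

	return ret;
}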

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Nick Piggin
On Sun, Feb 04, 2007 at 01:44:45AM -0800, Andrew Morton wrote:
> On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <[EMAIL PROTECTED]> 
> wrote:
> 
> > 2.  If we find the destination page is non uptodate, unlock it (this could
> >     be made slightly more optimal), then find and pin the source page with
> >     get_user_pages. Relock the destination page and continue with the copy.
> >     However, instead of a usercopy (which might take a fault), copy the data
> >     via the kernel address space.
> 
> argh.  We just can't go adding all this gunk into the write() path. 
> 
> mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
> Even single-process write() will suffer, let alone multithreaded stuff,
> where mmap_sem contention may be the bigger problem.

The write path is broken. I prefer my kernels slow rather than buggy.

As I said, I'm working on a replacement API so that the filesystems
that care, can be correct *and* fast.

> I was going to do some quick measurements of this, but the code oopses
> on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)

Cool, a kernel thread is calling sys_write. Fun.

I guess I should be able to reproduce this if using initramfs. Thanks.

> There's a build error in filemap_xip.c btw.
> 
> 
> 
> We need to think different.
> 
> What happened to the idea of doing an atomic copy into the non-uptodate
> page and handling it somehow?

That was my second idea. I didn't get any feedback on that patchset
except to try this method, so I assume everyone hated it.

I actually liked it, because it didn't have to do the writev
segment-at-a-time for !uptodate pages like this one does. Considering
this code gets called from mm-less contexts, maybe I'll have to go back
to this approach.

> Another option might be to effectively pin the whole mm during the copy:
> 
>   down_read(&current->mm->unpaging_lock);
>   get_user(addr); /* Fault the page in */
>   ...
>   copy_from_user()
>   up_read(&current->mm->unpaging_lock);
> 
> then, anyone who wants to unmap pages from this mm requires
> write_lock(unpaging_lock).  So we know the results of that get_user()
> cannot be undone.

Fugly. Don't know whether there are any lock order problems making it
hard to implement, but you introduce the theoretical memory deadlock
where a task cannot reclaim its own memory.

> Or perhaps something like this can be done on a per-vma basis.  Just
> something to tell the VM "hey, you're not allowed to unmap this page right
> now"?

Same memory deadlock problem.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-04 Thread Andrew Morton
On Sun,  4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <[EMAIL PROTECTED]> wrote:

> 2.  If we find the destination page is non uptodate, unlock it (this could be
> made slightly more optimal), then find and pin the source page with
> get_user_pages. Relock the destination page and continue with the copy.
> However, instead of a usercopy (which might take a fault), copy the data
> via the kernel address space.

argh.  We just can't go adding all this gunk into the write() path. 

mmap_sem, a full pte-walk, taking of pte-page locks, etc.  For every page. 
Even single-process write() will suffer, let alone multithreaded stuff,
where mmap_sem contention may be the bigger problem.

I was going to do some quick measurements of this, but the code oopses
on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)

There's a build error in filemap_xip.c btw.



We need to think different.

What happened to the idea of doing an atomic copy into the non-uptodate
page and handling it somehow?



Another option might be to effectively pin the whole mm during the copy:

	down_read(&current->mm->unpaging_lock);
	get_user(addr); /* Fault the page in */
	...
	copy_from_user()
	up_read(&current->mm->unpaging_lock);

then, anyone who wants to unmap pages from this mm requires
write_lock(unpaging_lock).  So we know the results of that get_user()
cannot be undone.
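
Spelling the suggestion out a little more (sketch only -- unpaging_lock is a
hypothetical rwsem in mm_struct, not an existing field, and the unmap-side
callers named in the comment are just examples):

	/* write() side: hold off unmapping across the fault + copy */
	down_read(&current->mm->unpaging_lock);
	get_user(c, buf);			/* fault the source page in */
	...
	copy_from_user(kaddr + offset, buf, bytes);
	up_read(&current->mm->unpaging_lock);

	/* anything that unmaps pages from this mm (munmap, truncate via
	 * unmap_mapping_range, reclaim unmapping on its behalf) would
	 * first need: */
	down_write(&mm->unpaging_lock);
	/* ... zap the ptes ... */
	up_write(&mm->unpaging_lock);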

Or perhaps something like this can be done on a per-vma basis.  Just
something to tell the VM "hey, you're not allowed to unmap this page right
now"?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-02 Thread Nick Piggin
On Fri, Feb 02, 2007 at 03:53:11PM -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 11:33:03 +0100 (CET)
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > Modify the core write() code so that it won't take a pagefault while
> > holding a lock on the pagecache page. There are a number of different
> > deadlocks possible if we try to do such a thing:
> > 
> > 1.  generic_buffered_write
> > 2.   lock_page
> > 3.    prepare_write
> > 4.     unlock_page+vmtruncate
> > 5.    copy_from_user
> > 6.     mmap_sem(r)
> > 7.      handle_mm_fault
> > 8.       lock_page (filemap_nopage)
> > 9.    commit_write
> > 10.  unlock_page
> > 
> > a. sys_munmap / sys_mlock / others
> > b.  mmap_sem(w)
> > c.   make_pages_present
> > d.    get_user_pages
> > e.     handle_mm_fault
> > f.      lock_page (filemap_nopage)
> > 
> > 2,8     - recursive deadlock if page is same
> > 2,8;2,8 - ABBA deadlock if page is different
> > 2,6;b,f - ABBA deadlock if page is same
> > 
> > The solution is as follows:
> > 1.  If we find the destination page is uptodate, continue as normal, but use
> >     atomic usercopies which do not take pagefaults and do not zero the
> >     uncopied tail of the destination. The destination is already uptodate, so
> >     we can commit_write the full length even if there was a partial copy: it
> >     does not matter that the tail was not modified, because if it is dirtied
> >     and written back to disk it will not cause any problems (uptodate *means*
> >     that the destination page is as new or newer than the copy on disk).
> > 
> > 1a. The above requires that fault_in_pages_readable correctly returns access
> >     information, because atomic usercopies cannot distinguish between
> >     non-present pages in a readable mapping, from lack of a readable mapping.
> > 
> > 2.  If we find the destination page is non uptodate, unlock it (this could
> >     be made slightly more optimal), then find and pin the source page with
> >     get_user_pages. Relock the destination page and continue with the copy.
> >     However, instead of a usercopy (which might take a fault), copy the data
> >     via the kernel address space.
> > 
> 
> Oh what a mess we're making :(
> 
> Unfortunately, write() into a non-uptodate page is very much the common
> case.  We've always tried to avoid doing a pte-walk in the write() path to
> fix this bug.  Careful performance testing is needed here so we can assess
> the impact.  For threaded applications, simply the taking of mmap_sem might
> be the biggest problem.
> 
> And I can't think of any tricks we can play to avoid doing the pte-walk in
> most cases.  For example, we don't yet have a page to run page_mapped()
> against.

After this patch series, I am working on another that will allow filesystems
to specifically code around the problem (eg. by handling short usercopies
properly).

I tried to take this approach generically the first time, but it turns out
lots of filesystems had subtle problems, so if we do it this way instead,
then filesystem developers who actually care enough can improve their
code, and those that don't won't hold them back (or prevent this bug from
being fixed).
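
To make "handling short usercopies properly" concrete, here is roughly the
shape such a loop could take -- a sketch under stated assumptions, not the
posted patch or the promised new API.  It assumes fault_in_pages_readable()
is used to prefault the source with no locks held, that kmap_atomic() keeps
the copy in atomic context so __copy_from_user_inatomic() cannot recurse
into the fault handler while the destination page is locked, and that the
filesystem's commit_write can cope with committing fewer bytes than
prepare_write was told about -- that last part is exactly what needs the
fs-specific handling mentioned above:

	/*
	 * Sketch only: error handling, iovec advancing and page-boundary
	 * handling omitted.  'page' is the pagecache page, 'buf' the user
	 * source, 'offset'/'bytes' the in-page range to copy.
	 */
	while (bytes) {
		unsigned long copied;
		char *kaddr;

		fault_in_pages_readable(buf, bytes);	/* may sleep; no locks held */

		lock_page(page);
		/* ->prepare_write(file, page, offset, offset + bytes) here */

		kaddr = kmap_atomic(page, KM_USER0);
		copied = bytes -
			__copy_from_user_inatomic(kaddr + offset, buf, bytes);
		kunmap_atomic(kaddr, KM_USER0);

		/* ->commit_write() only for the 'copied' bytes */
		unlock_page(page);

		buf += copied;
		offset += copied;
		bytes -= copied;
		/* copied == 0: the source got unmapped meanwhile; loop, refault */
	}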

> > break;
> > }
> >  
> > +   /*
> > +* non-uptodate pages cannot cope with short copies, and we
> > +* cannot take a pagefault with the destination page locked.
> > +* So pin the source page to copy it.
> > +*/
> > +   if (!PageUptodate(page)) {
> > +   unlock_page(page);
> > +
> > +   bytes = min(bytes, PAGE_CACHE_SIZE -
> > +((unsigned long)buf & ~PAGE_CACHE_MASK));
> > +
> > +   /*
> > +* Cannot get_user_pages with a page locked for the
> > +* same reason as we can't take a page fault with a
> > +* page locked (as explained below).
> > +*/
> > +   down_read(&current->mm->mmap_sem);
> > +   status = get_user_pages(current, current->mm,
> > +   (unsigned long)buf & PAGE_CACHE_MASK, 1,
> > +   0, 0, &src_page, NULL);
> > +   up_read(&current->mm->mmap_sem);
> > +   if (status != 1) {
> > +   page_cache_release(page);
> > +   break;
> > +   }
> > +
> > +   lock_page(page);
> > +   if (!page->mapping) {
> 
> Hopefully this can't happen?  If it can, who went and took our page off the
> mapping?  Reclaim?  The elevated page_count will prevent that?

Truncate/invalidate?

> > +   unlock_page(page);
> > +   page_cache_release(page);
> > +   page_cache_release(src_page);
> > +   continue;
> > +  

Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-02-02 Thread Andrew Morton
On Mon, 29 Jan 2007 11:33:03 +0100 (CET)
Nick Piggin <[EMAIL PROTECTED]> wrote:

> Modify the core write() code so that it won't take a pagefault while holding a
> lock on the pagecache page. There are a number of different deadlocks possible
> if we try to do such a thing:
> 
> 1.  generic_buffered_write
> 2.   lock_page
> 3.    prepare_write
> 4.     unlock_page+vmtruncate
> 5.    copy_from_user
> 6.     mmap_sem(r)
> 7.      handle_mm_fault
> 8.       lock_page (filemap_nopage)
> 9.    commit_write
> 10.  unlock_page
> 
> a. sys_munmap / sys_mlock / others
> b.  mmap_sem(w)
> c.   make_pages_present
> d.    get_user_pages
> e.     handle_mm_fault
> f.      lock_page (filemap_nopage)
> 
> 2,8     - recursive deadlock if page is same
> 2,8;2,8 - ABBA deadlock if page is different
> 2,6;b,f - ABBA deadlock if page is same
> 
> The solution is as follows:
> 1.  If we find the destination page is uptodate, continue as normal, but use
> atomic usercopies which do not take pagefaults and do not zero the uncopied
> tail of the destination. The destination is already uptodate, so we can
> commit_write the full length even if there was a partial copy: it does not
> matter that the tail was not modified, because if it is dirtied and written
> back to disk it will not cause any problems (uptodate *means* that the
> destination page is as new or newer than the copy on disk).
> 
> 1a. The above requires that fault_in_pages_readable correctly returns access
> information, because atomic usercopies cannot distinguish between
> non-present pages in a readable mapping, from lack of a readable mapping.
> 
> 2.  If we find the destination page is non uptodate, unlock it (this could be
> made slightly more optimal), then find and pin the source page with
> get_user_pages. Relock the destination page and continue with the copy.
> However, instead of a usercopy (which might take a fault), copy the data
> via the kernel address space.
> 

Oh what a mess we're making :(

Unfortunately, write() into a non-uptodate page is very much the common
case.  We've always tried to avoid doing a pte-walk in the write() path to
fix this bug.  Careful performance testing is needed here so we can assess
the impact.  For threaded applications, simply the taking of mmap_sem might
be the biggest problem.

And I can't think of any tricks we can play to avoid doing the pte-walk in
most cases.  For example, we don't yet have a page to run page_mapped()
against.

>   break;
>   }
>  
> + /*
> +  * non-uptodate pages cannot cope with short copies, and we
> +  * cannot take a pagefault with the destination page locked.
> +  * So pin the source page to copy it.
> +  */
> + if (!PageUptodate(page)) {
> + unlock_page(page);
> +
> + bytes = min(bytes, PAGE_CACHE_SIZE -
> +  ((unsigned long)buf & ~PAGE_CACHE_MASK));
> +
> + /*
> +  * Cannot get_user_pages with a page locked for the
> +  * same reason as we can't take a page fault with a
> +  * page locked (as explained below).
> +  */
> + down_read(&current->mm->mmap_sem);
> + status = get_user_pages(current, current->mm,
> + (unsigned long)buf & PAGE_CACHE_MASK, 1,
> + 0, 0, &src_page, NULL);
> + up_read(&current->mm->mmap_sem);
> + if (status != 1) {
> + page_cache_release(page);
> + break;
> + }
> +
> + lock_page(page);
> + if (!page->mapping) {

Hopefully this can't happen?  If it can, who went and took our page off the
mapping?  Reclaim?  The elevated page_count will prevent that?

> + unlock_page(page);
> + page_cache_release(page);
> + page_cache_release(src_page);
> + continue;
> + }
> +
> + }
> +
>   status = a_ops->prepare_write(file, page, offset, offset+bytes);
>   if (unlikely(status))
>   goto fs_write_aop_error;
>  
> - copied = filemap_copy_from_user(page, offset,
> + if (!src_page) {
> + /*
> +  * Must not enter the pagefault handler here, because
> +  * we hold the page lock, so we might recursively
> +  * deadlock on the same lock, or get an ABBA deadlock
> +  * against a different lock, or against the mmap_sem
> +  * (which nests outside the page lock).  So increment
> +  

Re: [patch 9/9] mm: fix pagecache write deadlocks

2007-01-29 Thread Nick Piggin
On Mon, Jan 29, 2007 at 11:33:03AM +0100, Nick Piggin wrote:
> + } else {
> + char *src, *dst;
> + src = kmap(src_page);
> + dst = kmap(page);
> + memcpy(dst + offset,
> + src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
> + bytes);
> + kunmap(page);
> + kunmap(src_page);
> + copied = bytes;
> + }
>   flush_dcache_page(page);

Hmm, I guess these should use kmap_atomic with KM_USER[01]?

The kmap is from an earlier iteration that wanted to sleep
with the page mapped into kernel.
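
For concreteness, a sketch of what the kmap_atomic() variant being suggested
could look like (illustrative only, not the posted patch).  The KM_USER0 and
KM_USER1 slots keep the two simultaneous atomic mappings apart, and running
with pagefaults disabled is fine here because the source page is already
pinned by get_user_pages() and the copy is a plain memcpy, not a usercopy:

	} else {
		char *src, *dst;

		src = kmap_atomic(src_page, KM_USER0);
		dst = kmap_atomic(page, KM_USER1);
		memcpy(dst + offset,
			src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
			bytes);
		kunmap_atomic(dst, KM_USER1);
		kunmap_atomic(src, KM_USER0);
		copied = bytes;
	}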
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/