Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Nick Piggin
On Fri, Feb 09, 2007 at 03:46:44AM -0800, Andrew Morton wrote:
> On Fri, 9 Feb 2007 12:31:16 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > > 
> > > We'll never, ever, ever update and test all filesytems.  What you're
> > > calling "legacy" code will be there for all time.
> > 
> > I didn't say all; I still prefer correct than fast;
> 
> For gawd's sake.  You can make the kernel less buggy by removing SMP
> support.

I'm talking about known bugs.

> Guess what?  Tradeoffs exist.

I agree that 60% is much too big of a hit for all filesystems. Which is
why I propose this new aop.

> > you are still free
> > to keep the fast-and-buggy code in the legacy path.
> 
> You make it sound like this is a choice.  It isn't.  Nobody is going to go
> in and convert all those filesystems.

IMO, once all the maintained filesystems are converted then it would be
a good choice to make. You think otherwise and I won't argue.

> > > 
> > > I haven't had time to look at the perform_write stuff yet.
> > > 
> > > > Of course I would still want my correct-but-slow version in that case,
> > > > but I just wouldn't care to argue if you still wanted to keep it fast.
> > > 
> > > This is write().  We just cannot go and double-copy all the memory or take
> > > mmap_sem and do a full pagetable walk in there.  It just means that we
> > > haven't found a suitable solution yet.
> > 
> > You prefer speed over correctness even for little used filessytems, which
> > is fine because I'm sick of arguing about it. The main thing for me is that
> > important filesystems can be correct and fast.
> 
> I wouldn't characterise it as "arguing".  It's development.  Going and
> sticking enormous slowdowns into write() to fix some bug which nobody is
> hitting is insane.

Actually I'm doing this because I try to fix real data corruption problems
which people are hitting. You told me I can't get those fixes in until I
fix this problem.

> We need to find a better fix, that's all.

I actually found perform_write to be a speedup. And if perform_write is
merged then I would be happy to not fix the prepare_write path, or wait for
someone to come up with a better fix.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Andrew Morton
On Fri, 9 Feb 2007 12:31:16 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:

> > > > > But that all becomes legacy path, so do we really care? Supposing fs
> > > > > maintainers like perform_write, then after the main ones have 
> > > > > implementations
> > > > > we could switch over to the slow-but-correct prepare_write legacy 
> > > > > path.
> > > > > Or we could leave it, or we could use Linus's slightly-less-buggy 
> > > > > scheme...
> > > > > by that point I expect I'd be sick of arguing about it ;)
> > > > 
> > > > It's worth "arguing" about.  This is write().  What matters more??
> > > 
> > > That's the legacy path that uses prepare/commit (ie. supposing that all
> > > major filesystems did get converted to perform_write).
> > 
> > We'll never, ever, ever update and test all filesytems.  What you're
> > calling "legacy" code will be there for all time.
> 
> I didn't say all; I still prefer correct than fast;

For gawd's sake.  You can make the kernel less buggy by removing SMP
support.

Guess what?  Tradeoffs exist.

> you are still free
> to keep the fast-and-buggy code in the legacy path.

You make it sound like this is a choice.  It isn't.  Nobody is going to go
in and convert all those filesystems.

> > 
> > I haven't had time to look at the perform_write stuff yet.
> > 
> > > Of course I would still want my correct-but-slow version in that case,
> > > but I just wouldn't care to argue if you still wanted to keep it fast.
> > 
> > This is write().  We just cannot go and double-copy all the memory or take
> > mmap_sem and do a full pagetable walk in there.  It just means that we
> > haven't found a suitable solution yet.
> 
> You prefer speed over correctness even for little used filessytems, which
> is fine because I'm sick of arguing about it. The main thing for me is that
> important filesystems can be correct and fast.

I wouldn't characterise it as "arguing".  It's development.  Going and
sticking enormous slowdowns into write() to fix some bug which nobody is
hitting is insane.

We need to find a better fix, that's all.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Nick Piggin
On Fri, Feb 09, 2007 at 02:52:49AM -0800, Andrew Morton wrote:
> On Fri, 9 Feb 2007 11:32:58 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > On Fri, Feb 09, 2007 at 02:09:54AM -0800, Andrew Morton wrote:
> > > On Fri, 9 Feb 2007 10:54:05 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > > 
> > > > 
> > > > That's still got a deadlock,
> > > 
> > > It does?
> > 
> > Yes, PG_lock vs mm->mmap_sem.
> 
> Where?  It requires that someone hold mmap_sem for writing as well as a
> page lock (in an order which would require some thought).  Do we ever do
> that?

task1   task2   task3
lock_page(page) down_read(mmap_sem)
down_write(mmap_sem)
down_read(mmap_sem)
lock_page(page)

t1 is blocked by t3
t2 is blocked by t1
t3 is blocked by t2

OK, it isn't quite ABBA, but that's shorthand if we remember that
that an rwsem read must obey normal lock nesting rules.

> > > >  and also it doesn't work if we want to lock
> > > > the page when performing a minor fault (which I want to fix fault vs
> > > > invalidate),
> > > 
> > > It's hard to discuss this without a description of what you want to fix
> > > there, and a description of how you plan to fix it.
> > 
> > http://marc.theaimsgroup.com/?l=linux-mm&m=116865911432667&w=2
> 
> mutter.
> 
> Could perhaps fix that by running ClearPageUptodate in invalidate, thus
> forcing the pagefault code to take the page lock (which we already hold).
> 
> That does mean that we'll fleetingly have a non-uptodate page in pagetables
> which is a bit nasty.
> 
> Or, probably better, we could add a new page flag (heh) telling nopage that
> it needs to lock the page even if it's uptodate.

I don't see how either of those suggestions would fix this bug.

> > > > and also assumes nobody's ->nopage locks the page or
> > > > requires any resources that are held by prepare_write (something not
> > > > immediately clear to me with the cluster filesystems, at least).
> > > 
> > > The nopage handler is filemap_nopage().  Are there any exceptions to that?
> > 
> > OCFS2 and GFS2.
> 
> So the rule is that ->nopage handlers against pagecache mustn't lock the
> page if it's already uptodate.  That's OK.  But it might conflict with the
> above invalidate proposal.

Well that _will_ be the rule if you want to do Linus's half-fix. It will
also be the rule that they're not to deadlock against prepare_write.

> > > > But that all becomes legacy path, so do we really care? Supposing fs
> > > > maintainers like perform_write, then after the main ones have 
> > > > implementations
> > > > we could switch over to the slow-but-correct prepare_write legacy path.
> > > > Or we could leave it, or we could use Linus's slightly-less-buggy 
> > > > scheme...
> > > > by that point I expect I'd be sick of arguing about it ;)
> > > 
> > > It's worth "arguing" about.  This is write().  What matters more??
> > 
> > That's the legacy path that uses prepare/commit (ie. supposing that all
> > major filesystems did get converted to perform_write).
> 
> We'll never, ever, ever update and test all filesytems.  What you're
> calling "legacy" code will be there for all time.

I didn't say all; I still prefer correct than fast; you are still free
to keep the fast-and-buggy code in the legacy path.

> 
> I haven't had time to look at the perform_write stuff yet.
> 
> > Of course I would still want my correct-but-slow version in that case,
> > but I just wouldn't care to argue if you still wanted to keep it fast.
> 
> This is write().  We just cannot go and double-copy all the memory or take
> mmap_sem and do a full pagetable walk in there.  It just means that we
> haven't found a suitable solution yet.

You prefer speed over correctness even for little used filessytems, which
is fine because I'm sick of arguing about it. The main thing for me is that
important filesystems can be correct and fast.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Andrew Morton
On Fri, 9 Feb 2007 11:32:58 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:

> On Fri, Feb 09, 2007 at 02:09:54AM -0800, Andrew Morton wrote:
> > On Fri, 9 Feb 2007 10:54:05 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > 
> > > 
> > > That's still got a deadlock,
> > 
> > It does?
> 
> Yes, PG_lock vs mm->mmap_sem.

Where?  It requires that someone hold mmap_sem for writing as well as a
page lock (in an order which would require some thought).  Do we ever do
that?

> > >  and also it doesn't work if we want to lock
> > > the page when performing a minor fault (which I want to fix fault vs
> > > invalidate),
> > 
> > It's hard to discuss this without a description of what you want to fix
> > there, and a description of how you plan to fix it.
> 
> http://marc.theaimsgroup.com/?l=linux-mm&m=116865911432667&w=2

mutter.

Could perhaps fix that by running ClearPageUptodate in invalidate, thus
forcing the pagefault code to take the page lock (which we already hold).

That does mean that we'll fleetingly have a non-uptodate page in pagetables
which is a bit nasty.

Or, probably better, we could add a new page flag (heh) telling nopage that
it needs to lock the page even if it's uptodate.

> > > and also assumes nobody's ->nopage locks the page or
> > > requires any resources that are held by prepare_write (something not
> > > immediately clear to me with the cluster filesystems, at least).
> > 
> > The nopage handler is filemap_nopage().  Are there any exceptions to that?
> 
> OCFS2 and GFS2.

So the rule is that ->nopage handlers against pagecache mustn't lock the
page if it's already uptodate.  That's OK.  But it might conflict with the
above invalidate proposal.

Gad.  ocfs2_nopage() diddles with signals.


> > > But that all becomes legacy path, so do we really care? Supposing fs
> > > maintainers like perform_write, then after the main ones have 
> > > implementations
> > > we could switch over to the slow-but-correct prepare_write legacy path.
> > > Or we could leave it, or we could use Linus's slightly-less-buggy 
> > > scheme...
> > > by that point I expect I'd be sick of arguing about it ;)
> > 
> > It's worth "arguing" about.  This is write().  What matters more??
> 
> That's the legacy path that uses prepare/commit (ie. supposing that all
> major filesystems did get converted to perform_write).

We'll never, ever, ever update and test all filesytems.  What you're
calling "legacy" code will be there for all time.

I haven't had time to look at the perform_write stuff yet.

> Of course I would still want my correct-but-slow version in that case,
> but I just wouldn't care to argue if you still wanted to keep it fast.

This is write().  We just cannot go and double-copy all the memory or take
mmap_sem and do a full pagetable walk in there.  It just means that we
haven't found a suitable solution yet.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Nick Piggin
On Fri, Feb 09, 2007 at 02:09:54AM -0800, Andrew Morton wrote:
> On Fri, 9 Feb 2007 10:54:05 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > 
> > That's still got a deadlock,
> 
> It does?

Yes, PG_lock vs mm->mmap_sem.

> >  and also it doesn't work if we want to lock
> > the page when performing a minor fault (which I want to fix fault vs
> > invalidate),
> 
> It's hard to discuss this without a description of what you want to fix
> there, and a description of how you plan to fix it.

http://marc.theaimsgroup.com/?l=linux-mm&m=116865911432667&w=2

> > and also assumes nobody's ->nopage locks the page or
> > requires any resources that are held by prepare_write (something not
> > immediately clear to me with the cluster filesystems, at least).
> 
> The nopage handler is filemap_nopage().  Are there any exceptions to that?

OCFS2 and GFS2.

> > But that all becomes legacy path, so do we really care? Supposing fs
> > maintainers like perform_write, then after the main ones have 
> > implementations
> > we could switch over to the slow-but-correct prepare_write legacy path.
> > Or we could leave it, or we could use Linus's slightly-less-buggy scheme...
> > by that point I expect I'd be sick of arguing about it ;)
> 
> It's worth "arguing" about.  This is write().  What matters more??

That's the legacy path that uses prepare/commit (ie. supposing that all
major filesystems did get converted to perform_write).

Of course I would still want my correct-but-slow version in that case,
but I just wouldn't care to argue if you still wanted to keep it fast.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Andrew Morton
On Fri, 9 Feb 2007 10:54:05 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:

> On Fri, Feb 09, 2007 at 12:41:01AM -0800, Andrew Morton wrote:
> > On Thu,  8 Feb 2007 14:07:15 +0100 (CET) Nick Piggin <[EMAIL PROTECTED]> 
> > wrote:
> > 
> > > So I have finally finished a first slightly-working draft of my new aops
> > > op (perform_write) proposal. I would be interested to hear comments about
> > > it.  Most of my issues and concerns are in the patch headers themselves,
> > > so reply to them.
> > > 
> > > The patches are against my latest buffered-write-fix patchset.
> > 
> > What happened with Linus's proposal to instantiate the page as pinned,
> > non-uptodate, unlocked and in pagecache while we poke the user address?
> 
> That's still got a deadlock,

It does?

>  and also it doesn't work if we want to lock
> the page when performing a minor fault (which I want to fix fault vs
> invalidate),

It's hard to discuss this without a description of what you want to fix
there, and a description of how you plan to fix it.

> and also assumes nobody's ->nopage locks the page or
> requires any resources that are held by prepare_write (something not
> immediately clear to me with the cluster filesystems, at least).

The nopage handler is filemap_nopage().  Are there any exceptions to that?

> But that all becomes legacy path, so do we really care? Supposing fs
> maintainers like perform_write, then after the main ones have implementations
> we could switch over to the slow-but-correct prepare_write legacy path.
> Or we could leave it, or we could use Linus's slightly-less-buggy scheme...
> by that point I expect I'd be sick of arguing about it ;)

It's worth "arguing" about.  This is write().  What matters more??
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Nick Piggin
On Fri, Feb 09, 2007 at 12:41:01AM -0800, Andrew Morton wrote:
> On Thu,  8 Feb 2007 14:07:15 +0100 (CET) Nick Piggin <[EMAIL PROTECTED]> 
> wrote:
> 
> > So I have finally finished a first slightly-working draft of my new aops
> > op (perform_write) proposal. I would be interested to hear comments about
> > it.  Most of my issues and concerns are in the patch headers themselves,
> > so reply to them.
> > 
> > The patches are against my latest buffered-write-fix patchset.
> 
> What happened with Linus's proposal to instantiate the page as pinned,
> non-uptodate, unlocked and in pagecache while we poke the user address?

That's still got a deadlock, and also it doesn't work if we want to lock
the page when performing a minor fault (which I want to fix fault vs
invalidate), and also assumes nobody's ->nopage locks the page or
requires any resources that are held by prepare_write (something not
immediately clear to me with the cluster filesystems, at least).

But that all becomes legacy path, so do we really care? Supposing fs
maintainers like perform_write, then after the main ones have implementations
we could switch over to the slow-but-correct prepare_write legacy path.
Or we could leave it, or we could use Linus's slightly-less-buggy scheme...
by that point I expect I'd be sick of arguing about it ;)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Andrew Morton
On Thu,  8 Feb 2007 14:07:15 +0100 (CET) Nick Piggin <[EMAIL PROTECTED]> wrote:

> So I have finally finished a first slightly-working draft of my new aops
> op (perform_write) proposal. I would be interested to hear comments about
> it.  Most of my issues and concerns are in the patch headers themselves,
> so reply to them.
> 
> The patches are against my latest buffered-write-fix patchset.

What happened with Linus's proposal to instantiate the page as pinned,
non-uptodate, unlocked and in pagecache while we poke the user address?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-08 Thread Nick Piggin
On Thu, Feb 08, 2007 at 04:38:01PM -0800, Mark Fasheh wrote:
> On Thu, Feb 08, 2007 at 02:07:15PM +0100, Nick Piggin wrote:
> > The problem is that the existing aops interface is crap. "correct, fast,
> > compatible -- choose any 2"
> 
> Agreed. There's lots of problems with the interface (imho), but my biggest
> two issues are the page lock being held on ->prepare_write() /
> ->commit_write() and the fact that the file system only sees the write one
> page at a time
> 
> 
> > So I have finally finished a first slightly-working draft of my new aops
> > op (perform_write) proposal. I would be interested to hear comments about
> > it.  Most of my issues and concerns are in the patch headers themselves,
> > so reply to them.
> 
> I like ->perform_write(). It allows the file system to make re-use of the
> checks in the generic write path, but gives a maximum amount of information
> about the overall operation to be done. There's an added advantage in that
> some file systems (ocfs2 is one of these) want to be more careful about
> ordering page locks, which should be much easier with it.

Yeah, if possible I like a range based interface rather than page
based. As you say it gives the most information with the least constraints.

> If this goes in, it could probably be helpful to me in some of the code I'm
> currently writing which needs to be better about page lock / cluster lock
> ordering and wants to see as much of the (allocating) writes as possible.

I think it would be important to have a non trivial user of this new API
before it goes into mainline. It would be great if you could look at
using it, after it passes some review.

Thanks,
Nick
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-08 Thread Mark Fasheh
On Thu, Feb 08, 2007 at 02:07:15PM +0100, Nick Piggin wrote:
> The problem is that the existing aops interface is crap. "correct, fast,
> compatible -- choose any 2"

Agreed. There's lots of problems with the interface (imho), but my biggest
two issues are the page lock being held on ->prepare_write() /
->commit_write() and the fact that the file system only sees the write one
page at a time


> So I have finally finished a first slightly-working draft of my new aops
> op (perform_write) proposal. I would be interested to hear comments about
> it.  Most of my issues and concerns are in the patch headers themselves,
> so reply to them.

I like ->perform_write(). It allows the file system to make re-use of the
checks in the generic write path, but gives a maximum amount of information
about the overall operation to be done. There's an added advantage in that
some file systems (ocfs2 is one of these) want to be more careful about
ordering page locks, which should be much easier with it.

If this goes in, it could probably be helpful to me in some of the code I'm
currently writing which needs to be better about page lock / cluster lock
ordering and wants to see as much of the (allocating) writes as possible.
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-08 Thread Nick Piggin
In my last set of numbers for my buffered-write deadlock fix using 2 copies
per page, I realised there is no real performance hit for !uptodate pages
as opposed to uptodate ones. This is unexpected because the uptodate pages
only require a single copy...

The problem turns out to be operator error. I forgot tmpfs won't use this
prepare_write path, so sorry about that.

On ext2, copy 64MB of data from /dev/zero (IO isn't involved), using
4K and 64K block sizes, and conv=notrunc for testing overwriting of
uptodate pages. Numbers is elapsed time in seconds, lower is better.

2.6.20  bufferd write fix
4K  0.0742  0.1208 (1.63x)
4K-uptodate 0.0493  0.0479 (0.97x)
64K 0.0671  0.1068 (1.59x)
64K-uptodate0.0357  0.0362 (1.01x)

So we get about a 60% performance hit, which is more expected. I guess if
0.5% doesn't fly, then 60% is right out ;)

If there were any misconceptions, the problem is not that the code is
incredibly tricky or impossible to fix with good performance. The problem
is that the existing aops interface is crap. "correct, fast, compatible
-- choose any 2"

So I have finally finished a first slightly-working draft of my new aops
op (perform_write) proposal. I would be interested to hear comments about
it.  Most of my issues and concerns are in the patch headers themselves,
so reply to them.

The patches are against my latest buffered-write-fix patchset. This
means filesystems not implementing the new aop, will remain safe, if slow.
Here's some numbers after converting ext2 to the new aop:

2.6.20  perform_write aop
4K  0.0742  0.0769 (1.04x)
4K-uptodate 0.0493  0.0475 (0.96x)
64K 0.0671  0.0613 (0.91x)
64K-uptodate0.0357  0.0343 (0.96x)

Thanks,
Nick

--
SuSE Labs

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/