Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Nick Piggin
On Fri, Feb 09, 2007 at 03:46:44AM -0800, Andrew Morton wrote:
> On Fri, 9 Feb 2007 12:31:16 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > > 
> > > We'll never, ever, ever update and test all filesytems.  What you're
> > > calling "legacy" code will be there for all time.
> > 
> > I didn't say all; I still prefer correct than fast;
> 
> For gawd's sake.  You can make the kernel less buggy by removing SMP
> support.

I'm talking about known bugs.

> Guess what?  Tradeoffs exist.

I agree that 60% is much too big of a hit for all filesystems. Which is
why I propose this new aop.

> > you are still free
> > to keep the fast-and-buggy code in the legacy path.
> 
> You make it sound like this is a choice.  It isn't.  Nobody is going to go
> in and convert all those filesystems.

IMO, once all the maintained filesystems are converted then it would be
a good choice to make. You think otherwise and I won't argue.

> > > 
> > > I haven't had time to look at the perform_write stuff yet.
> > > 
> > > > Of course I would still want my correct-but-slow version in that case,
> > > > but I just wouldn't care to argue if you still wanted to keep it fast.
> > > 
> > > This is write().  We just cannot go and double-copy all the memory or take
> > > mmap_sem and do a full pagetable walk in there.  It just means that we
> > > haven't found a suitable solution yet.
> > 
> > You prefer speed over correctness even for little used filessytems, which
> > is fine because I'm sick of arguing about it. The main thing for me is that
> > important filesystems can be correct and fast.
> 
> I wouldn't characterise it as "arguing".  It's development.  Going and
> sticking enormous slowdowns into write() to fix some bug which nobody is
> hitting is insane.

Actually I'm doing this because I try to fix real data corruption problems
which people are hitting. You told me I can't get those fixes in until I
fix this problem.

> We need to find a better fix, that's all.

I actually found perform_write to be a speedup. And if perform_write is
merged then I would be happy to not fix the prepare_write path, or wait for
someone to come up with a better fix.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Andrew Morton
On Fri, 9 Feb 2007 12:31:16 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:

> > > > > But that all becomes legacy path, so do we really care? Supposing fs
> > > > > maintainers like perform_write, then after the main ones have 
> > > > > implementations
> > > > > we could switch over to the slow-but-correct prepare_write legacy 
> > > > > path.
> > > > > Or we could leave it, or we could use Linus's slightly-less-buggy 
> > > > > scheme...
> > > > > by that point I expect I'd be sick of arguing about it ;)
> > > > 
> > > > It's worth "arguing" about.  This is write().  What matters more??
> > > 
> > > That's the legacy path that uses prepare/commit (ie. supposing that all
> > > major filesystems did get converted to perform_write).
> > 
> > We'll never, ever, ever update and test all filesytems.  What you're
> > calling "legacy" code will be there for all time.
> 
> I didn't say all; I still prefer correct than fast;

For gawd's sake.  You can make the kernel less buggy by removing SMP
support.

Guess what?  Tradeoffs exist.

> you are still free
> to keep the fast-and-buggy code in the legacy path.

You make it sound like this is a choice.  It isn't.  Nobody is going to go
in and convert all those filesystems.

> > 
> > I haven't had time to look at the perform_write stuff yet.
> > 
> > > Of course I would still want my correct-but-slow version in that case,
> > > but I just wouldn't care to argue if you still wanted to keep it fast.
> > 
> > This is write().  We just cannot go and double-copy all the memory or take
> > mmap_sem and do a full pagetable walk in there.  It just means that we
> > haven't found a suitable solution yet.
> 
> You prefer speed over correctness even for little used filessytems, which
> is fine because I'm sick of arguing about it. The main thing for me is that
> important filesystems can be correct and fast.

I wouldn't characterise it as "arguing".  It's development.  Going and
sticking enormous slowdowns into write() to fix some bug which nobody is
hitting is insane.

We need to find a better fix, that's all.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Nick Piggin
On Fri, Feb 09, 2007 at 02:52:49AM -0800, Andrew Morton wrote:
> On Fri, 9 Feb 2007 11:32:58 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > On Fri, Feb 09, 2007 at 02:09:54AM -0800, Andrew Morton wrote:
> > > On Fri, 9 Feb 2007 10:54:05 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > > 
> > > > 
> > > > That's still got a deadlock,
> > > 
> > > It does?
> > 
> > Yes, PG_lock vs mm->mmap_sem.
> 
> Where?  It requires that someone hold mmap_sem for writing as well as a
> page lock (in an order which would require some thought).  Do we ever do
> that?

task1   task2   task3
lock_page(page) down_read(mmap_sem)
down_write(mmap_sem)
down_read(mmap_sem)
lock_page(page)

t1 is blocked by t3
t2 is blocked by t1
t3 is blocked by t2

OK, it isn't quite ABBA, but that's shorthand if we remember that
that an rwsem read must obey normal lock nesting rules.

> > > >  and also it doesn't work if we want to lock
> > > > the page when performing a minor fault (which I want to fix fault vs
> > > > invalidate),
> > > 
> > > It's hard to discuss this without a description of what you want to fix
> > > there, and a description of how you plan to fix it.
> > 
> > http://marc.theaimsgroup.com/?l=linux-mm=116865911432667=2
> 
> mutter.
> 
> Could perhaps fix that by running ClearPageUptodate in invalidate, thus
> forcing the pagefault code to take the page lock (which we already hold).
> 
> That does mean that we'll fleetingly have a non-uptodate page in pagetables
> which is a bit nasty.
> 
> Or, probably better, we could add a new page flag (heh) telling nopage that
> it needs to lock the page even if it's uptodate.

I don't see how either of those suggestions would fix this bug.

> > > > and also assumes nobody's ->nopage locks the page or
> > > > requires any resources that are held by prepare_write (something not
> > > > immediately clear to me with the cluster filesystems, at least).
> > > 
> > > The nopage handler is filemap_nopage().  Are there any exceptions to that?
> > 
> > OCFS2 and GFS2.
> 
> So the rule is that ->nopage handlers against pagecache mustn't lock the
> page if it's already uptodate.  That's OK.  But it might conflict with the
> above invalidate proposal.

Well that _will_ be the rule if you want to do Linus's half-fix. It will
also be the rule that they're not to deadlock against prepare_write.

> > > > But that all becomes legacy path, so do we really care? Supposing fs
> > > > maintainers like perform_write, then after the main ones have 
> > > > implementations
> > > > we could switch over to the slow-but-correct prepare_write legacy path.
> > > > Or we could leave it, or we could use Linus's slightly-less-buggy 
> > > > scheme...
> > > > by that point I expect I'd be sick of arguing about it ;)
> > > 
> > > It's worth "arguing" about.  This is write().  What matters more??
> > 
> > That's the legacy path that uses prepare/commit (ie. supposing that all
> > major filesystems did get converted to perform_write).
> 
> We'll never, ever, ever update and test all filesytems.  What you're
> calling "legacy" code will be there for all time.

I didn't say all; I still prefer correct than fast; you are still free
to keep the fast-and-buggy code in the legacy path.

> 
> I haven't had time to look at the perform_write stuff yet.
> 
> > Of course I would still want my correct-but-slow version in that case,
> > but I just wouldn't care to argue if you still wanted to keep it fast.
> 
> This is write().  We just cannot go and double-copy all the memory or take
> mmap_sem and do a full pagetable walk in there.  It just means that we
> haven't found a suitable solution yet.

You prefer speed over correctness even for little used filessytems, which
is fine because I'm sick of arguing about it. The main thing for me is that
important filesystems can be correct and fast.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Andrew Morton
On Fri, 9 Feb 2007 11:32:58 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:

> On Fri, Feb 09, 2007 at 02:09:54AM -0800, Andrew Morton wrote:
> > On Fri, 9 Feb 2007 10:54:05 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > 
> > > 
> > > That's still got a deadlock,
> > 
> > It does?
> 
> Yes, PG_lock vs mm->mmap_sem.

Where?  It requires that someone hold mmap_sem for writing as well as a
page lock (in an order which would require some thought).  Do we ever do
that?

> > >  and also it doesn't work if we want to lock
> > > the page when performing a minor fault (which I want to fix fault vs
> > > invalidate),
> > 
> > It's hard to discuss this without a description of what you want to fix
> > there, and a description of how you plan to fix it.
> 
> http://marc.theaimsgroup.com/?l=linux-mm=116865911432667=2

mutter.

Could perhaps fix that by running ClearPageUptodate in invalidate, thus
forcing the pagefault code to take the page lock (which we already hold).

That does mean that we'll fleetingly have a non-uptodate page in pagetables
which is a bit nasty.

Or, probably better, we could add a new page flag (heh) telling nopage that
it needs to lock the page even if it's uptodate.

> > > and also assumes nobody's ->nopage locks the page or
> > > requires any resources that are held by prepare_write (something not
> > > immediately clear to me with the cluster filesystems, at least).
> > 
> > The nopage handler is filemap_nopage().  Are there any exceptions to that?
> 
> OCFS2 and GFS2.

So the rule is that ->nopage handlers against pagecache mustn't lock the
page if it's already uptodate.  That's OK.  But it might conflict with the
above invalidate proposal.

Gad.  ocfs2_nopage() diddles with signals.


> > > But that all becomes legacy path, so do we really care? Supposing fs
> > > maintainers like perform_write, then after the main ones have 
> > > implementations
> > > we could switch over to the slow-but-correct prepare_write legacy path.
> > > Or we could leave it, or we could use Linus's slightly-less-buggy 
> > > scheme...
> > > by that point I expect I'd be sick of arguing about it ;)
> > 
> > It's worth "arguing" about.  This is write().  What matters more??
> 
> That's the legacy path that uses prepare/commit (ie. supposing that all
> major filesystems did get converted to perform_write).

We'll never, ever, ever update and test all filesytems.  What you're
calling "legacy" code will be there for all time.

I haven't had time to look at the perform_write stuff yet.

> Of course I would still want my correct-but-slow version in that case,
> but I just wouldn't care to argue if you still wanted to keep it fast.

This is write().  We just cannot go and double-copy all the memory or take
mmap_sem and do a full pagetable walk in there.  It just means that we
haven't found a suitable solution yet.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Nick Piggin
On Fri, Feb 09, 2007 at 02:09:54AM -0800, Andrew Morton wrote:
> On Fri, 9 Feb 2007 10:54:05 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > 
> > That's still got a deadlock,
> 
> It does?

Yes, PG_lock vs mm->mmap_sem.

> >  and also it doesn't work if we want to lock
> > the page when performing a minor fault (which I want to fix fault vs
> > invalidate),
> 
> It's hard to discuss this without a description of what you want to fix
> there, and a description of how you plan to fix it.

http://marc.theaimsgroup.com/?l=linux-mm=116865911432667=2

> > and also assumes nobody's ->nopage locks the page or
> > requires any resources that are held by prepare_write (something not
> > immediately clear to me with the cluster filesystems, at least).
> 
> The nopage handler is filemap_nopage().  Are there any exceptions to that?

OCFS2 and GFS2.

> > But that all becomes legacy path, so do we really care? Supposing fs
> > maintainers like perform_write, then after the main ones have 
> > implementations
> > we could switch over to the slow-but-correct prepare_write legacy path.
> > Or we could leave it, or we could use Linus's slightly-less-buggy scheme...
> > by that point I expect I'd be sick of arguing about it ;)
> 
> It's worth "arguing" about.  This is write().  What matters more??

That's the legacy path that uses prepare/commit (ie. supposing that all
major filesystems did get converted to perform_write).

Of course I would still want my correct-but-slow version in that case,
but I just wouldn't care to argue if you still wanted to keep it fast.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Andrew Morton
On Fri, 9 Feb 2007 10:54:05 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:

> On Fri, Feb 09, 2007 at 12:41:01AM -0800, Andrew Morton wrote:
> > On Thu,  8 Feb 2007 14:07:15 +0100 (CET) Nick Piggin <[EMAIL PROTECTED]> 
> > wrote:
> > 
> > > So I have finally finished a first slightly-working draft of my new aops
> > > op (perform_write) proposal. I would be interested to hear comments about
> > > it.  Most of my issues and concerns are in the patch headers themselves,
> > > so reply to them.
> > > 
> > > The patches are against my latest buffered-write-fix patchset.
> > 
> > What happened with Linus's proposal to instantiate the page as pinned,
> > non-uptodate, unlocked and in pagecache while we poke the user address?
> 
> That's still got a deadlock,

It does?

>  and also it doesn't work if we want to lock
> the page when performing a minor fault (which I want to fix fault vs
> invalidate),

It's hard to discuss this without a description of what you want to fix
there, and a description of how you plan to fix it.

> and also assumes nobody's ->nopage locks the page or
> requires any resources that are held by prepare_write (something not
> immediately clear to me with the cluster filesystems, at least).

The nopage handler is filemap_nopage().  Are there any exceptions to that?

> But that all becomes legacy path, so do we really care? Supposing fs
> maintainers like perform_write, then after the main ones have implementations
> we could switch over to the slow-but-correct prepare_write legacy path.
> Or we could leave it, or we could use Linus's slightly-less-buggy scheme...
> by that point I expect I'd be sick of arguing about it ;)

It's worth "arguing" about.  This is write().  What matters more??
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Nick Piggin
On Fri, Feb 09, 2007 at 12:41:01AM -0800, Andrew Morton wrote:
> On Thu,  8 Feb 2007 14:07:15 +0100 (CET) Nick Piggin <[EMAIL PROTECTED]> 
> wrote:
> 
> > So I have finally finished a first slightly-working draft of my new aops
> > op (perform_write) proposal. I would be interested to hear comments about
> > it.  Most of my issues and concerns are in the patch headers themselves,
> > so reply to them.
> > 
> > The patches are against my latest buffered-write-fix patchset.
> 
> What happened with Linus's proposal to instantiate the page as pinned,
> non-uptodate, unlocked and in pagecache while we poke the user address?

That's still got a deadlock, and also it doesn't work if we want to lock
the page when performing a minor fault (which I want to fix fault vs
invalidate), and also assumes nobody's ->nopage locks the page or
requires any resources that are held by prepare_write (something not
immediately clear to me with the cluster filesystems, at least).

But that all becomes legacy path, so do we really care? Supposing fs
maintainers like perform_write, then after the main ones have implementations
we could switch over to the slow-but-correct prepare_write legacy path.
Or we could leave it, or we could use Linus's slightly-less-buggy scheme...
by that point I expect I'd be sick of arguing about it ;)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Andrew Morton
On Thu,  8 Feb 2007 14:07:15 +0100 (CET) Nick Piggin <[EMAIL PROTECTED]> wrote:

> So I have finally finished a first slightly-working draft of my new aops
> op (perform_write) proposal. I would be interested to hear comments about
> it.  Most of my issues and concerns are in the patch headers themselves,
> so reply to them.
> 
> The patches are against my latest buffered-write-fix patchset.

What happened with Linus's proposal to instantiate the page as pinned,
non-uptodate, unlocked and in pagecache while we poke the user address?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Andrew Morton
On Thu,  8 Feb 2007 14:07:15 +0100 (CET) Nick Piggin [EMAIL PROTECTED] wrote:

 So I have finally finished a first slightly-working draft of my new aops
 op (perform_write) proposal. I would be interested to hear comments about
 it.  Most of my issues and concerns are in the patch headers themselves,
 so reply to them.
 
 The patches are against my latest buffered-write-fix patchset.

What happened with Linus's proposal to instantiate the page as pinned,
non-uptodate, unlocked and in pagecache while we poke the user address?
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Nick Piggin
On Fri, Feb 09, 2007 at 12:41:01AM -0800, Andrew Morton wrote:
 On Thu,  8 Feb 2007 14:07:15 +0100 (CET) Nick Piggin [EMAIL PROTECTED] 
 wrote:
 
  So I have finally finished a first slightly-working draft of my new aops
  op (perform_write) proposal. I would be interested to hear comments about
  it.  Most of my issues and concerns are in the patch headers themselves,
  so reply to them.
  
  The patches are against my latest buffered-write-fix patchset.
 
 What happened with Linus's proposal to instantiate the page as pinned,
 non-uptodate, unlocked and in pagecache while we poke the user address?

That's still got a deadlock, and also it doesn't work if we want to lock
the page when performing a minor fault (which I want to fix fault vs
invalidate), and also assumes nobody's -nopage locks the page or
requires any resources that are held by prepare_write (something not
immediately clear to me with the cluster filesystems, at least).

But that all becomes legacy path, so do we really care? Supposing fs
maintainers like perform_write, then after the main ones have implementations
we could switch over to the slow-but-correct prepare_write legacy path.
Or we could leave it, or we could use Linus's slightly-less-buggy scheme...
by that point I expect I'd be sick of arguing about it ;)

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Andrew Morton
On Fri, 9 Feb 2007 10:54:05 +0100 Nick Piggin [EMAIL PROTECTED] wrote:

 On Fri, Feb 09, 2007 at 12:41:01AM -0800, Andrew Morton wrote:
  On Thu,  8 Feb 2007 14:07:15 +0100 (CET) Nick Piggin [EMAIL PROTECTED] 
  wrote:
  
   So I have finally finished a first slightly-working draft of my new aops
   op (perform_write) proposal. I would be interested to hear comments about
   it.  Most of my issues and concerns are in the patch headers themselves,
   so reply to them.
   
   The patches are against my latest buffered-write-fix patchset.
  
  What happened with Linus's proposal to instantiate the page as pinned,
  non-uptodate, unlocked and in pagecache while we poke the user address?
 
 That's still got a deadlock,

It does?

  and also it doesn't work if we want to lock
 the page when performing a minor fault (which I want to fix fault vs
 invalidate),

It's hard to discuss this without a description of what you want to fix
there, and a description of how you plan to fix it.

 and also assumes nobody's -nopage locks the page or
 requires any resources that are held by prepare_write (something not
 immediately clear to me with the cluster filesystems, at least).

The nopage handler is filemap_nopage().  Are there any exceptions to that?

 But that all becomes legacy path, so do we really care? Supposing fs
 maintainers like perform_write, then after the main ones have implementations
 we could switch over to the slow-but-correct prepare_write legacy path.
 Or we could leave it, or we could use Linus's slightly-less-buggy scheme...
 by that point I expect I'd be sick of arguing about it ;)

It's worth arguing about.  This is write().  What matters more??
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Nick Piggin
On Fri, Feb 09, 2007 at 02:09:54AM -0800, Andrew Morton wrote:
 On Fri, 9 Feb 2007 10:54:05 +0100 Nick Piggin [EMAIL PROTECTED] wrote:
 
  
  That's still got a deadlock,
 
 It does?

Yes, PG_lock vs mm-mmap_sem.

   and also it doesn't work if we want to lock
  the page when performing a minor fault (which I want to fix fault vs
  invalidate),
 
 It's hard to discuss this without a description of what you want to fix
 there, and a description of how you plan to fix it.

http://marc.theaimsgroup.com/?l=linux-mmm=116865911432667w=2

  and also assumes nobody's -nopage locks the page or
  requires any resources that are held by prepare_write (something not
  immediately clear to me with the cluster filesystems, at least).
 
 The nopage handler is filemap_nopage().  Are there any exceptions to that?

OCFS2 and GFS2.

  But that all becomes legacy path, so do we really care? Supposing fs
  maintainers like perform_write, then after the main ones have 
  implementations
  we could switch over to the slow-but-correct prepare_write legacy path.
  Or we could leave it, or we could use Linus's slightly-less-buggy scheme...
  by that point I expect I'd be sick of arguing about it ;)
 
 It's worth arguing about.  This is write().  What matters more??

That's the legacy path that uses prepare/commit (ie. supposing that all
major filesystems did get converted to perform_write).

Of course I would still want my correct-but-slow version in that case,
but I just wouldn't care to argue if you still wanted to keep it fast.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Andrew Morton
On Fri, 9 Feb 2007 11:32:58 +0100 Nick Piggin [EMAIL PROTECTED] wrote:

 On Fri, Feb 09, 2007 at 02:09:54AM -0800, Andrew Morton wrote:
  On Fri, 9 Feb 2007 10:54:05 +0100 Nick Piggin [EMAIL PROTECTED] wrote:
  
   
   That's still got a deadlock,
  
  It does?
 
 Yes, PG_lock vs mm-mmap_sem.

Where?  It requires that someone hold mmap_sem for writing as well as a
page lock (in an order which would require some thought).  Do we ever do
that?

and also it doesn't work if we want to lock
   the page when performing a minor fault (which I want to fix fault vs
   invalidate),
  
  It's hard to discuss this without a description of what you want to fix
  there, and a description of how you plan to fix it.
 
 http://marc.theaimsgroup.com/?l=linux-mmm=116865911432667w=2

mutter.

Could perhaps fix that by running ClearPageUptodate in invalidate, thus
forcing the pagefault code to take the page lock (which we already hold).

That does mean that we'll fleetingly have a non-uptodate page in pagetables
which is a bit nasty.

Or, probably better, we could add a new page flag (heh) telling nopage that
it needs to lock the page even if it's uptodate.

   and also assumes nobody's -nopage locks the page or
   requires any resources that are held by prepare_write (something not
   immediately clear to me with the cluster filesystems, at least).
  
  The nopage handler is filemap_nopage().  Are there any exceptions to that?
 
 OCFS2 and GFS2.

So the rule is that -nopage handlers against pagecache mustn't lock the
page if it's already uptodate.  That's OK.  But it might conflict with the
above invalidate proposal.

Gad.  ocfs2_nopage() diddles with signals.


   But that all becomes legacy path, so do we really care? Supposing fs
   maintainers like perform_write, then after the main ones have 
   implementations
   we could switch over to the slow-but-correct prepare_write legacy path.
   Or we could leave it, or we could use Linus's slightly-less-buggy 
   scheme...
   by that point I expect I'd be sick of arguing about it ;)
  
  It's worth arguing about.  This is write().  What matters more??
 
 That's the legacy path that uses prepare/commit (ie. supposing that all
 major filesystems did get converted to perform_write).

We'll never, ever, ever update and test all filesytems.  What you're
calling legacy code will be there for all time.

I haven't had time to look at the perform_write stuff yet.

 Of course I would still want my correct-but-slow version in that case,
 but I just wouldn't care to argue if you still wanted to keep it fast.

This is write().  We just cannot go and double-copy all the memory or take
mmap_sem and do a full pagetable walk in there.  It just means that we
haven't found a suitable solution yet.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Nick Piggin
On Fri, Feb 09, 2007 at 02:52:49AM -0800, Andrew Morton wrote:
 On Fri, 9 Feb 2007 11:32:58 +0100 Nick Piggin [EMAIL PROTECTED] wrote:
 
  On Fri, Feb 09, 2007 at 02:09:54AM -0800, Andrew Morton wrote:
   On Fri, 9 Feb 2007 10:54:05 +0100 Nick Piggin [EMAIL PROTECTED] wrote:
   

That's still got a deadlock,
   
   It does?
  
  Yes, PG_lock vs mm-mmap_sem.
 
 Where?  It requires that someone hold mmap_sem for writing as well as a
 page lock (in an order which would require some thought).  Do we ever do
 that?

task1   task2   task3
lock_page(page) down_read(mmap_sem)
down_write(mmap_sem)
down_read(mmap_sem)
lock_page(page)

t1 is blocked by t3
t2 is blocked by t1
t3 is blocked by t2

OK, it isn't quite ABBA, but that's shorthand if we remember that
that an rwsem read must obey normal lock nesting rules.

 and also it doesn't work if we want to lock
the page when performing a minor fault (which I want to fix fault vs
invalidate),
   
   It's hard to discuss this without a description of what you want to fix
   there, and a description of how you plan to fix it.
  
  http://marc.theaimsgroup.com/?l=linux-mmm=116865911432667w=2
 
 mutter.
 
 Could perhaps fix that by running ClearPageUptodate in invalidate, thus
 forcing the pagefault code to take the page lock (which we already hold).
 
 That does mean that we'll fleetingly have a non-uptodate page in pagetables
 which is a bit nasty.
 
 Or, probably better, we could add a new page flag (heh) telling nopage that
 it needs to lock the page even if it's uptodate.

I don't see how either of those suggestions would fix this bug.

and also assumes nobody's -nopage locks the page or
requires any resources that are held by prepare_write (something not
immediately clear to me with the cluster filesystems, at least).
   
   The nopage handler is filemap_nopage().  Are there any exceptions to that?
  
  OCFS2 and GFS2.
 
 So the rule is that -nopage handlers against pagecache mustn't lock the
 page if it's already uptodate.  That's OK.  But it might conflict with the
 above invalidate proposal.

Well that _will_ be the rule if you want to do Linus's half-fix. It will
also be the rule that they're not to deadlock against prepare_write.

But that all becomes legacy path, so do we really care? Supposing fs
maintainers like perform_write, then after the main ones have 
implementations
we could switch over to the slow-but-correct prepare_write legacy path.
Or we could leave it, or we could use Linus's slightly-less-buggy 
scheme...
by that point I expect I'd be sick of arguing about it ;)
   
   It's worth arguing about.  This is write().  What matters more??
  
  That's the legacy path that uses prepare/commit (ie. supposing that all
  major filesystems did get converted to perform_write).
 
 We'll never, ever, ever update and test all filesytems.  What you're
 calling legacy code will be there for all time.

I didn't say all; I still prefer correct than fast; you are still free
to keep the fast-and-buggy code in the legacy path.

 
 I haven't had time to look at the perform_write stuff yet.
 
  Of course I would still want my correct-but-slow version in that case,
  but I just wouldn't care to argue if you still wanted to keep it fast.
 
 This is write().  We just cannot go and double-copy all the memory or take
 mmap_sem and do a full pagetable walk in there.  It just means that we
 haven't found a suitable solution yet.

You prefer speed over correctness even for little used filessytems, which
is fine because I'm sick of arguing about it. The main thing for me is that
important filesystems can be correct and fast.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Andrew Morton
On Fri, 9 Feb 2007 12:31:16 +0100 Nick Piggin [EMAIL PROTECTED] wrote:

 But that all becomes legacy path, so do we really care? Supposing fs
 maintainers like perform_write, then after the main ones have 
 implementations
 we could switch over to the slow-but-correct prepare_write legacy 
 path.
 Or we could leave it, or we could use Linus's slightly-less-buggy 
 scheme...
 by that point I expect I'd be sick of arguing about it ;)

It's worth arguing about.  This is write().  What matters more??
   
   That's the legacy path that uses prepare/commit (ie. supposing that all
   major filesystems did get converted to perform_write).
  
  We'll never, ever, ever update and test all filesytems.  What you're
  calling legacy code will be there for all time.
 
 I didn't say all; I still prefer correct than fast;

For gawd's sake.  You can make the kernel less buggy by removing SMP
support.

Guess what?  Tradeoffs exist.

 you are still free
 to keep the fast-and-buggy code in the legacy path.

You make it sound like this is a choice.  It isn't.  Nobody is going to go
in and convert all those filesystems.

  
  I haven't had time to look at the perform_write stuff yet.
  
   Of course I would still want my correct-but-slow version in that case,
   but I just wouldn't care to argue if you still wanted to keep it fast.
  
  This is write().  We just cannot go and double-copy all the memory or take
  mmap_sem and do a full pagetable walk in there.  It just means that we
  haven't found a suitable solution yet.
 
 You prefer speed over correctness even for little used filessytems, which
 is fine because I'm sick of arguing about it. The main thing for me is that
 important filesystems can be correct and fast.

I wouldn't characterise it as arguing.  It's development.  Going and
sticking enormous slowdowns into write() to fix some bug which nobody is
hitting is insane.

We need to find a better fix, that's all.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-09 Thread Nick Piggin
On Fri, Feb 09, 2007 at 03:46:44AM -0800, Andrew Morton wrote:
 On Fri, 9 Feb 2007 12:31:16 +0100 Nick Piggin [EMAIL PROTECTED] wrote:
 
   
   We'll never, ever, ever update and test all filesytems.  What you're
   calling legacy code will be there for all time.
  
  I didn't say all; I still prefer correct than fast;
 
 For gawd's sake.  You can make the kernel less buggy by removing SMP
 support.

I'm talking about known bugs.

 Guess what?  Tradeoffs exist.

I agree that 60% is much too big of a hit for all filesystems. Which is
why I propose this new aop.

  you are still free
  to keep the fast-and-buggy code in the legacy path.
 
 You make it sound like this is a choice.  It isn't.  Nobody is going to go
 in and convert all those filesystems.

IMO, once all the maintained filesystems are converted then it would be
a good choice to make. You think otherwise and I won't argue.

   
   I haven't had time to look at the perform_write stuff yet.
   
Of course I would still want my correct-but-slow version in that case,
but I just wouldn't care to argue if you still wanted to keep it fast.
   
   This is write().  We just cannot go and double-copy all the memory or take
   mmap_sem and do a full pagetable walk in there.  It just means that we
   haven't found a suitable solution yet.
  
  You prefer speed over correctness even for little used filessytems, which
  is fine because I'm sick of arguing about it. The main thing for me is that
  important filesystems can be correct and fast.
 
 I wouldn't characterise it as arguing.  It's development.  Going and
 sticking enormous slowdowns into write() to fix some bug which nobody is
 hitting is insane.

Actually I'm doing this because I try to fix real data corruption problems
which people are hitting. You told me I can't get those fixes in until I
fix this problem.

 We need to find a better fix, that's all.

I actually found perform_write to be a speedup. And if perform_write is
merged then I would be happy to not fix the prepare_write path, or wait for
someone to come up with a better fix.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-08 Thread Nick Piggin
On Thu, Feb 08, 2007 at 04:38:01PM -0800, Mark Fasheh wrote:
> On Thu, Feb 08, 2007 at 02:07:15PM +0100, Nick Piggin wrote:
> > The problem is that the existing aops interface is crap. "correct, fast,
> > compatible -- choose any 2"
> 
> Agreed. There's lots of problems with the interface (imho), but my biggest
> two issues are the page lock being held on ->prepare_write() /
> ->commit_write() and the fact that the file system only sees the write one
> page at a time
> 
> 
> > So I have finally finished a first slightly-working draft of my new aops
> > op (perform_write) proposal. I would be interested to hear comments about
> > it.  Most of my issues and concerns are in the patch headers themselves,
> > so reply to them.
> 
> I like ->perform_write(). It allows the file system to make re-use of the
> checks in the generic write path, but gives a maximum amount of information
> about the overall operation to be done. There's an added advantage in that
> some file systems (ocfs2 is one of these) want to be more careful about
> ordering page locks, which should be much easier with it.

Yeah, if possible I like a range based interface rather than page
based. As you say it gives the most information with the least constraints.

> If this goes in, it could probably be helpful to me in some of the code I'm
> currently writing which needs to be better about page lock / cluster lock
> ordering and wants to see as much of the (allocating) writes as possible.

I think it would be important to have a non trivial user of this new API
before it goes into mainline. It would be great if you could look at
using it, after it passes some review.

Thanks,
Nick
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-08 Thread Mark Fasheh
On Thu, Feb 08, 2007 at 02:07:15PM +0100, Nick Piggin wrote:
> The problem is that the existing aops interface is crap. "correct, fast,
> compatible -- choose any 2"

Agreed. There's lots of problems with the interface (imho), but my biggest
two issues are the page lock being held on ->prepare_write() /
->commit_write() and the fact that the file system only sees the write one
page at a time


> So I have finally finished a first slightly-working draft of my new aops
> op (perform_write) proposal. I would be interested to hear comments about
> it.  Most of my issues and concerns are in the patch headers themselves,
> so reply to them.

I like ->perform_write(). It allows the file system to make re-use of the
checks in the generic write path, but gives a maximum amount of information
about the overall operation to be done. There's an added advantage in that
some file systems (ocfs2 is one of these) want to be more careful about
ordering page locks, which should be much easier with it.

If this goes in, it could probably be helpful to me in some of the code I'm
currently writing which needs to be better about page lock / cluster lock
ordering and wants to see as much of the (allocating) writes as possible.
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-08 Thread Mark Fasheh
On Thu, Feb 08, 2007 at 02:07:15PM +0100, Nick Piggin wrote:
 The problem is that the existing aops interface is crap. correct, fast,
 compatible -- choose any 2

Agreed. There's lots of problems with the interface (imho), but my biggest
two issues are the page lock being held on -prepare_write() /
-commit_write() and the fact that the file system only sees the write one
page at a time


 So I have finally finished a first slightly-working draft of my new aops
 op (perform_write) proposal. I would be interested to hear comments about
 it.  Most of my issues and concerns are in the patch headers themselves,
 so reply to them.

I like -perform_write(). It allows the file system to make re-use of the
checks in the generic write path, but gives a maximum amount of information
about the overall operation to be done. There's an added advantage in that
some file systems (ocfs2 is one of these) want to be more careful about
ordering page locks, which should be much easier with it.

If this goes in, it could probably be helpful to me in some of the code I'm
currently writing which needs to be better about page lock / cluster lock
ordering and wants to see as much of the (allocating) writes as possible.
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc][patch 0/3] a faster buffered write deadlock fix?

2007-02-08 Thread Nick Piggin
On Thu, Feb 08, 2007 at 04:38:01PM -0800, Mark Fasheh wrote:
 On Thu, Feb 08, 2007 at 02:07:15PM +0100, Nick Piggin wrote:
  The problem is that the existing aops interface is crap. correct, fast,
  compatible -- choose any 2
 
 Agreed. There's lots of problems with the interface (imho), but my biggest
 two issues are the page lock being held on -prepare_write() /
 -commit_write() and the fact that the file system only sees the write one
 page at a time
 
 
  So I have finally finished a first slightly-working draft of my new aops
  op (perform_write) proposal. I would be interested to hear comments about
  it.  Most of my issues and concerns are in the patch headers themselves,
  so reply to them.
 
 I like -perform_write(). It allows the file system to make re-use of the
 checks in the generic write path, but gives a maximum amount of information
 about the overall operation to be done. There's an added advantage in that
 some file systems (ocfs2 is one of these) want to be more careful about
 ordering page locks, which should be much easier with it.

Yeah, if possible I like a range based interface rather than page
based. As you say it gives the most information with the least constraints.

 If this goes in, it could probably be helpful to me in some of the code I'm
 currently writing which needs to be better about page lock / cluster lock
 ordering and wants to see as much of the (allocating) writes as possible.

I think it would be important to have a non trivial user of this new API
before it goes into mainline. It would be great if you could look at
using it, after it passes some review.

Thanks,
Nick
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/