Re: [rfc][patch 0/3] a faster buffered write deadlock fix?
On Fri, Feb 09, 2007 at 03:46:44AM -0800, Andrew Morton wrote:
> On Fri, 9 Feb 2007 12:31:16 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> >
> > > We'll never, ever, ever update and test all filesystems. What you're
> > > calling "legacy" code will be there for all time.
> >
> > I didn't say all; I still prefer correct than fast;
>
> For gawd's sake. You can make the kernel less buggy by removing SMP
> support.

I'm talking about known bugs.

> Guess what? Tradeoffs exist.

I agree that 60% is much too big of a hit for all filesystems. Which is
why I propose this new aop.

> > you are still free
> > to keep the fast-and-buggy code in the legacy path.
>
> You make it sound like this is a choice. It isn't. Nobody is going to go
> in and convert all those filesystems.

IMO, once all the maintained filesystems are converted then it would be a
good choice to make. You think otherwise and I won't argue.

> > > I haven't had time to look at the perform_write stuff yet.
> > >
> > > > Of course I would still want my correct-but-slow version in that case,
> > > > but I just wouldn't care to argue if you still wanted to keep it fast.
> > >
> > > This is write(). We just cannot go and double-copy all the memory or
> > > take mmap_sem and do a full pagetable walk in there. It just means
> > > that we haven't found a suitable solution yet.
> >
> > You prefer speed over correctness even for little used filesystems, which
> > is fine because I'm sick of arguing about it. The main thing for me is
> > that important filesystems can be correct and fast.
>
> I wouldn't characterise it as "arguing". It's development. Going and
> sticking enormous slowdowns into write() to fix some bug which nobody is
> hitting is insane.

Actually I'm doing this because I try to fix real data corruption problems
which people are hitting. You told me I can't get those fixes in until I
fix this problem.

> We need to find a better fix, that's all.

I actually found perform_write to be a speedup. And if perform_write is
merged then I would be happy to not fix the prepare_write path, or wait for
someone to come up with a better fix.
Re: [rfc][patch 0/3] a faster buffered write deadlock fix?
On Fri, 9 Feb 2007 12:31:16 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > > > > But that all becomes legacy path, so do we really care? Supposing fs
> > > > > maintainers like perform_write, then after the main ones have
> > > > > implementations we could switch over to the slow-but-correct
> > > > > prepare_write legacy path. Or we could leave it, or we could use
> > > > > Linus's slightly-less-buggy scheme... by that point I expect I'd be
> > > > > sick of arguing about it ;)
> > > >
> > > > It's worth "arguing" about. This is write(). What matters more??
> > >
> > > That's the legacy path that uses prepare/commit (ie. supposing that all
> > > major filesystems did get converted to perform_write).
> >
> > We'll never, ever, ever update and test all filesystems. What you're
> > calling "legacy" code will be there for all time.
>
> I didn't say all; I still prefer correct than fast;

For gawd's sake. You can make the kernel less buggy by removing SMP
support. Guess what? Tradeoffs exist.

> you are still free
> to keep the fast-and-buggy code in the legacy path.

You make it sound like this is a choice. It isn't. Nobody is going to go
in and convert all those filesystems.

> > I haven't had time to look at the perform_write stuff yet.
> >
> > > Of course I would still want my correct-but-slow version in that case,
> > > but I just wouldn't care to argue if you still wanted to keep it fast.
> >
> > This is write(). We just cannot go and double-copy all the memory or take
> > mmap_sem and do a full pagetable walk in there. It just means that we
> > haven't found a suitable solution yet.
>
> You prefer speed over correctness even for little used filesystems, which
> is fine because I'm sick of arguing about it. The main thing for me is that
> important filesystems can be correct and fast.

I wouldn't characterise it as "arguing". It's development. Going and
sticking enormous slowdowns into write() to fix some bug which nobody is
hitting is insane.

We need to find a better fix, that's all.
Re: [rfc][patch 0/3] a faster buffered write deadlock fix?
On Fri, Feb 09, 2007 at 02:52:49AM -0800, Andrew Morton wrote:
> On Fri, 9 Feb 2007 11:32:58 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > On Fri, Feb 09, 2007 at 02:09:54AM -0800, Andrew Morton wrote:
> > > On Fri, 9 Feb 2007 10:54:05 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > > >
> > > > That's still got a deadlock,
> > >
> > > It does?
> >
> > Yes, PG_lock vs mm->mmap_sem.
>
> Where? It requires that someone hold mmap_sem for writing as well as a
> page lock (in an order which would require some thought). Do we ever do
> that?

task1                   task2                   task3
lock_page(page)
                        down_read(mmap_sem)
                                                down_write(mmap_sem)
down_read(mmap_sem)
                        lock_page(page)

t1 is blocked by t3
t2 is blocked by t1
t3 is blocked by t2

OK, it isn't quite ABBA, but that's shorthand if we remember that an rwsem
read must obey normal lock nesting rules.

> > > > and also it doesn't work if we want to lock
> > > > the page when performing a minor fault (which I want to fix fault vs
> > > > invalidate),
> > >
> > > It's hard to discuss this without a description of what you want to fix
> > > there, and a description of how you plan to fix it.
> >
> > http://marc.theaimsgroup.com/?l=linux-mm&m=116865911432667&w=2
>
> mutter.
>
> Could perhaps fix that by running ClearPageUptodate in invalidate, thus
> forcing the pagefault code to take the page lock (which we already hold).
>
> That does mean that we'll fleetingly have a non-uptodate page in pagetables
> which is a bit nasty.
>
> Or, probably better, we could add a new page flag (heh) telling nopage that
> it needs to lock the page even if it's uptodate.

I don't see how either of those suggestions would fix this bug.

> > > > and also assumes nobody's ->nopage locks the page or
> > > > requires any resources that are held by prepare_write (something not
> > > > immediately clear to me with the cluster filesystems, at least).
> > >
> > > The nopage handler is filemap_nopage(). Are there any exceptions to that?
> >
> > OCFS2 and GFS2.
>
> So the rule is that ->nopage handlers against pagecache mustn't lock the
> page if it's already uptodate. That's OK. But it might conflict with the
> above invalidate proposal.

Well that _will_ be the rule if you want to do Linus's half-fix. It will
also be the rule that they're not to deadlock against prepare_write.

> > > > But that all becomes legacy path, so do we really care? Supposing fs
> > > > maintainers like perform_write, then after the main ones have
> > > > implementations we could switch over to the slow-but-correct
> > > > prepare_write legacy path. Or we could leave it, or we could use
> > > > Linus's slightly-less-buggy scheme... by that point I expect I'd be
> > > > sick of arguing about it ;)
> > >
> > > It's worth "arguing" about. This is write(). What matters more??
> >
> > That's the legacy path that uses prepare/commit (ie. supposing that all
> > major filesystems did get converted to perform_write).
>
> We'll never, ever, ever update and test all filesystems. What you're
> calling "legacy" code will be there for all time.

I didn't say all; I still prefer correct than fast; you are still free
to keep the fast-and-buggy code in the legacy path.

> I haven't had time to look at the perform_write stuff yet.
>
> > Of course I would still want my correct-but-slow version in that case,
> > but I just wouldn't care to argue if you still wanted to keep it fast.
>
> This is write(). We just cannot go and double-copy all the memory or take
> mmap_sem and do a full pagetable walk in there. It just means that we
> haven't found a suitable solution yet.

You prefer speed over correctness even for little used filesystems, which
is fine because I'm sick of arguing about it. The main thing for me is that
important filesystems can be correct and fast.
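[Illustrative aside, not from the thread's patches: one plausible mapping of the three tasks above onto 2.6-era kernel code paths. The bodies are a sketch only, meant to show where each lock in the cycle could come from, not to be built or run on their own.]

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/rwsem.h>

/* task1: a buffered write(). prepare_write/commit_write run with the page
 * locked; if copy_from_user() then faults on an mmapped source buffer, the
 * fault handler needs mmap_sem for read. */
static void task1_buffered_write(struct page *page, struct mm_struct *mm)
{
	lock_page(page);
	down_read(&mm->mmap_sem);	/* taken by the nested fault */
}

/* task2: an ordinary page fault. mmap_sem is taken for read before the
 * ->nopage handler runs; a handler (or a fault-vs-invalidate fix) that
 * locks the page then waits on task1. */
static void task2_page_fault(struct page *page, struct mm_struct *mm)
{
	down_read(&mm->mmap_sem);
	lock_page(page);
}

/* task3: anything taking mmap_sem for write, e.g. mmap()/munmap().
 * A queued writer blocks later readers, so task1's down_read() above waits
 * behind it even though task2 already holds the rwsem for read -- this is
 * the "rwsem read must obey normal lock nesting rules" point. */
static void task3_mmap_write(struct mm_struct *mm)
{
	down_write(&mm->mmap_sem);
}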
Re: [rfc][patch 0/3] a faster buffered write deadlock fix?
On Fri, 9 Feb 2007 11:32:58 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> On Fri, Feb 09, 2007 at 02:09:54AM -0800, Andrew Morton wrote:
> > On Fri, 9 Feb 2007 10:54:05 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > >
> > > That's still got a deadlock,
> >
> > It does?
>
> Yes, PG_lock vs mm->mmap_sem.

Where? It requires that someone hold mmap_sem for writing as well as a
page lock (in an order which would require some thought). Do we ever do
that?

> > > and also it doesn't work if we want to lock
> > > the page when performing a minor fault (which I want to fix fault vs
> > > invalidate),
> >
> > It's hard to discuss this without a description of what you want to fix
> > there, and a description of how you plan to fix it.
>
> http://marc.theaimsgroup.com/?l=linux-mm&m=116865911432667&w=2

mutter.

Could perhaps fix that by running ClearPageUptodate in invalidate, thus
forcing the pagefault code to take the page lock (which we already hold).

That does mean that we'll fleetingly have a non-uptodate page in pagetables
which is a bit nasty.

Or, probably better, we could add a new page flag (heh) telling nopage that
it needs to lock the page even if it's uptodate.

> > > and also assumes nobody's ->nopage locks the page or
> > > requires any resources that are held by prepare_write (something not
> > > immediately clear to me with the cluster filesystems, at least).
> >
> > The nopage handler is filemap_nopage(). Are there any exceptions to that?
>
> OCFS2 and GFS2.

So the rule is that ->nopage handlers against pagecache mustn't lock the
page if it's already uptodate. That's OK. But it might conflict with the
above invalidate proposal.

Gad. ocfs2_nopage() diddles with signals.

> > > But that all becomes legacy path, so do we really care? Supposing fs
> > > maintainers like perform_write, then after the main ones have
> > > implementations we could switch over to the slow-but-correct
> > > prepare_write legacy path. Or we could leave it, or we could use Linus's
> > > slightly-less-buggy scheme... by that point I expect I'd be sick of
> > > arguing about it ;)
> >
> > It's worth "arguing" about. This is write(). What matters more??
>
> That's the legacy path that uses prepare/commit (ie. supposing that all
> major filesystems did get converted to perform_write).

We'll never, ever, ever update and test all filesystems. What you're
calling "legacy" code will be there for all time.

I haven't had time to look at the perform_write stuff yet.

> Of course I would still want my correct-but-slow version in that case,
> but I just wouldn't care to argue if you still wanted to keep it fast.

This is write(). We just cannot go and double-copy all the memory or take
mmap_sem and do a full pagetable walk in there. It just means that we
haven't found a suitable solution yet.
Re: [rfc][patch 0/3] a faster buffered write deadlock fix?
On Fri, Feb 09, 2007 at 02:09:54AM -0800, Andrew Morton wrote:
> On Fri, 9 Feb 2007 10:54:05 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> >
> > That's still got a deadlock,
>
> It does?

Yes, PG_lock vs mm->mmap_sem.

> > and also it doesn't work if we want to lock
> > the page when performing a minor fault (which I want to fix fault vs
> > invalidate),
>
> It's hard to discuss this without a description of what you want to fix
> there, and a description of how you plan to fix it.

http://marc.theaimsgroup.com/?l=linux-mm&m=116865911432667&w=2

> > and also assumes nobody's ->nopage locks the page or
> > requires any resources that are held by prepare_write (something not
> > immediately clear to me with the cluster filesystems, at least).
>
> The nopage handler is filemap_nopage(). Are there any exceptions to that?

OCFS2 and GFS2.

> > But that all becomes legacy path, so do we really care? Supposing fs
> > maintainers like perform_write, then after the main ones have
> > implementations we could switch over to the slow-but-correct prepare_write
> > legacy path. Or we could leave it, or we could use Linus's
> > slightly-less-buggy scheme... by that point I expect I'd be sick of
> > arguing about it ;)
>
> It's worth "arguing" about. This is write(). What matters more??

That's the legacy path that uses prepare/commit (ie. supposing that all
major filesystems did get converted to perform_write).

Of course I would still want my correct-but-slow version in that case,
but I just wouldn't care to argue if you still wanted to keep it fast.
Re: [rfc][patch 0/3] a faster buffered write deadlock fix?
On Fri, 9 Feb 2007 10:54:05 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> On Fri, Feb 09, 2007 at 12:41:01AM -0800, Andrew Morton wrote:
> > On Thu, 8 Feb 2007 14:07:15 +0100 (CET) Nick Piggin <[EMAIL PROTECTED]>
> > wrote:
> >
> > > So I have finally finished a first slightly-working draft of my new aops
> > > op (perform_write) proposal. I would be interested to hear comments about
> > > it. Most of my issues and concerns are in the patch headers themselves,
> > > so reply to them.
> > >
> > > The patches are against my latest buffered-write-fix patchset.
> >
> > What happened with Linus's proposal to instantiate the page as pinned,
> > non-uptodate, unlocked and in pagecache while we poke the user address?
>
> That's still got a deadlock,

It does?

> and also it doesn't work if we want to lock
> the page when performing a minor fault (which I want to fix fault vs
> invalidate),

It's hard to discuss this without a description of what you want to fix
there, and a description of how you plan to fix it.

> and also assumes nobody's ->nopage locks the page or
> requires any resources that are held by prepare_write (something not
> immediately clear to me with the cluster filesystems, at least).

The nopage handler is filemap_nopage(). Are there any exceptions to that?

> But that all becomes legacy path, so do we really care? Supposing fs
> maintainers like perform_write, then after the main ones have implementations
> we could switch over to the slow-but-correct prepare_write legacy path.
> Or we could leave it, or we could use Linus's slightly-less-buggy scheme...
> by that point I expect I'd be sick of arguing about it ;)

It's worth "arguing" about. This is write(). What matters more??
Re: [rfc][patch 0/3] a faster buffered write deadlock fix?
On Fri, Feb 09, 2007 at 12:41:01AM -0800, Andrew Morton wrote:
> On Thu, 8 Feb 2007 14:07:15 +0100 (CET) Nick Piggin <[EMAIL PROTECTED]>
> wrote:
>
> > So I have finally finished a first slightly-working draft of my new aops
> > op (perform_write) proposal. I would be interested to hear comments about
> > it. Most of my issues and concerns are in the patch headers themselves,
> > so reply to them.
> >
> > The patches are against my latest buffered-write-fix patchset.
>
> What happened with Linus's proposal to instantiate the page as pinned,
> non-uptodate, unlocked and in pagecache while we poke the user address?

That's still got a deadlock, and also it doesn't work if we want to lock
the page when performing a minor fault (which I want to fix fault vs
invalidate), and also assumes nobody's ->nopage locks the page or
requires any resources that are held by prepare_write (something not
immediately clear to me with the cluster filesystems, at least).

But that all becomes legacy path, so do we really care? Supposing fs
maintainers like perform_write, then after the main ones have implementations
we could switch over to the slow-but-correct prepare_write legacy path.
Or we could leave it, or we could use Linus's slightly-less-buggy scheme...
by that point I expect I'd be sick of arguing about it ;)
Re: [rfc][patch 0/3] a faster buffered write deadlock fix?
On Thu, 8 Feb 2007 14:07:15 +0100 (CET) Nick Piggin <[EMAIL PROTECTED]> wrote:
> So I have finally finished a first slightly-working draft of my new aops
> op (perform_write) proposal. I would be interested to hear comments about
> it. Most of my issues and concerns are in the patch headers themselves,
> so reply to them.
>
> The patches are against my latest buffered-write-fix patchset.

What happened with Linus's proposal to instantiate the page as pinned,
non-uptodate, unlocked and in pagecache while we poke the user address?
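[Illustrative aside: one possible reading of the scheme Andrew refers to, sketched against the 2.6.20-era prepare_write/commit_write aops. This is an editorial reconstruction from the one-sentence description above, not Linus's actual proposal or code from any posted patch; the partial-copy backout, which is the genuinely hard part, is glossed over.]

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <asm/uaccess.h>

static int write_one_page_prefault(struct file *file, pgoff_t index,
				   unsigned from, unsigned to,
				   const char __user *buf, size_t bytes)
{
	struct address_space *mapping = file->f_mapping;
	const struct address_space_operations *a_ops = mapping->a_ops;
	struct page *page;
	char *kaddr;
	size_t left;
	int ret;

	/* Instantiate the page: in pagecache, refcounted ("pinned"), and, if
	 * newly allocated, not uptodate. grab_cache_page() returns it locked,
	 * so drop the lock before touching the user address. */
	page = grab_cache_page(mapping, index);
	if (!page)
		return -ENOMEM;
	unlock_page(page);

	/* "Poke the user address" with no page lock held, so a fault on the
	 * source buffer cannot deadlock against this page's lock. */
	if (fault_in_pages_readable(buf, bytes)) {
		page_cache_release(page);
		return -EFAULT;
	}

	lock_page(page);
	ret = a_ops->prepare_write(file, page, from, to);
	if (ret)
		goto out;

	/* Copy with pagefaults disabled (kmap_atomic() does that), so the
	 * copy itself cannot recurse into the fault path. */
	kaddr = kmap_atomic(page, KM_USER0);
	left = __copy_from_user_inatomic(kaddr + from, buf, bytes);
	kunmap_atomic(kaddr, KM_USER0);
	flush_dcache_page(page);

	if (left) {
		/* The prefaulted data went away again before we copied it.
		 * Real code would have to back out or retry here; dealing
		 * with a partially copied, not-uptodate page is what makes
		 * this "slightly-less-buggy" rather than correct. */
		ret = -EFAULT;
		goto out;
	}

	ret = a_ops->commit_write(file, page, from, to);
out:
	unlock_page(page);
	page_cache_release(page);
	return ret;
}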
Re: [rfc][patch 0/3] a faster buffered write deadlock fix?
On Thu, Feb 08, 2007 at 04:38:01PM -0800, Mark Fasheh wrote:
> On Thu, Feb 08, 2007 at 02:07:15PM +0100, Nick Piggin wrote:
> > The problem is that the existing aops interface is crap. "correct, fast,
> > compatible -- choose any 2"
>
> Agreed. There's lots of problems with the interface (imho), but my biggest
> two issues are the page lock being held on ->prepare_write() /
> ->commit_write() and the fact that the file system only sees the write one
> page at a time.
>
> > So I have finally finished a first slightly-working draft of my new aops
> > op (perform_write) proposal. I would be interested to hear comments about
> > it. Most of my issues and concerns are in the patch headers themselves,
> > so reply to them.
>
> I like ->perform_write(). It allows the file system to make re-use of the
> checks in the generic write path, but gives a maximum amount of information
> about the overall operation to be done. There's an added advantage in that
> some file systems (ocfs2 is one of these) want to be more careful about
> ordering page locks, which should be much easier with it.

Yeah, if possible I like a range based interface rather than page based.
As you say it gives the most information with the least constraints.

> If this goes in, it could probably be helpful to me in some of the code I'm
> currently writing which needs to be better about page lock / cluster lock
> ordering and wants to see as much of the (allocating) writes as possible.

I think it would be important to have a non-trivial user of this new API
before it goes into mainline. It would be great if you could look at using
it, after it passes some review.

Thanks,
Nick
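[Illustrative aside: to make "range based" concrete, the shape of such an aop might look roughly like the following. The signature is an editorial guess for the sake of discussion only; the real prototype lives in the patch headers Nick posted, which this thread does not reproduce.]

#include <linux/fs.h>
#include <linux/uio.h>

/* Hypothetical shape of a range-based write aop (illustrative only).
 * The filesystem is handed the whole write -- the file position plus the
 * full iovec -- in one call, so it can order cluster locks and page locks
 * across the range as it sees fit, instead of being called back one page at
 * a time with that page already locked, as ->prepare_write()/->commit_write()
 * are. */
typedef ssize_t (*perform_write_fn)(struct file *file,
				    struct address_space *mapping,
				    const struct iovec *iov,
				    unsigned long nr_segs,
				    loff_t pos);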
Re: [rfc][patch 0/3] a faster buffered write deadlock fix?
On Thu, Feb 08, 2007 at 02:07:15PM +0100, Nick Piggin wrote:
> The problem is that the existing aops interface is crap. "correct, fast,
> compatible -- choose any 2"

Agreed. There's lots of problems with the interface (imho), but my biggest
two issues are the page lock being held on ->prepare_write() /
->commit_write() and the fact that the file system only sees the write one
page at a time.

> So I have finally finished a first slightly-working draft of my new aops
> op (perform_write) proposal. I would be interested to hear comments about
> it. Most of my issues and concerns are in the patch headers themselves,
> so reply to them.

I like ->perform_write(). It allows the file system to make re-use of the
checks in the generic write path, but gives a maximum amount of information
about the overall operation to be done. There's an added advantage in that
some file systems (ocfs2 is one of these) want to be more careful about
ordering page locks, which should be much easier with it.

If this goes in, it could probably be helpful to me in some of the code I'm
currently writing which needs to be better about page lock / cluster lock
ordering and wants to see as much of the (allocating) writes as possible.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[EMAIL PROTECTED]
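[Illustrative aside: roughly what the generic buffered-write path of that era does per page, heavily simplified (no iovec handling, no partial-copy backout, no dirty-page throttling). It shows the two properties Mark objects to above: the page lock is held across ->prepare_write(), the user copy, and ->commit_write(), and the filesystem only ever sees one page of the write at a time. A sketch, not the actual mm/filemap.c code.]

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <asm/uaccess.h>

/* Write 'bytes' bytes at 'pos'; assumes the range does not cross a page. */
static ssize_t write_one_page_legacy(struct file *file,
				     const char __user *buf,
				     size_t bytes, loff_t pos)
{
	struct address_space *mapping = file->f_mapping;
	const struct address_space_operations *a_ops = mapping->a_ops;
	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
	unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
	struct page *page;
	int ret;

	page = grab_cache_page(mapping, index);	/* returned locked */
	if (!page)
		return -ENOMEM;

	ret = a_ops->prepare_write(file, page, offset, offset + bytes);
	if (ret == 0) {
		char *kaddr = kmap(page);

		/* The user copy runs with the page lock held. If 'buf' is an
		 * mmap of this same page, the fault taken here is the
		 * deadlock this whole thread is about. */
		if (copy_from_user(kaddr + offset, buf, bytes))
			ret = -EFAULT;
		kunmap(page);
		flush_dcache_page(page);

		if (ret == 0)
			ret = a_ops->commit_write(file, page, offset,
						  offset + bytes);
	}

	unlock_page(page);
	page_cache_release(page);

	return ret ? ret : (ssize_t)bytes;
}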
[rfc][patch 0/3] a faster buffered write deadlock fix?
In my last set of numbers for my buffered-write deadlock fix using 2 copies
per page, I realised there is no real performance hit for !uptodate pages as
opposed to uptodate ones. This is unexpected because the uptodate pages only
require a single copy...

The problem turns out to be operator error: I forgot tmpfs won't use this
prepare_write path, so sorry about that.

On ext2, copy 64MB of data from /dev/zero (IO isn't involved), using 4K and
64K block sizes, and conv=notrunc for testing overwriting of uptodate pages.
Numbers are elapsed time in seconds, lower is better.

                2.6.20    buffered write fix
4K              0.0742    0.1208 (1.63x)
4K-uptodate     0.0493    0.0479 (0.97x)
64K             0.0671    0.1068 (1.59x)
64K-uptodate    0.0357    0.0362 (1.01x)

So we get about a 60% performance hit, which is more expected. I guess if
0.5% doesn't fly, then 60% is right out ;)

If there were any misconceptions, the problem is not that the code is
incredibly tricky or impossible to fix with good performance. The problem is
that the existing aops interface is crap: "correct, fast, compatible --
choose any 2".

So I have finally finished a first slightly-working draft of my new aops op
(perform_write) proposal. I would be interested to hear comments about it.
Most of my issues and concerns are in the patch headers themselves, so reply
to them.

The patches are against my latest buffered-write-fix patchset. This means
filesystems not implementing the new aop will remain safe, if slow. Here are
some numbers after converting ext2 to the new aop:

                2.6.20    perform_write aop
4K              0.0742    0.0769 (1.04x)
4K-uptodate     0.0493    0.0475 (0.96x)
64K             0.0671    0.0613 (0.91x)
64K-uptodate    0.0357    0.0343 (0.96x)

Thanks,
Nick

--
SuSE Labs