Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote:
> On Thu, 10 May 2007, Nick Piggin wrote:
> > > The filesystem (or page cache) allows pages beyond i_size to come
> > > in there? That wasn't a problem before, was it? But now it is?
> >
> > The filesystem still doesn't, but if i_size is updated after the page
> > is returned, we can have a problem that was previously taken care of
> > with the truncate_count but now isn't.
>
> But... I thought the page lock was now taking care of that in your
> scheme? truncate_inode_pages has to wait for the page lock, then it
> finds the page is mapped and... ahh, it finds the copiee page is not
> mapped, so doesn't do its own little unmap_mapping_range, and the
> copied page squeaks through. Drat.
>
> I really think the truncate_count solution worked better, for
> truncation anyway. There may be persuasive reasons you need the page
> lock for invalidation: I gave up on trying to understand the required
> behaviour(s) for invalidation.
>
> So, bring back (the original use of, not my tree marker use of)
> truncate_count? Hmm, you probably don't want to do that, because there
> was some pleasure in removing the strange barriers associated with it.
> A second unmap_mapping_range is just one line of code - but it sure
> feels like a defeat to me, calling the whole exercise into question.
> (But then, you'd be right to say my perfectionism made it impossible
> for me to come up with any solution to the invalidation issues.)

Well we could bring back the truncate_count, but I think that sucks
because that's moving work into the page fault handler in order to
avoid a bit of work when truncating mapped files.

> > > But that's a change we could have made at
> > > any time if we'd bothered, it's not really the issue here.
> >
> > I don't see how you could, because you need to increment truncate_count.
>
> Though indeed we did so, I don't see that we needed to increment
> truncate_count in that case (nobody could be coming through do_no_page
> on that file, when there are no mappings of it).

Of course :P

--
SUSE Labs, Novell Inc.
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, 10 May 2007, Nick Piggin wrote:
> > The filesystem (or page cache) allows pages beyond i_size to come
> > in there? That wasn't a problem before, was it? But now it is?
>
> The filesystem still doesn't, but if i_size is updated after the page
> is returned, we can have a problem that was previously taken care of
> with the truncate_count but now isn't.

But... I thought the page lock was now taking care of that in your
scheme? truncate_inode_pages has to wait for the page lock, then it
finds the page is mapped and... ahh, it finds the copiee page is not
mapped, so doesn't do its own little unmap_mapping_range, and the
copied page squeaks through. Drat.

I really think the truncate_count solution worked better, for
truncation anyway. There may be persuasive reasons you need the page
lock for invalidation: I gave up on trying to understand the required
behaviour(s) for invalidation.

So, bring back (the original use of, not my tree marker use of)
truncate_count? Hmm, you probably don't want to do that, because there
was some pleasure in removing the strange barriers associated with it.
A second unmap_mapping_range is just one line of code - but it sure
feels like a defeat to me, calling the whole exercise into question.
(But then, you'd be right to say my perfectionism made it impossible
for me to come up with any solution to the invalidation issues.)

> > Suspect you'd need a barrier of some kind between the i_size_write and
> > the mapping_mapped test?
>
> The unmap_mapping_range that runs after the truncate_inode_pages should
> run in the correct order, I believe.

Yes, if there's going to be that backup call, the first won't really
need a barrier.

> > But that's a change we could have made at
> > any time if we'd bothered, it's not really the issue here.
>
> I don't see how you could, because you need to increment truncate_count.

Though indeed we did so, I don't see that we needed to increment
truncate_count in that case (nobody could be coming through do_no_page
on that file, when there are no mappings of it).

> But I believe this is fixing the issue, even if it does so in a peripheral
> manner, because it avoids the added cost for unmapped files.

It's a small improvement to your common case, I agree.

Hugh
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote:
> On Wed, 9 May 2007, Nick Piggin wrote:
> > Hugh Dickins wrote:
> > > On Wed, 2 May 2007, Nick Piggin wrote:
> > > > > But I'm pretty sure (to use your words!) regular truncate was not racy
> > > > > before: I believe Andrea's sequence count was handling that case fine,
> > > > > without a second unmap_mapping_range.
> > > >
> > > > OK, I think you're right. I _think_ it should also be OK with the
> > > > lock_page version as well: we should not be able to have any pages
> > > > after the first unmap_mapping_range call, because of the i_size
> > > > write. So if we have no pages, there is nothing to 'cow' from.
> > >
> > > I'd be delighted if you can remove those later unmap_mapping_ranges.
> > > As I recall, the important thing for the copy pages is to be holding
> > > the page lock (or whatever other serialization) on the copied page
> > > still while the copy page is inserted into pagetable: that looks
> > > to be so in your __do_fault.
> >
> > Hmm, on second thoughts, I think I was right the first time, and do
> > need the unmap after the pages are truncated. With the lock_page code,
> > after the first unmap, we can get new ptes mapping pages, and
> > subsequently they can be COWed and then the original pte zapped before
> > the truncate loop checks it.
>
> The filesystem (or page cache) allows pages beyond i_size to come
> in there? That wasn't a problem before, was it? But now it is?

The filesystem still doesn't, but if i_size is updated after the page
is returned, we can have a problem that was previously taken care of
with the truncate_count but now isn't.

> > However, I wonder if we can't test mapping_mapped before the spinlock,
> > which would make most truncates cheaper?
>
> Slightly cheaper, yes, though I doubt it'd be much in comparison with
> actually doing any work in unmap_mapping_range or truncate_inode_pages.

But if we're supposing the common case for truncate is unmapped
mappings, then the main cost there will be the locking, which I'm
trying to avoid. Hopefully with this patch, most truncate workloads
would get faster, even though truncate mapped files is going to be
unavoidably slower.

> Suspect you'd need a barrier of some kind between the i_size_write and
> the mapping_mapped test?

The unmap_mapping_range that runs after the truncate_inode_pages should
run in the correct order, I believe.

> But that's a change we could have made at
> any time if we'd bothered, it's not really the issue here.

I don't see how you could, because you need to increment truncate_count.

But I believe this is fixing the issue, even if it does so in a
peripheral manner, because it avoids the added cost for unmapped files.

--
SUSE Labs, Novell Inc.
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Wed, 9 May 2007, Nick Piggin wrote:
> Hugh Dickins wrote:
> > On Wed, 2 May 2007, Nick Piggin wrote:
> > > > But I'm pretty sure (to use your words!) regular truncate was not racy
> > > > before: I believe Andrea's sequence count was handling that case fine,
> > > > without a second unmap_mapping_range.
> > >
> > > OK, I think you're right. I _think_ it should also be OK with the
> > > lock_page version as well: we should not be able to have any pages
> > > after the first unmap_mapping_range call, because of the i_size
> > > write. So if we have no pages, there is nothing to 'cow' from.
> >
> > I'd be delighted if you can remove those later unmap_mapping_ranges.
> > As I recall, the important thing for the copy pages is to be holding
> > the page lock (or whatever other serialization) on the copied page
> > still while the copy page is inserted into pagetable: that looks
> > to be so in your __do_fault.
>
> Hmm, on second thoughts, I think I was right the first time, and do
> need the unmap after the pages are truncated. With the lock_page code,
> after the first unmap, we can get new ptes mapping pages, and
> subsequently they can be COWed and then the original pte zapped before
> the truncate loop checks it.

The filesystem (or page cache) allows pages beyond i_size to come
in there? That wasn't a problem before, was it? But now it is?

> However, I wonder if we can't test mapping_mapped before the spinlock,
> which would make most truncates cheaper?

Slightly cheaper, yes, though I doubt it'd be much in comparison with
actually doing any work in unmap_mapping_range or truncate_inode_pages.

Suspect you'd need a barrier of some kind between the i_size_write and
the mapping_mapped test? But that's a change we could have made at
any time if we'd bothered, it's not really the issue here.
Hugh
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote:
> On Wed, 2 May 2007, Nick Piggin wrote:
> > > But I'm pretty sure (to use your words!) regular truncate was not racy
> > > before: I believe Andrea's sequence count was handling that case fine,
> > > without a second unmap_mapping_range.
> >
> > OK, I think you're right. I _think_ it should also be OK with the
> > lock_page version as well: we should not be able to have any pages
> > after the first unmap_mapping_range call, because of the i_size
> > write. So if we have no pages, there is nothing to 'cow' from.
>
> I'd be delighted if you can remove those later unmap_mapping_ranges.
> As I recall, the important thing for the copy pages is to be holding
> the page lock (or whatever other serialization) on the copied page
> still while the copy page is inserted into pagetable: that looks
> to be so in your __do_fault.

Hmm, on second thoughts, I think I was right the first time, and do
need the unmap after the pages are truncated. With the lock_page code,
after the first unmap, we can get new ptes mapping pages, and
subsequently they can be COWed and then the original pte zapped before
the truncate loop checks it.

However, I wonder if we can't test mapping_mapped before the spinlock,
which would make most truncates cheaper?

--
SUSE Labs, Novell Inc.
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2007-04-24 15:02:51.0 +1000
+++ linux-2.6/mm/filemap.c	2007-05-09 17:30:47.0 +1000
@@ -2579,8 +2579,7 @@
 	if (rw == WRITE) {
 		write_len = iov_length(iov, nr_segs);
 		end = (offset + write_len - 1) >> PAGE_CACHE_SHIFT;
-		if (mapping_mapped(mapping))
-			unmap_mapping_range(mapping, offset, write_len, 0);
+		unmap_mapping_range(mapping, offset, write_len, 0);
 	}
 
 	retval = filemap_write_and_wait(mapping);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2007-05-09 17:25:28.0 +1000
+++ linux-2.6/mm/memory.c	2007-05-09 17:30:22.0 +1000
@@ -1956,6 +1956,9 @@
 	pgoff_t hba = holebegin >> PAGE_SHIFT;
 	pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
+	if (!mapping_mapped(mapping))
+		return;
+
 	/* Check for overflow. */
 	if (sizeof(holelen) > sizeof(hlen)) {
 		long long holeend =
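[Editorial note: the truncate ordering being debated above can be sketched as a toy model in plain C. This is not kernel code; the struct layouts and helper bodies are invented for illustration, and only the call ordering (i_size write, unmap, drop pages, backup unmap) and the mapping_mapped() fast path mirror the discussion.]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the truncate ordering under discussion: publish the new
 * i_size first, unmap ptes beyond it, drop the pagecache pages, then
 * unmap again to catch COWed copies that raced in through the fault
 * path.  mapping_mapped() lets the common unmapped case skip the pte
 * walks entirely, which is the cheapening the patch is after. */
struct address_space { bool mapped; int unmap_calls; };
struct inode { long i_size; struct address_space *mapping; };

static bool mapping_mapped(struct address_space *m) { return m->mapped; }

static void unmap_mapping_range(struct address_space *m)
{
	if (!mapping_mapped(m))	/* the fast path added by the patch */
		return;
	m->unmap_calls++;	/* stand-in for the real pte walk */
}

static void truncate_inode_pages(struct address_space *m)
{
	(void)m;		/* stand-in for freeing pagecache pages */
}

static void vmtruncate(struct inode *inode, long offset)
{
	inode->i_size = offset;			/* i_size_write */
	unmap_mapping_range(inode->mapping);	/* zap ptes beyond i_size */
	truncate_inode_pages(inode->mapping);	/* free pagecache pages */
	unmap_mapping_range(inode->mapping);	/* backup call for COW races */
}
```

In this model an unmapped file pays nothing for either unmap call, while a mapped file pays for both, matching the trade-off described in the thread.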
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Fri, 2007-05-04 at 19:23 +1000, Nick Piggin wrote:
> These ops could also be put to use in bit spinlocks, buffer lock, and
> probably a few other places too.

Ok, the performance hit seems to be under control (especially with the
bigger benchmark showing actual improvements). There's a little bogon
with the PG_waiters bit that you already know about, but apart from
that it should be ok.

I must say I absolutely _LOVE_ the bitops with explicit _lock/_unlock
semantics. That should allow us to remove a whole bunch of dodgy
barriers and smp_mb__before_whatever_magic_crap() things we have all
over the place, by providing precisely the expected semantics for bit
locks. There are quite a few people who've been trying to do bit locks,
and I've always been very worried by how easy it is to get the barriers
wrong (or to have too many barriers in the fast path) with these.

There are a couple of things we might want to think about regarding the
actual API to bit locks... the API you propose is simple, but it might
not fit some of the most exotic usage requirements, which typically are
related to manipulating other bits along with the lock bit. We might
just ignore them though. In the case of the page lock, that only hits
the slow path, and I would expect other usage scenarios to be similar.

Cheers,
Ben.
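[Editorial note: the semantics Ben is praising can be sketched in portable C11 atomics. This is not the PowerPC implementation from the patch; it is an illustrative model in which test_and_set_bit_lock needs only acquire ordering and clear_bit_unlock only release ordering, so no full barriers appear in the fast path.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of bitops with explicit lock/unlock semantics in C11 atomics.
 * The names follow the proposed kernel API; real kernel versions are
 * per-arch assembly (see the powerpc patch elsewhere in this thread). */
static bool test_and_set_bit_lock(unsigned nr, _Atomic unsigned long *addr)
{
	unsigned long mask = 1UL << nr;
	/* acquire: later accesses cannot move above taking the lock */
	return atomic_fetch_or_explicit(addr, mask,
					memory_order_acquire) & mask;
}

static void clear_bit_unlock(unsigned nr, _Atomic unsigned long *addr)
{
	unsigned long mask = 1UL << nr;
	/* release: earlier accesses cannot move below dropping the lock */
	atomic_fetch_and_explicit(addr, ~mask, memory_order_release);
}

/* A minimal bit spinlock built on the two primitives. */
static void bit_spin_lock(unsigned nr, _Atomic unsigned long *addr)
{
	while (test_and_set_bit_lock(nr, addr))
		;	/* spin; a real lock would cpu_relax() here */
}

static void bit_spin_unlock(unsigned nr, _Atomic unsigned long *addr)
{
	clear_bit_unlock(nr, addr);
}
```

Because the ordering is attached to the bitops themselves, callers need no extra smp_mb__* incantations around the lock and unlock sites.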
Re: 2.6.22 -mm merge plans -- vm bugfixes
Nick Piggin wrote:
> Nick Piggin wrote:
> > Christoph Hellwig wrote:
> > > Is that every fork/exec or just under certain circumstances? A 5%
> > > regression on every fork/exec is not acceptable.
> >
> > Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4 numbers
> > will be improved as well with that patch. Then if we have specific
> > lock/unlock bitops, I hope it should reduce that further.
>
> OK, with the races and missing barriers fixed from the previous patch,
> plus the attached one added (+patch3), numbers are better again (I'm
> not sure if I have the ppc barriers correct though).
>
> These ops could also be put to use in bit spinlocks, buffer lock, and
> probably a few other places too.
>
>           fault        fork          exec
> 2.6.21    1.49-1.51    164.6-170.8   741.8-760.3
> +patch    1.71-1.73    175.2-180.8   780.5-794.2
> +patch2   1.61-1.63    169.8-175.0   748.6-757.0
> +patch3   1.54-1.57    165.6-170.9   748.5-757.5
>
> So fault performance goes to under 5%, fork is in the noise, exec is
> still up 1%, but maybe that's noise or cache effects again.

OK, with my new lock/unlock_page, dd if=large (bigger than RAM) sparse
file of=/dev/null with an experimentally optimal block size (32K) goes
from 626MB/s to 683MB/s on 2 CPU G5 booted with maxcpus=1.

--
SUSE Labs, Novell Inc.
Re: 2.6.22 -mm merge plans -- vm bugfixes
Nick Piggin wrote:
> Christoph Hellwig wrote:
> > Is that every fork/exec or just under certain circumstances? A 5%
> > regression on every fork/exec is not acceptable.
>
> Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4 numbers
> will be improved as well with that patch. Then if we have specific
> lock/unlock bitops, I hope it should reduce that further.

OK, with the races and missing barriers fixed from the previous patch,
plus the attached one added (+patch3), numbers are better again (I'm
not sure if I have the ppc barriers correct though).

These ops could also be put to use in bit spinlocks, buffer lock, and
probably a few other places too.

          fault        fork          exec
2.6.21    1.49-1.51    164.6-170.8   741.8-760.3
+patch    1.71-1.73    175.2-180.8   780.5-794.2
+patch2   1.61-1.63    169.8-175.0   748.6-757.0
+patch3   1.54-1.57    165.6-170.9   748.5-757.5

So fault performance goes to under 5%, fork is in the noise, exec is
still up 1%, but maybe that's noise or cache effects again.

--
SUSE Labs, Novell Inc.

Index: linux-2.6/include/asm-powerpc/bitops.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/bitops.h	2007-05-04 16:08:20.0 +1000
+++ linux-2.6/include/asm-powerpc/bitops.h	2007-05-04 16:14:39.0 +1000
@@ -87,6 +87,24 @@
 		: "cc" );
 }
 
+static __inline__ void clear_bit_unlock(int nr, volatile unsigned long *addr)
+{
+	unsigned long old;
+	unsigned long mask = BITOP_MASK(nr);
+	unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr);
+
+	__asm__ __volatile__(
+	LWSYNC_ON_SMP
+"1:"	PPC_LLARX "%0,0,%3	# clear_bit_unlock\n"
+	"andc	%0,%0,%2\n"
+	PPC405_ERR77(0,%3)
+	PPC_STLCX "%0,0,%3\n"
+	"bne-	1b"
+	: "=&r" (old), "+m" (*p)
+	: "r" (mask), "r" (p)
+	: "cc" );
+}
+
 static __inline__ void change_bit(int nr, volatile unsigned long *addr)
 {
 	unsigned long old;
@@ -126,6 +144,27 @@
 	return (old & mask) != 0;
 }
 
+static __inline__ int test_and_set_bit_lock(unsigned long nr,
+					volatile unsigned long *addr)
+{
+	unsigned long old, t;
+	unsigned long mask = BITOP_MASK(nr);
+	unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr);
+
+	__asm__ __volatile__(
+"1:"	PPC_LLARX "%0,0,%3	# test_and_set_bit_lock\n"
+	"or	%1,%0,%2 \n"
+	PPC405_ERR77(0,%3)
+	PPC_STLCX "%1,0,%3 \n"
+	"bne-	1b"
+	ISYNC_ON_SMP
+	: "=&r" (old), "=&r" (t)
+	: "r" (mask), "r" (p)
+	: "cc", "memory");
+
+	return (old & mask) != 0;
+}
+
 static __inline__ int test_and_clear_bit(unsigned long nr,
 					 volatile unsigned long *addr)
 {
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h	2007-05-04 16:14:36.0 +1000
+++ linux-2.6/include/linux/pagemap.h	2007-05-04 16:17:34.0 +1000
@@ -136,13 +136,18 @@
 extern void FASTCALL(__wait_on_page_locked(struct page *page));
 extern void FASTCALL(unlock_page(struct page *page));
 
+static inline int trylock_page(struct page *page)
+{
+	return (likely(!TestSetPageLocked_Lock(page)));
+}
+
 /*
  * lock_page may only be called if we have the page's inode pinned.
  */
 static inline void lock_page(struct page *page)
 {
 	might_sleep();
-	if (unlikely(TestSetPageLocked(page)))
+	if (!trylock_page(page))
 		__lock_page(page);
 }
 
@@ -153,7 +158,7 @@
 static inline void lock_page_nosync(struct page *page)
 {
 	might_sleep();
-	if (unlikely(TestSetPageLocked(page)))
+	if (!trylock_page(page))
 		__lock_page_nosync(page);
 }
Index: linux-2.6/drivers/scsi/sg.c
===================================================================
--- linux-2.6.orig/drivers/scsi/sg.c	2007-04-12 14:35:08.0 +1000
+++ linux-2.6/drivers/scsi/sg.c	2007-05-04 16:23:27.0 +1000
@@ -1734,7 +1734,7 @@
 		 */
 		flush_dcache_page(pages[i]);
 		/* ?? Is locking needed? I don't think so */
-		/* if (TestSetPageLocked(pages[i]))
+		/* if (!trylock_page(pages[i]))
 		   goto out_unlock; */
 	}
Index: linux-2.6/fs/cifs/file.c
===================================================================
--- linux-2.6.orig/fs/cifs/file.c	2007-04-12 14:35:09.0 +1000
+++ linux-2.6/fs/cifs/file.c	2007-05-04 16:23:36.0 +1000
@@ -1229,7 +1229,7 @@
 
 		if (first < 0)
 			lock_page(page);
-		else if (TestSetPageLocked(page))
+		else if (!trylock_page(page))
 			break;
 
 		if (unlikely(page->mapping != mapping)) {
Index: linux-2.6/fs/jbd/commit.c
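[Editorial note: the lock_page() fast/slow-path split in the pagemap.h hunk above can be modeled in userspace C11. The struct page, PG_locked_bit, and the counter standing in for __lock_page() below are invented for illustration; only the shape of the code (acquire-semantics trylock on the fast path, fall into the slow path on contention) mirrors the patch.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of the patch's lock_page() structure. */
enum { PG_locked_bit = 0 };

struct page { _Atomic unsigned long flags; };

static bool trylock_page(struct page *page)
{
	unsigned long mask = 1UL << PG_locked_bit;
	/* acquire semantics, like TestSetPageLocked_Lock in the patch */
	return !(atomic_fetch_or_explicit(&page->flags, mask,
					  memory_order_acquire) & mask);
}

static int slow_path_taken;	/* counts __lock_page() calls in the model */

static void lock_page(struct page *page)
{
	if (!trylock_page(page))
		slow_path_taken++;	/* __lock_page(page) would sleep here */
}

static void unlock_page(struct page *page)
{
	/* release semantics, so the critical section cannot leak past it */
	atomic_fetch_and_explicit(&page->flags, ~(1UL << PG_locked_bit),
				  memory_order_release);
}
```

The point of the split is that the uncontended case is a single atomic op with acquire ordering, while the sleeping machinery only runs when the bit was already set.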
Re: 2.6.22 -mm merge plans -- vm bugfixes
Nick Piggin wrote: Nick Piggin wrote: Christoph Hellwig wrote: Is that every fork/exec or just under certain circumstances? A 5% regression on every fork/exec is not acceptable. Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4 numbers will be improved as well with that patch. Then if we have specific lock/unlock bitops, I hope it should reduce that further. OK, with the races and missing barriers fixed from the previous patch, plus the attached one added (+patch3), numbers are better again (I'm not sure if I have the ppc barriers correct though). These ops could also be put to use in bit spinlocks, buffer lock, and probably a few other places too. pagefault fork exec 2.6.21 1.49-1.51 164.6-170.8 741.8-760.3 +patch 1.71-1.73 175.2-180.8 780.5-794.2 +patch2 1.61-1.63 169.8-175.0 748.6-757.0 +patch3 1.54-1.57 165.6-170.9 748.5-757.5 So fault performance goes to under 5%, fork is in the noise, exec is still up 1%, but maybe that's noise or cache effects again. OK, with my new lock/unlock_page, dd if=large (bigger than RAM) sparse file of=/dev/null with an experimentally optimal block size (32K) goes from 626MB/s to 683MB/s on 2 CPU G5 booted with maxcpus=1. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans -- vm bugfixes
Andrew Morton wrote: On Thu, 03 May 2007 11:32:23 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: void fastcall unlock_page(struct page *page) { + VM_BUG_ON(!PageLocked(page)); smp_mb__before_clear_bit(); - if (!TestClearPageLocked(page)) - BUG(); - smp_mb__after_clear_bit(); - wake_up_page(page, PG_locked); + ClearPageLocked(page); + if (unlikely(test_bit(PG_waiters, &page->flags))) { + clear_bit(PG_waiters, &page->flags); + wake_up_page(page, PG_locked); + } } Why is that significantly faster than plain old wake_up_page(), which tests waitqueue_active()? Because it needs fewer barriers and doesn't touch a random hash cacheline in the fastpath. -- SUSE Labs, Novell Inc.
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, 03 May 2007 11:32:23 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: > void fastcall unlock_page(struct page *page) > { > + VM_BUG_ON(!PageLocked(page)); > smp_mb__before_clear_bit(); > - if (!TestClearPageLocked(page)) > - BUG(); > - smp_mb__after_clear_bit(); > - wake_up_page(page, PG_locked); > + ClearPageLocked(page); > + if (unlikely(test_bit(PG_waiters, &page->flags))) { > + clear_bit(PG_waiters, &page->flags); > + wake_up_page(page, PG_locked); > + } > } Why is that significantly faster than plain old wake_up_page(), which tests waitqueue_active()?
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote: On Thu, 3 May 2007, Nick Piggin wrote: @@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!) { DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); + set_bit(PG_waiters, &page->flags); + if (unlikely(!TestSetPageLocked(page))) { What happens if another cpu is coming through __lock_page at the same time, did its set_bit, now finds PageLocked, and so proceeds to the __wait_on_bit_lock? But this cpu now clears PG_waiters, so this task's unlock_page won't wake the other? You're right, we can't clear the bit here. Doubt it mattered much anyway? Ah yes, that's a good easy answer. In fact, just remove this whole test and block (we already tried TestSetPageLocked outside just a short while ago, so this repeat won't often save anything). Yeah, I was getting too clever for my own boots :) I think the patch has merit though. Unfortunate that it uses another page flag, however it seemed to give quite a bit of speedup on unlock_page (probably from both the barriers and an extra random cacheline load (from the hash)). I guess it has to get good results from more benchmarks... BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page above... that barrier is in the slow path as well though, so it shouldn't matter either. I vaguely wondered how such barriers had managed to dissolve away, but cranking my brain up to think about barriers takes far too long. That barrier was one too many :) However I believe the fastpath barrier can go away because the PG_locked operation is depending on the same cacheline as PG_waiters. -- SUSE Labs, Novell Inc.
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, 3 May 2007, Nick Piggin wrote: > >>@@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!) > > > { > > > DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); > > > > > >+ set_bit(PG_waiters, &page->flags); > > >+ if (unlikely(!TestSetPageLocked(page))) { > > > > What happens if another cpu is coming through __lock_page at the > > same time, did its set_bit, now finds PageLocked, and so proceeds > > to the __wait_on_bit_lock? But this cpu now clears PG_waiters, > > so this task's unlock_page won't wake the other? > > You're right, we can't clear the bit here. Doubt it mattered much anyway? Ah yes, that's a good easy answer. In fact, just remove this whole test and block (we already tried TestSetPageLocked outside just a short while ago, so this repeat won't often save anything). > > BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page > above... that barrier is in the slow path as well though, so it shouldn't > matter either. I vaguely wondered how such barriers had managed to dissolve away, but cranking my brain up to think about barriers takes far too long. > > >+ clear_bit(PG_waiters, &page->flags); > > >+ return; > > >+ } > > > __wait_on_bit_lock(page_waitqueue(page), &wait, sync_page, > > >TASK_UNINTERRUPTIBLE); > >> }
Re: 2.6.22 -mm merge plans -- vm bugfixes
Christoph Hellwig wrote: On Thu, May 03, 2007 at 11:32:23AM +1000, Nick Piggin wrote: The attached patch gets performance up a bit by avoiding some barriers and some cachelines: G5 pagefault fork exec 2.6.21 1.49-1.51 164.6-170.8 741.8-760.3 +patch 1.71-1.73 175.2-180.8 780.5-794.2 +patch2 1.61-1.63 169.8-175.0 748.6-757.0 So that brings the fork/exec hits down to much less than 5%, and would likely speed up other things that lock the page, like write or page reclaim. Is that every fork/exec or just under certain circumstances? A 5% regression on every fork/exec is not acceptable. Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4 numbers will be improved as well with that patch. Then if we have specific lock/unlock bitops, I hope it should reduce that further. The overhead that is there should just be coming from the extra overhead in the file backed fault handler. For noop fork/execs, I think that tends to be more pronounced, it is hard to see any difference on any non-micro benchmark. The other thing is that I think there could be some cache effects happening -- for example the exec numbers on the 2nd line are disproportionately large. It definitely isn't a good thing to drop performance anywhere though, so I'll keep looking for improvements. -- SUSE Labs, Novell Inc.
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote: On Thu, 3 May 2007, Nick Piggin wrote: The problem is that lock/unlock_page is expensive on powerpc, and if we improve that, we improve more than just the fault handler... The attached patch gets performance up a bit by avoiding some barriers and some cachelines: There's a strong whiff of raciness about this... but I could very easily be wrong. Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c 2007-05-02 15:00:26.0 +1000 +++ linux-2.6/mm/filemap.c 2007-05-03 08:34:32.0 +1000 @@ -532,11 +532,13 @@ */ void fastcall unlock_page(struct page *page) { + VM_BUG_ON(!PageLocked(page)); smp_mb__before_clear_bit(); - if (!TestClearPageLocked(page)) - BUG(); - smp_mb__after_clear_bit(); - wake_up_page(page, PG_locked); + ClearPageLocked(page); + if (unlikely(test_bit(PG_waiters, &page->flags))) { + clear_bit(PG_waiters, &page->flags); + wake_up_page(page, PG_locked); + } } EXPORT_SYMBOL(unlock_page); @@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!) { DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); + set_bit(PG_waiters, &page->flags); + if (unlikely(!TestSetPageLocked(page))) { What happens if another cpu is coming through __lock_page at the same time, did its set_bit, now finds PageLocked, and so proceeds to the __wait_on_bit_lock? But this cpu now clears PG_waiters, so this task's unlock_page won't wake the other? You're right, we can't clear the bit here. Doubt it mattered much anyway? BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page above... that barrier is in the slow path as well though, so it shouldn't matter either. + clear_bit(PG_waiters, &page->flags); + return; + } __wait_on_bit_lock(page_waitqueue(page), &wait, sync_page, TASK_UNINTERRUPTIBLE); } -- SUSE Labs, Novell Inc.
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, 3 May 2007, Nick Piggin wrote: > > The problem is that lock/unlock_page is expensive on powerpc, and > if we improve that, we improve more than just the fault handler... > > The attached patch gets performance up a bit by avoiding some > barriers and some cachelines: There's a strong whiff of raciness about this... but I could very easily be wrong. > Index: linux-2.6/mm/filemap.c > === > --- linux-2.6.orig/mm/filemap.c 2007-05-02 15:00:26.0 +1000 > +++ linux-2.6/mm/filemap.c 2007-05-03 08:34:32.0 +1000 > @@ -532,11 +532,13 @@ > */ > void fastcall unlock_page(struct page *page) > { > + VM_BUG_ON(!PageLocked(page)); > smp_mb__before_clear_bit(); > - if (!TestClearPageLocked(page)) > - BUG(); > - smp_mb__after_clear_bit(); > - wake_up_page(page, PG_locked); > + ClearPageLocked(page); > + if (unlikely(test_bit(PG_waiters, &page->flags))) { > + clear_bit(PG_waiters, &page->flags); > + wake_up_page(page, PG_locked); > + } > } > EXPORT_SYMBOL(unlock_page); > > @@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!) > { > DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); > > + set_bit(PG_waiters, &page->flags); > + if (unlikely(!TestSetPageLocked(page))) { What happens if another cpu is coming through __lock_page at the same time, did its set_bit, now finds PageLocked, and so proceeds to the __wait_on_bit_lock? But this cpu now clears PG_waiters, so this task's unlock_page won't wake the other? > + clear_bit(PG_waiters, &page->flags); > + return; > + } > __wait_on_bit_lock(page_waitqueue(page), &wait, sync_page, > TASK_UNINTERRUPTIBLE); > }
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Thu, May 03, 2007 at 11:32:23AM +1000, Nick Piggin wrote: > The attached patch gets performance up a bit by avoiding some > barriers and some cachelines: > > G5 > pagefault fork exec > 2.6.21 1.49-1.51 164.6-170.8 741.8-760.3 > +patch 1.71-1.73 175.2-180.8 780.5-794.2 > +patch2 1.61-1.63 169.8-175.0 748.6-757.0 > > So that brings the fork/exec hits down to much less than 5%, and > would likely speed up other things that lock the page, like write > or page reclaim. Is that every fork/exec or just under certain circumstances? A 5% regression on every fork/exec is not acceptable.
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote: On Wed, 2 May 2007, Nick Piggin wrote: [snip] More on-topic, since you suggest doing more within vmtruncate_range than the filesystem: no, I'm afraid that's misdesigned, and I want to move almost all of it into the filesystem ->truncate_range. Because, if what vmtruncate_range is doing before it gets to the filesystem isn't to be just a waste of time, the filesystem needs to know what's going on in advance - just as notify_change warns the filesystem about a coming truncation. But easier than inventing some new notification is to move it all into the filesystem, with unmap_mapping_range+truncate_inode_pages_range its library helpers. Well I would prefer it to follow the same pattern as regular truncate. I don't think it is misdesigned to call the filesystem _first_, but I think if you do that then the filesystem should call the vm to prepare / finish truncate, rather than open code calls to unmap itself. But I'm pretty sure (to use your words!) regular truncate was not racy before: I believe Andrea's sequence count was handling that case fine, without a second unmap_mapping_range. OK, I think you're right. I _think_ it should also be OK with the lock_page version as well: we should not be able to have any pages after the first unmap_mapping_range call, because of the i_size write. So if we have no pages, there is nothing to 'cow' from. I'd be delighted if you can remove those later unmap_mapping_ranges. As I recall, the important thing for the copy pages is to be holding the page lock (or whatever other serialization) on the copied page still while the copy page is inserted into pagetable: that looks to be so in your __do_fault. Yeah, I think my thought process went wrong on those... I'll revisit. But it is a shame, and leaves me wondering what you gained with the page lock there. 
One thing gained is ease of understanding, and if your later patches build an edifice upon the knowledge of holding that page lock while faulting, I've no wish to undermine that foundation. It also fixes a bug, doesn't it? ;) Well, I'd come to think that perhaps the bugs would be solved by that second unmap_mapping_range alone, so the pagelock changes are just a misleading diversion. I'm not sure how I feel about that: calling unmap_mapping_range a second time feels such a cheat, but if (big if) it does solve the races, and the pagelock method is as expensive as your numbers now suggest... Well aside from being terribly ugly, it means we can still drop the dirty bit where we'd otherwise rather not, so I don't think we can do that. I think there may be some way we can do this without taking the page lock, and I was going to look at it, but I think it is quite neat to just lock the page... I don't think performance is _that_ bad. On the P4 it is a couple of % on the microbenchmarks. The G5 is worse, but even then I don't think it is. I'll try to improve that and get back to you. The problem is that lock/unlock_page is expensive on powerpc, and if we improve that, we improve more than just the fault handler... The attached patch gets performance up a bit by avoiding some barriers and some cachelines: G5 pagefault fork exec 2.6.21 1.49-1.51 164.6-170.8 741.8-760.3 +patch 1.71-1.73 175.2-180.8 780.5-794.2 +patch2 1.61-1.63 169.8-175.0 748.6-757.0 So that brings the fork/exec hits down to much less than 5%, and would likely speed up other things that lock the page, like write or page reclaim. I think we could get further performance improvement by implementing arch specific bitops for lock/unlock operations, so we don't need to use things like smp_mb__before_clear_bit() if they aren't needed, or full barriers in the test_and_set_bit(). -- SUSE Labs, Novell Inc.
Index: linux-2.6/include/linux/page-flags.h === --- linux-2.6.orig/include/linux/page-flags.h 2007-04-24 10:39:56.0 +1000 +++ linux-2.6/include/linux/page-flags.h2007-05-03 08:38:53.0 +1000 @@ -91,6 +91,8 @@ #define PG_nosave_free 18 /* Used for system suspend/resume */ #define PG_buddy 19 /* Page is free, on buddy lists */ +#define PG_waiters 20 /* Page has PG_locked waiters */ + /* PG_owner_priv_1 users should have descriptive aliases */ #define PG_checked PG_owner_priv_1 /* Used by some filesystems */ Index: linux-2.6/include/linux/pagemap.h === --- linux-2.6.orig/include/linux/pagemap.h 2007-04-24 10:39:56.0 +1000 +++ linux-2.6/include/linux/pagemap.h 2007-05-03 08:35:08.0 +1000 @@ -141,7 +141,7 @@ static inline void lock_page(struct page *page) { might_sleep(); - if (TestSetPageLocked(page)) + if (unlikely(TestSetPageLocked(page))) __lock_page(page); } @@ -152,7 +152,7 @@ static inline void
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Wed, 2 May 2007, Nick Piggin wrote:
> Hugh Dickins wrote:
> >
> > But I was quite disappointed when
> > mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch
> > appeared, putting double unmap_mapping_range calls in. Certainly
> > you were wrong to take the one out, but a pity to end up with two.
> >
> > Your comment says/said:
> >   The nopage vs invalidate race fix patch did not take care of truncating
> >   private COW pages. Mind you, I'm pretty sure this was previously racy
> >   even for regular truncate, not to mention vmtruncate_range.
> >
> > vmtruncate_range (holepunch) was deficient I agree, and though we
> > can now take out your second unmap_mapping_range there, that's only
> > because I've slipped one into shmem_truncate_range. In due course it
> > needs to be properly handled by noting the range in shmem inode info.
> >
> > (I think you couldn't take that approach, noting invalid range in
> > ->mapping while invalidating, because NFS has/had some cases of
> > invalidate_whatever without i_mutex?)
>
> Sorry, I didn't parse this?

I meant that i_size is used to protect against truncation races, but we have no equivalent inval_start,inval_end in the struct inode or struct address_space, such as could be used for similar protection against races while invalidating. And that IIRC there are places where NFS was doing the invalidation without i_mutex: so there could be concurrent invalidations, so one inval_start,inval_end in the structure wouldn't be enough anyway.

> But I wonder whether it is better to do
> it in vmtruncate_range than the filesystem? Private COWed pages are
> not really a filesystem "thing"...

It wasn't the thought of private COWed pages which made me put a second unmap_mapping_range in shmem_truncate_range, it was its own internal file<->swap consistency which needed that (as a quick fix).
The real fix would be to have a trunc_start,trunc_end or whatever in the shmem_inode_info (assuming it's not wanted in the common inode: might be if holepunch spreads e.g. it's been mentioned with fallocate).

Re private COWed pages and holepunch: Miklos and I agree that really it would be better for holepunch _not_ to remove them - but that's rather off-topic.

More on-topic, since you suggest doing more within vmtruncate_range than the filesystem: no, I'm afraid that's misdesigned, and I want to move almost all of it into the filesystem ->truncate_range. Because, if what vmtruncate_range is doing before it gets to the filesystem isn't to be just a waste of time, the filesystem needs to know what's going on in advance - just as notify_change warns the filesystem about a coming truncation. But easier than inventing some new notification is to move it all into the filesystem, with unmap_mapping_range+truncate_inode_pages_range as its library helpers.

> > But I'm pretty sure (to use your words!) regular truncate was not racy
> > before: I believe Andrea's sequence count was handling that case fine,
> > without a second unmap_mapping_range.
>
> OK, I think you're right. I _think_ it should also be OK with the
> lock_page version as well: we should not be able to have any pages
> after the first unmap_mapping_range call, because of the i_size
> write. So if we have no pages, there is nothing to 'cow' from.

I'd be delighted if you can remove those later unmap_mapping_ranges. As I recall, the important thing for the copy pages is to be holding the page lock (or whatever other serialization) on the copied page still while the copy page is inserted into the pagetable: that looks to be so in your __do_fault.

> But it is a shame, and leaves me wondering what you gained with the
> page lock there.
> > One thing gained is ease of understanding, and if your later patches
> > build an edifice upon the knowledge of holding that page lock while
> > faulting, I've no wish to undermine that foundation.
>
> It also fixes a bug, doesn't it? ;)

Well, I'd come to think that perhaps the bugs would be solved by that second unmap_mapping_range alone, so the pagelock changes are just a misleading diversion. I'm not sure how I feel about that: calling unmap_mapping_range a second time feels such a cheat, but if (big if) it does solve the races, and the pagelock method is as expensive as your numbers now suggest...

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22 -mm merge plans -- vm bugfixes
Nick Piggin wrote:
> Hugh Dickins wrote:
> > On Tue, 1 May 2007, Nick Piggin wrote:
> > > There were concerns that we could do this more cheaply, but I think it
> > > is important to start with a base that is simple and more likely to be
> > > correct and build on that. My testing didn't show any obvious problems
> > > with performance.
> >
> > I don't see _problems_ with performance, but I do consistently see the
> > same kind of ~5% degradation in lmbench fork, exec, sh, mmap latency
> > and page fault tests on SMP, several machines, just as I did last year.
>
> OK. I did run some tests at one stage which didn't show a regression on
> my P4, however I don't know that they were statistically significant.
> I'll try a couple more runs and post numbers.

I didn't have enough time tonight to get means/stddev, etc, but the runs are pretty stable. Patch tested was just the lock page one. SMP kernel, tasks bound to 1 CPU:

P4 Xeon   pagefault   fork         exec
2.6.21    1.67-1.69   140.7-142.0  449.5-460.8
+patch    1.75-1.77   144.0-145.5  456.2-463.0

So it's taken on nearly 5% on pagefault, but looks like less than 2% on fork, so not as bad as your numbers (phew).

G5        pagefault   fork         exec
2.6.21    1.49-1.51   164.6-170.8  741.8-760.3
+patch    1.71-1.73   175.2-180.8  780.5-794.2

Bigger hit there. Page faults can be improved a tiny bit by not using a test and clear op in unlock_page (fewer barriers for the G5). I don't think that's really a blocker problem for a merge, but I wonder what we can do to improve it. Lockless pagecache shaves quite a bit off straight-line find_get_page performance there.

Going to a non-sleeping lock might be one way to go in the long term, but it would require quite a lot of restructuring.

--
SUSE Labs, Novell Inc.
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote:
> On Wed, 2 May 2007, Nick Piggin wrote:

[snip]

> More on-topic, since you suggest doing more within vmtruncate_range
> than the filesystem: no, I'm afraid that's misdesigned, and I want to
> move almost all of it into the filesystem ->truncate_range. Because,
> if what vmtruncate_range is doing before it gets to the filesystem
> isn't to be just a waste of time, the filesystem needs to know what's
> going on in advance - just as notify_change warns the filesystem about
> a coming truncation. But easier than inventing some new notification
> is to move it all into the filesystem, with
> unmap_mapping_range+truncate_inode_pages_range as its library helpers.

Well I would prefer it to follow the same pattern as regular truncate. I don't think it is misdesigned to call the filesystem _first_, but I think if you do that then the filesystem should call the vm to prepare / finish truncate, rather than open code calls to unmap itself.

> > > But I'm pretty sure (to use your words!) regular truncate was not racy
> > > before: I believe Andrea's sequence count was handling that case fine,
> > > without a second unmap_mapping_range.
> >
> > OK, I think you're right. I _think_ it should also be OK with the
> > lock_page version as well: we should not be able to have any pages
> > after the first unmap_mapping_range call, because of the i_size
> > write. So if we have no pages, there is nothing to 'cow' from.
>
> I'd be delighted if you can remove those later unmap_mapping_ranges.
> As I recall, the important thing for the copy pages is to be holding
> the page lock (or whatever other serialization) on the copied page
> still while the copy page is inserted into the pagetable: that looks
> to be so in your __do_fault.

Yeah, I think my thought process went wrong on those... I'll revisit.

> > > But it is a shame, and leaves me wondering what you gained with the
> > > page lock there.
> >
> > One thing gained is ease of understanding, and if your later patches
> > build an edifice upon the knowledge of holding that page lock while
> > faulting, I've no wish to undermine that foundation.
>
> > It also fixes a bug, doesn't it? ;)
>
> Well, I'd come to think that perhaps the bugs would be solved by that
> second unmap_mapping_range alone, so the pagelock changes are just a
> misleading diversion. I'm not sure how I feel about that: calling
> unmap_mapping_range a second time feels such a cheat, but if (big if)
> it does solve the races, and the pagelock method is as expensive as
> your numbers now suggest...

Well aside from being terribly ugly, it means we can still drop the dirty bit where we'd otherwise rather not, so I don't think we can do that.

I think there may be some way we can do this without taking the page lock, and I was going to look at it, but I think it is quite neat to just lock the page... I don't think performance is _that_ bad. On the P4 it is a couple of % on the microbenchmarks. The G5 is worse, but even then I don't think it is. I'll try to improve that and get back to you. The problem is that lock/unlock_page is expensive on powerpc, and if we improve that, we improve more than just the fault handler...

The attached patch gets performance up a bit by avoiding some barriers and some cachelines:

G5        pagefault   fork         exec
2.6.21    1.49-1.51   164.6-170.8  741.8-760.3
+patch    1.71-1.73   175.2-180.8  780.5-794.2
+patch2   1.61-1.63   169.8-175.0  748.6-757.0

So that brings the fork/exec hits down to much less than 5%, and would likely speed up other things that lock the page, like write or page reclaim. I think we could get further performance improvement by implementing arch-specific bitops for lock/unlock operations, so we don't need to use things like smp_mb__before_clear_bit() if they aren't needed, or full barriers in the test_and_set_bit().

--
SUSE Labs, Novell Inc.
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2007-04-24 10:39:56.0 +1000
+++ linux-2.6/include/linux/page-flags.h	2007-05-03 08:38:53.0 +1000
@@ -91,6 +91,8 @@
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define PG_waiters		20	/* Page has PG_locked waiters */
+
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h	2007-04-24 10:39:56.0 +1000
+++ linux-2.6/include/linux/pagemap.h	2007-05-03 08:35:08.0 +1000
@@ -141,7 +141,7 @@
 static inline void lock_page(struct page *page)
 {
 	might_sleep();
-	if (TestSetPageLocked(page))
+	if (unlikely(TestSetPageLocked(page)))
 		__lock_page(page);
 }
 
@@ -152,7 +152,7 @@ static inline void
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote:
> On Tue, 1 May 2007, Nick Piggin wrote:
> > There were concerns that we could do this more cheaply, but I think it
> > is important to start with a base that is simple and more likely to be
> > correct and build on that. My testing didn't show any obvious problems
> > with performance.
>
> I don't see _problems_ with performance, but I do consistently see the
> same kind of ~5% degradation in lmbench fork, exec, sh, mmap latency
> and page fault tests on SMP, several machines, just as I did last year.

OK. I did run some tests at one stage which didn't show a regression on my P4, however I don't know that they were statistically significant. I'll try a couple more runs and post numbers.

> I'm assuming this patch is the one responsible: at 2.6.20-rc4 time
> you posted a set of 10 and a set of 7 patches I tried in versus out;
> at 2.6.21-rc3-mm2 time you had a group of patches in -mm I tried in
> versus out; with similar results.
>
> I did check the graphs on test.kernel.org, I couldn't see any bad
> behaviour there that correlated with this work; though each -mm has
> such a variety of new work in it, it's very hard to attribute. And
> nobody else has reported any regression from your patches.
>
> I'm inclined to write it off as poorer performance in some
> micro-benchmarks, against which we offset the improved
> understandability of holding the page lock over the file fault.
>
> But I was quite disappointed when
> mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch
> appeared, putting double unmap_mapping_range calls in. Certainly
> you were wrong to take the one out, but a pity to end up with two.
>
> Your comment says/said:
>   The nopage vs invalidate race fix patch did not take care of truncating
>   private COW pages. Mind you, I'm pretty sure this was previously racy
>   even for regular truncate, not to mention vmtruncate_range.
>
> vmtruncate_range (holepunch) was deficient I agree, and though we
> can now take out your second unmap_mapping_range there, that's only
> because I've slipped one into shmem_truncate_range. In due course it
> needs to be properly handled by noting the range in shmem inode info.
>
> (I think you couldn't take that approach, noting invalid range in
> ->mapping while invalidating, because NFS has/had some cases of
> invalidate_whatever without i_mutex?)

Sorry, I didn't parse this?

But I wonder whether it is better to do it in vmtruncate_range than the filesystem? Private COWed pages are not really a filesystem "thing"...

> But I'm pretty sure (to use your words!) regular truncate was not racy
> before: I believe Andrea's sequence count was handling that case fine,
> without a second unmap_mapping_range.

OK, I think you're right. I _think_ it should also be OK with the lock_page version as well: we should not be able to have any pages after the first unmap_mapping_range call, because of the i_size write. So if we have no pages, there is nothing to 'cow' from.

> Well, I guess I've come to accept that, expensive as unmap_mapping_range
> may be, truncating files while they're mmap'ed is perverse behaviour:
> perhaps even deserving such punishment. But it is a shame, and leaves
> me wondering what you gained with the page lock there.
>
> One thing gained is ease of understanding, and if your later patches
> build an edifice upon the knowledge of holding that page lock while
> faulting, I've no wish to undermine that foundation.

It also fixes a bug, doesn't it? ;)

--
SUSE Labs, Novell Inc.
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Tue, 1 May 2007, Nick Piggin wrote:
> Andrew Morton wrote:
> > mm-simplify-filemap_nopage.patch
> > mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
> > mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> > mm-merge-nopfn-into-fault.patch
> > convert-hugetlbfs-to-use-vm_ops-fault.patch
> > mm-remove-legacy-cruft.patch
> > mm-debug-check-for-the-fault-vs-invalidate-race.patch
> > mm-fix-clear_page_dirty_for_io-vs-fault-race.patch
> >
> > Miscish MM changes. Will merge, dependent upon what still applies and works
> > if the moveable-zone patches get stalled.
>
> These fix some bugs in the core vm, at least the former one we have
> seen numerous people hitting in production...
...
> So, do you or anyone else have any problems with these patches going in
> 2.6.22? I haven't had much feedback for a while, but I was under the
> impression that people are more-or-less happy with them?
>
> mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
>
> This patch fixes the core filemap_nopage vs invalidate_inode_pages2
> race by having filemap_nopage return a locked page to do_no_page,
> and removes the fairly complex (and inadequate) truncate_count
> synchronisation logic.
>
> There were concerns that we could do this more cheaply, but I think it
> is important to start with a base that is simple and more likely to
> be correct and build on that. My testing didn't show any obvious
> problems with performance.

I don't see _problems_ with performance, but I do consistently see the same kind of ~5% degradation in lmbench fork, exec, sh, mmap latency and page fault tests on SMP, several machines, just as I did last year.

I'm assuming this patch is the one responsible: at 2.6.20-rc4 time you posted a set of 10 and a set of 7 patches I tried in versus out; at 2.6.21-rc3-mm2 time you had a group of patches in -mm I tried in versus out; with similar results.
I did check the graphs on test.kernel.org, I couldn't see any bad behaviour there that correlated with this work; though each -mm has such a variety of new work in it, it's very hard to attribute. And nobody else has reported any regression from your patches.

I'm inclined to write it off as poorer performance in some micro-benchmarks, against which we offset the improved understandability of holding the page lock over the file fault.

But I was quite disappointed when mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch appeared, putting double unmap_mapping_range calls in. Certainly you were wrong to take the one out, but a pity to end up with two. Your comment says/said:

  The nopage vs invalidate race fix patch did not take care of truncating
  private COW pages. Mind you, I'm pretty sure this was previously racy
  even for regular truncate, not to mention vmtruncate_range.

vmtruncate_range (holepunch) was deficient I agree, and though we can now take out your second unmap_mapping_range there, that's only because I've slipped one into shmem_truncate_range. In due course it needs to be properly handled by noting the range in shmem inode info.

(I think you couldn't take that approach, noting invalid range in ->mapping while invalidating, because NFS has/had some cases of invalidate_whatever without i_mutex?)

But I'm pretty sure (to use your words!) regular truncate was not racy before: I believe Andrea's sequence count was handling that case fine, without a second unmap_mapping_range.

Well, I guess I've come to accept that, expensive as unmap_mapping_range may be, truncating files while they're mmap'ed is perverse behaviour: perhaps even deserving such punishment. But it is a shame, and leaves me wondering what you gained with the page lock there.

One thing gained is ease of understanding, and if your later patches build an edifice upon the knowledge of holding that page lock while faulting, I've no wish to undermine that foundation.
> mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> mm-merge-nopfn-into-fault.patch
> etc.
>
> These move ->nopage, ->populate, ->nopfn (and soon, ->page_mkwrite)
> into a single, unified interface. Although this strictly closes some
> similar holes in nonlinear faults as well, they are very uncommon, so
> I wouldn't be so upset if these aren't merged in 2.6.22 (I don't see
> any reason not to, but at least they don't fix major bugs).

I don't have an opinion on these, but I think BenH and others were strongly in favour, with various people waiting upon them.

Hugh
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Tue, 01 May 2007 18:44:07 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

> > mm-simplify-filemap_nopage.patch
> > mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
> > mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> > mm-merge-nopfn-into-fault.patch
> > convert-hugetlbfs-to-use-vm_ops-fault.patch
> > mm-remove-legacy-cruft.patch
> > mm-debug-check-for-the-fault-vs-invalidate-race.patch
> > mm-fix-clear_page_dirty_for_io-vs-fault-race.patch
> >
> > Miscish MM changes. Will merge, dependent upon what still applies and works
> > if the moveable-zone patches get stalled.
>
> These fix some bugs in the core vm, at least the former one we have
> seen numerous people hitting in production...
>
> I don't suppose you mean these are logically dependent on new features
> sitting below them in your patch stack, just that you don't want to
> spend time fixing a lot of rejects?

It'll probably be OK - I just haven't checked yet. I'm fairly handy at fixing rejects nowadays ;)

Nobody seems to be taking up this opportunity to provide us with review and test results on the antifrag patches.
Re: 2.6.22 -mm merge plans -- vm bugfixes
Andrew Morton wrote:
> mm-simplify-filemap_nopage.patch
> mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
> mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> mm-merge-nopfn-into-fault.patch
> convert-hugetlbfs-to-use-vm_ops-fault.patch
> mm-remove-legacy-cruft.patch
> mm-debug-check-for-the-fault-vs-invalidate-race.patch
> mm-fix-clear_page_dirty_for_io-vs-fault-race.patch
>
> Miscish MM changes. Will merge, dependent upon what still applies and works
> if the moveable-zone patches get stalled.

These fix some bugs in the core vm, at least the former one we have seen numerous people hitting in production...

I don't suppose you mean these are logically dependent on new features sitting below them in your patch stack, just that you don't want to spend time fixing a lot of rejects? If so, I can help fix those up, but I don't think there is anything major; IIRC the biggest annoyance is just that changing some GFP_types throws some big hunks.

So, do you or anyone else have any problems with these patches going in 2.6.22? I haven't had much feedback for a while, but I was under the impression that people are more-or-less happy with them?

mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch

This patch fixes the core filemap_nopage vs invalidate_inode_pages2 race by having filemap_nopage return a locked page to do_no_page, and removes the fairly complex (and inadequate) truncate_count synchronisation logic.

There were concerns that we could do this more cheaply, but I think it is important to start with a base that is simple and more likely to be correct and build on that. My testing didn't show any obvious problems with performance.

mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
mm-merge-nopfn-into-fault.patch
etc.

These move ->nopage, ->populate, ->nopfn (and soon, ->page_mkwrite) into a single, unified interface. Although this strictly closes some similar holes in nonlinear faults as well, they are very uncommon, so I wouldn't be so upset if these aren't merged in 2.6.22 (I don't see any reason not to, but at least they don't fix major bugs).

--
SUSE Labs, Novell Inc.
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Tue, 01 May 2007 18:44:07 +1000 Nick Piggin [EMAIL PROTECTED] wrote:
> > mm-simplify-filemap_nopage.patch
> > mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
> > mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> > mm-merge-nopfn-into-fault.patch
> > convert-hugetlbfs-to-use-vm_ops-fault.patch
> > mm-remove-legacy-cruft.patch
> > mm-debug-check-for-the-fault-vs-invalidate-race.patch
> > mm-fix-clear_page_dirty_for_io-vs-fault-race.patch
> >
> > Miscish MM changes. Will merge, dependent upon what still applies and works if the moveable-zone patches get stalled.
>
> These fix some bugs in the core vm; at least the former one we have seen numerous people hitting in production... I don't suppose you mean these are logically dependent on new features sitting below them in your patch stack, just that you don't want to spend time fixing a lot of rejects?

It'll probably be OK - I just haven't checked yet. I'm fairly handy at fixing rejects nowadays ;)

Nobody seems to be taking up this opportunity to provide us with review and test results on the antifrag patches.
Re: 2.6.22 -mm merge plans -- vm bugfixes
On Tue, 1 May 2007, Nick Piggin wrote:
> Andrew Morton wrote:
> > mm-simplify-filemap_nopage.patch
> > mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
> > mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> > mm-merge-nopfn-into-fault.patch
> > convert-hugetlbfs-to-use-vm_ops-fault.patch
> > mm-remove-legacy-cruft.patch
> > mm-debug-check-for-the-fault-vs-invalidate-race.patch
> > mm-fix-clear_page_dirty_for_io-vs-fault-race.patch
> >
> > Miscish MM changes. Will merge, dependent upon what still applies and works if the moveable-zone patches get stalled.
>
> These fix some bugs in the core vm; at least the former one we have seen numerous people hitting in production...
>
> [...]
>
> So, do you or anyone else have any problems with these patches going into 2.6.22? I haven't had much feedback for a while, but I was under the impression that people are more or less happy with them?
>
> mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
>
> This patch fixes the core filemap_nopage vs invalidate_inode_pages2 race by having filemap_nopage return a locked page to do_no_page, and removes the fairly complex (and inadequate) truncate_count synchronisation logic. There were concerns that we could do this more cheaply, but I think it is important to start with a base that is simple and more likely to be correct, and to build on that. My testing didn't show any obvious problems with performance.

I don't see _problems_ with performance, but I do consistently see the same kind of ~5% degradation in lmbench fork, exec, sh, mmap latency and page fault tests on SMP, several machines, just as I did last year. I'm assuming this patch is the one responsible: at 2.6.20-rc4 time you posted a set of 10 and a set of 7 patches that I tried in versus out; at 2.6.21-rc3-mm2 time you had a group of patches in -mm that I tried in versus out; with similar results.

I did check the graphs on test.kernel.org, and I couldn't see any bad behaviour there that correlated with this work; though each -mm has such a variety of new work in it that it's very hard to attribute. And nobody else has reported any regression from your patches. I'm inclined to write it off as poorer performance in some micro-benchmarks, against which we offset the improved understandability of holding the page lock over the file fault.

But I was quite disappointed when mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch appeared, putting double unmap_mapping_range calls in. Certainly you were wrong to take the one out, but it's a pity to end up with two. Your comment says/said:

> The nopage vs invalidate race fix patch did not take care of truncating private COW pages. Mind you, I'm pretty sure this was previously racy even for regular truncate, not to mention vmtruncate_range.

vmtruncate_range (holepunch) was deficient, I agree, and though we can now take out your second unmap_mapping_range there, that's only because I've slipped one into shmem_truncate_range. In due course it needs to be properly handled by noting the range in the shmem inode info. (I think you couldn't take that approach, noting the invalid range in ->mapping while invalidating, because NFS has/had some cases of invalidate_whatever without i_mutex?)

But I'm pretty sure (to use your words!) regular truncate was not racy before: I believe Andrea's sequence count was handling that case fine, without a second unmap_mapping_range.

Well, I guess I've come to accept that, expensive as unmap_mapping_range may be, truncating files while they're mmap'ed is perverse behaviour: perhaps even deserving such punishment. But it is a shame, and it leaves me wondering what you gained with the page lock there. One thing gained is ease of understanding, and if your later patches build an edifice upon the knowledge of holding that page lock while faulting, I've no wish to undermine that foundation.

> mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> mm-merge-nopfn-into-fault.patch
> etc.
>
> These move ->nopage, ->populate, and ->nopfn (and soon, ->page_mkwrite) into a single, unified interface. Although this strictly closes some similar holes in nonlinear faults as well, those are very uncommon, so I wouldn't be so upset if these aren't merged in 2.6.22 (I don't see any reason not to, but at least they don't fix major bugs).

I don't have an opinion on these, but I think BenH and others were strongly in favour, with various people waiting upon them.

Hugh
Re: 2.6.22 -mm merge plans -- vm bugfixes
Hugh Dickins wrote:
> On Tue, 1 May 2007, Nick Piggin wrote:
> > There were concerns that we could do this more cheaply, but I think it is important to start with a base that is simple and more likely to be correct, and to build on that. My testing didn't show any obvious problems with performance.
>
> I don't see _problems_ with performance, but I do consistently see the same kind of ~5% degradation in lmbench fork, exec, sh, mmap latency and page fault tests on SMP, several machines, just as I did last year.

OK. I did run some tests at one stage which didn't show a regression on my P4; however, I don't know that they were statistically significant. I'll try a couple more runs and post numbers.

> I'm assuming this patch is the one responsible: at 2.6.20-rc4 time you posted a set of 10 and a set of 7 patches that I tried in versus out; at 2.6.21-rc3-mm2 time you had a group of patches in -mm that I tried in versus out; with similar results.
>
> I did check the graphs on test.kernel.org, and I couldn't see any bad behaviour there that correlated with this work; though each -mm has such a variety of new work in it that it's very hard to attribute. And nobody else has reported any regression from your patches. I'm inclined to write it off as poorer performance in some micro-benchmarks, against which we offset the improved understandability of holding the page lock over the file fault.
>
> But I was quite disappointed when mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch appeared, putting double unmap_mapping_range calls in. Certainly you were wrong to take the one out, but it's a pity to end up with two. Your comment says/said:
>
> > The nopage vs invalidate race fix patch did not take care of truncating private COW pages. Mind you, I'm pretty sure this was previously racy even for regular truncate, not to mention vmtruncate_range.
>
> vmtruncate_range (holepunch) was deficient, I agree, and though we can now take out your second unmap_mapping_range there, that's only because I've slipped one into shmem_truncate_range. In due course it needs to be properly handled by noting the range in the shmem inode info. (I think you couldn't take that approach, noting the invalid range in ->mapping while invalidating, because NFS has/had some cases of invalidate_whatever without i_mutex?)

Sorry, I didn't parse this. But I wonder whether it is better to do it in vmtruncate_range than in the filesystem? Private COWed pages are not really a filesystem thing...

> But I'm pretty sure (to use your words!) regular truncate was not racy before: I believe Andrea's sequence count was handling that case fine, without a second unmap_mapping_range.

OK, I think you're right. I _think_ it should be OK with the lock_page version as well: we should not be able to have any pages after the first unmap_mapping_range call, because of the i_size write. So if we have no pages, there is nothing to COW from.

> Well, I guess I've come to accept that, expensive as unmap_mapping_range may be, truncating files while they're mmap'ed is perverse behaviour: perhaps even deserving such punishment. But it is a shame, and it leaves me wondering what you gained with the page lock there. One thing gained is ease of understanding, and if your later patches build an edifice upon the knowledge of holding that page lock while faulting, I've no wish to undermine that foundation.

It also fixes a bug, doesn't it? ;)

--
SUSE Labs, Novell Inc.