Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-09 Thread Nick Piggin

Hugh Dickins wrote:

On Thu, 10 May 2007, Nick Piggin wrote:


The filesystem (or page cache) allows pages beyond i_size to come
in there?  That wasn't a problem before, was it?  But now it is?


The filesystem still doesn't, but if i_size is updated after the page
is returned, we can have a problem that was previously taken care of
with the truncate_count but now isn't.



But... I thought the page lock was now taking care of that in your
scheme?  truncate_inode_pages has to wait for the page lock, then
it finds the page is mapped and... ahh, it finds the copiee page
is not mapped, so doesn't do its own little unmap_mapping_range,
and the copied page squeaks through.  Drat.

I really think the truncate_count solution worked better, for
truncation anyway.  There may be persuasive reasons you need the
page lock for invalidation: I gave up on trying to understand the
required behaviour(s) for invalidation.

So, bring back (the original use of, not my tree marker use of)
truncate_count?  Hmm, you probably don't want to do that, because
there was some pleasure in removing the strange barriers associated
with it.

A second unmap_mapping_range is just one line of code - but it sure
feels like a defeat to me, calling the whole exercise into question.
(But then, you'd be right to say my perfectionism made it impossible
for me to come up with any solution to the invalidation issues.)


Well we could bring back the truncate_count, but I think that sucks
because that's moving work into the page fault handler in order to
avoid a bit of work when truncating mapped files.



But that's a change we could have made at
any time if we'd bothered, it's not really the issue here.


I don't see how you could, because you need to increment truncate_count.



Though indeed we did so, I don't see that we needed to increment
truncate_count in that case (nobody could be coming through
do_no_page on that file, when there are no mappings of it).


Of course :P

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-09 Thread Hugh Dickins
On Thu, 10 May 2007, Nick Piggin wrote:
> > 
> > The filesystem (or page cache) allows pages beyond i_size to come
> > in there?  That wasn't a problem before, was it?  But now it is?
> 
> The filesystem still doesn't, but if i_size is updated after the page
> is returned, we can have a problem that was previously taken care of
> with the truncate_count but now isn't.

But... I thought the page lock was now taking care of that in your
scheme?  truncate_inode_pages has to wait for the page lock, then
it finds the page is mapped and... ahh, it finds the copiee page
is not mapped, so doesn't do its own little unmap_mapping_range,
and the copied page squeaks through.  Drat.

I really think the truncate_count solution worked better, for
truncation anyway.  There may be persuasive reasons you need the
page lock for invalidation: I gave up on trying to understand the
required behaviour(s) for invalidation.

So, bring back (the original use of, not my tree marker use of)
truncate_count?  Hmm, you probably don't want to do that, because
there was some pleasure in removing the strange barriers associated
with it.

A second unmap_mapping_range is just one line of code - but it sure
feels like a defeat to me, calling the whole exercise into question.
(But then, you'd be right to say my perfectionism made it impossible
for me to come up with any solution to the invalidation issues.)

> > Suspect you'd need a barrier of some kind between the i_size_write and
> > the mapping_mapped test?
> 
> The unmap_mapping_range that runs after the truncate_inode_pages should
> run in the correct order, I believe.

Yes, if there's going to be that backup call, the first won't really
need a barrier.
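
(For reference, a minimal sketch of the ordering with that backup call, using
the 2.6-era helpers by name; illustrative only, not the actual vmtruncate
code:)

static void truncate_ordering_sketch(struct inode *inode, loff_t newsize)
{
	struct address_space *mapping = inode->i_mapping;

	i_size_write(inode, newsize);
	/* zap existing ptes beyond newsize; holelen of 0 means "to the end",
	 * and the + PAGE_SIZE - 1 keeps a partial last page mapped */
	unmap_mapping_range(mapping, newsize + PAGE_SIZE - 1, 0, 1);
	truncate_inode_pages(mapping, newsize);
	/*
	 * The backup call: it catches any pte (e.g. a COWed copy) that a
	 * racing fault instantiated between the first unmap and the point
	 * where the truncate loop inspected the page.
	 */
	unmap_mapping_range(mapping, newsize + PAGE_SIZE - 1, 0, 1);
}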

> > But that's a change we could have made at
> > any time if we'd bothered, it's not really the issue here.
> 
> I don't see how you could, because you need to increment truncate_count.

Though indeed we did so, I don't see that we needed to increment
truncate_count in that case (nobody could be coming through
do_no_page on that file, when there are no mappings of it).

> But I believe this is fixing the issue, even if it does so in a peripheral
> manner, because it avoids the added cost for unmapped files.

It's a small improvement to your common case, I agree.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-09 Thread Nick Piggin

Hugh Dickins wrote:

On Wed, 9 May 2007, Nick Piggin wrote:


Hugh Dickins wrote:


On Wed, 2 May 2007, Nick Piggin wrote:



But I'm pretty sure (to use your words!) regular truncate was not racy
before: I believe Andrea's sequence count was handling that case fine,
without a second unmap_mapping_range.


OK, I think you're right. I _think_ it should also be OK with the
lock_page version as well: we should not be able to have any pages
after the first unmap_mapping_range call, because of the i_size
write. So if we have no pages, there is nothing to 'cow' from.


I'd be delighted if you can remove those later unmap_mapping_ranges.
As I recall, the important thing for the copy pages is to be holding
the page lock (or whatever other serialization) on the copied page
still while the copy page is inserted into pagetable: that looks
to be so in your __do_fault.


Hmm, on second thoughts, I think I was right the first time, and do
need the unmap after the pages are truncated. With the lock_page code,
after the first unmap, we can get new ptes mapping pages, and
subsequently they can be COWed and then the original pte zapped before
the truncate loop checks it.
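
(Spelled out as a timeline, the window looks roughly like this; an
illustrative comment only, for a private mapping racing with truncate:)

/*
 *   truncate                           page fault
 *   --------                           ----------
 *   i_size_write(inode, newsize)
 *   unmap_mapping_range()  [1st]
 *                                      file page already obtained before the
 *                                      i_size check, new pte installed
 *                                      write fault COWs it: the pte now
 *                                      points at the anonymous copy
 *   truncate_inode_pages()
 *     sees the original page unmapped,
 *     so skips its per-page unmap and
 *     the copy squeaks through
 *   unmap_mapping_range()  [2nd]       <- the backup call that zaps the copy
 */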



The filesystem (or page cache) allows pages beyond i_size to come
in there?  That wasn't a problem before, was it?  But now it is?


The filesystem still doesn't, but if i_size is updated after the page
is returned, we can have a problem that was previously taken care of
with the truncate_count but now isn't.


However, I wonder if we can't test mapping_mapped before the spinlock,
which would make most truncates cheaper?



Slightly cheaper, yes, though I doubt it'd be much in comparison with
actually doing any work in unmap_mapping_range or truncate_inode_pages.


But if we're supposing the common case for truncate is unmapped mappings,
then the main cost there will be the locking, which I'm trying to avoid.
Hopefully with this patch, most truncate workloads would get faster, even
though truncating mapped files is going to be unavoidably slower.



Suspect you'd need a barrier of some kind between the i_size_write and
the mapping_mapped test?


The unmap_mapping_range that runs after the truncate_inode_pages should
run in the correct order, I believe.


 But that's a change we could have made at
any time if we'd bothered, it's not really the issue here.


I don't see how you could, because you need to increment truncate_count.

But I believe this is fixing the issue, even if it does so in a peripheral
manner, because it avoids the added cost for unmapped files.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-09 Thread Hugh Dickins
On Wed, 9 May 2007, Nick Piggin wrote:
> Hugh Dickins wrote:
> > On Wed, 2 May 2007, Nick Piggin wrote:
> 
> > > >But I'm pretty sure (to use your words!) regular truncate was not racy
> > > >before: I believe Andrea's sequence count was handling that case fine,
> > > >without a second unmap_mapping_range.
> > >
> > >OK, I think you're right. I _think_ it should also be OK with the
> > >lock_page version as well: we should not be able to have any pages
> > >after the first unmap_mapping_range call, because of the i_size
> > >write. So if we have no pages, there is nothing to 'cow' from.
> > 
> > I'd be delighted if you can remove those later unmap_mapping_ranges.
> > As I recall, the important thing for the copy pages is to be holding
> > the page lock (or whatever other serialization) on the copied page
> > still while the copy page is inserted into pagetable: that looks
> > to be so in your __do_fault.
> 
> Hmm, on second thoughts, I think I was right the first time, and do
> need the unmap after the pages are truncated. With the lock_page code,
> after the first unmap, we can get new ptes mapping pages, and
> subsequently they can be COWed and then the original pte zapped before
> the truncate loop checks it.

The filesystem (or page cache) allows pages beyond i_size to come
in there?  That wasn't a problem before, was it?  But now it is?

> 
> However, I wonder if we can't test mapping_mapped before the spinlock,
> which would make most truncates cheaper?

Slightly cheaper, yes, though I doubt it'd be much in comparison with
actually doing any work in unmap_mapping_range or truncate_inode_pages.
Suspect you'd need a barrier of some kind between the i_size_write and
the mapping_mapped test?  But that's a change we could have made at
any time if we'd bothered, it's not really the issue here.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-09 Thread Nick Piggin

Hugh Dickins wrote:

On Wed, 2 May 2007, Nick Piggin wrote:



But I'm pretty sure (to use your words!) regular truncate was not racy
before: I believe Andrea's sequence count was handling that case fine,
without a second unmap_mapping_range.


OK, I think you're right. I _think_ it should also be OK with the
lock_page version as well: we should not be able to have any pages
after the first unmap_mapping_range call, because of the i_size
write. So if we have no pages, there is nothing to 'cow' from.



I'd be delighted if you can remove those later unmap_mapping_ranges.
As I recall, the important thing for the copy pages is to be holding
the page lock (or whatever other serialization) on the copied page
still while the copy page is inserted into pagetable: that looks
to be so in your __do_fault.


Hmm, on second thoughts, I think I was right the first time, and do
need the unmap after the pages are truncated. With the lock_page code,
after the first unmap, we can get new ptes mapping pages, and
subsequently they can be COWed and then the original pte zapped before
the truncate loop checks it.

However, I wonder if we can't test mapping_mapped before the spinlock,
which would make most truncates cheaper?
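
(For context, mapping_mapped() is just a cheap test of the mapping's rmap
structures, which is why testing it before taking the i_mmap_lock spinlock
avoids the locking cost for unmapped files. A sketch of the 2.6.21-era
definition from memory; the exact fields may differ:)

static inline int mapping_mapped(struct address_space *mapping)
{
	/* true if any vma currently maps this file */
	return	!prio_tree_empty(&mapping->i_mmap) ||
		!list_empty(&mapping->i_mmap_nonlinear);
}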

--
SUSE Labs, Novell Inc.
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c 2007-04-24 15:02:51.0 +1000
+++ linux-2.6/mm/filemap.c  2007-05-09 17:30:47.0 +1000
@@ -2579,8 +2579,7 @@
if (rw == WRITE) {
write_len = iov_length(iov, nr_segs);
end = (offset + write_len - 1) >> PAGE_CACHE_SHIFT;
-   if (mapping_mapped(mapping))
-   unmap_mapping_range(mapping, offset, write_len, 0);
+   unmap_mapping_range(mapping, offset, write_len, 0);
}
 
retval = filemap_write_and_wait(mapping);
Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c  2007-05-09 17:25:28.0 +1000
+++ linux-2.6/mm/memory.c   2007-05-09 17:30:22.0 +1000
@@ -1956,6 +1956,9 @@
pgoff_t hba = holebegin >> PAGE_SHIFT;
pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
+   if (!mapping_mapped(mapping))
+   return;
+
/* Check for overflow. */
if (sizeof(holelen) > sizeof(hlen)) {
long long holeend =


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-07 Thread Benjamin Herrenschmidt
On Fri, 2007-05-04 at 19:23 +1000, Nick Piggin wrote:

> These ops could also be put to use in bit spinlocks, buffer lock, and
> probably a few other places too.

Ok, the performance hit seems to be under control (especially with the
bigger benchmark showing actual improvements).

There's a little bogon with the PG_waiters bit that you already know
about, but apart from that it should be ok.

I must say I absolutely _LOVE_ the bitops with explicit _lock/_unlock
semantics. That should allow us to remove a whole bunch of dodgy
barriers and smp_mb__before_whatever_magic_crap() things we have all
over the place by providing precisely the expected semantics for bit
locks.
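
(As one example of the kind of user this enables, a bit spinlock built on the
proposed ops might look like the sketch below; illustrative only, not the
existing <linux/bit_spinlock.h> code:)

static inline void bit_spin_lock_sketch(int bitnum, unsigned long *addr)
{
	/* test_and_set_bit_lock() has acquire semantics, so no extra
	 * barrier is needed on the way in */
	while (unlikely(test_and_set_bit_lock(bitnum, addr))) {
		while (test_bit(bitnum, addr))
			cpu_relax();
	}
}

static inline void bit_spin_unlock_sketch(int bitnum, unsigned long *addr)
{
	/* clear_bit_unlock() has release semantics, replacing the usual
	 * smp_mb__before_clear_bit() + clear_bit() pair */
	clear_bit_unlock(bitnum, addr);
}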

There are quite a few people who've been trying to do bit locks and I've
always been very worried by how easy it is to get the barriers wrong (or
too many barriers in the fast path) with these.

There are a couple of things we might want to think about regarding the
actual API to bit locks... the API you propose is simple, but it might
not fit some of the most exotic usage requirements, which typically are
related to manipulating other bits along with the lock bit.

We might just ignore them though. In the case of the page lock, it's
only hitting the slow path, and I would expect other usage scenarios to
be similar.

Cheers,
Ben.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-04 Thread Nick Piggin

Nick Piggin wrote:

Nick Piggin wrote:


Christoph Hellwig wrote:




Is that every fork/exec or just under certain circumstances?
A 5% regression on every fork/exec is not acceptable.




Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4
numbers will be improved as well with that patch. Then if we have
specific lock/unlock bitops, I hope it should reduce that further.



OK, with the races and missing barriers fixed from the previous patch,
plus the attached one added (+patch3), numbers are better again (I'm not
sure if I have the ppc barriers correct though).

These ops could also be put to use in bit spinlocks, buffer lock, and
probably a few other places too.

G5
         pagefault   fork          exec
2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
+patch   1.71-1.73   175.2-180.8   780.5-794.2
+patch2  1.61-1.63   169.8-175.0   748.6-757.0
+patch3  1.54-1.57   165.6-170.9   748.5-757.5

So fault performance goes to under 5%, fork is in the noise, exec is
still up 1%, but maybe that's noise or cache effects again.


OK, with my new lock/unlock_page, dd if=large (bigger than RAM) sparse
file of=/dev/null with an experimentally optimal block size (32K) goes
from 626MB/s to 683MB/s on a 2-CPU G5 booted with maxcpus=1.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-04 Thread Nick Piggin

Nick Piggin wrote:

Christoph Hellwig wrote:



Is that every fork/exec or just under certain circumstances?
A 5% regression on every fork/exec is not acceptable.



Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4
numbers will be improved as well with that patch. Then if we have
specific lock/unlock bitops, I hope it should reduce that further.


OK, with the races and missing barriers fixed from the previous patch,
plus the attached one added (+patch3), numbers are better again (I'm not
sure if I have the ppc barriers correct though).

These ops could also be put to use in bit spinlocks, buffer lock, and
probably a few other places too.

G5
         pagefault   fork          exec
2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
+patch   1.71-1.73   175.2-180.8   780.5-794.2
+patch2  1.61-1.63   169.8-175.0   748.6-757.0
+patch3  1.54-1.57   165.6-170.9   748.5-757.5

So fault performance goes to under 5%, fork is in the noise, exec is
still up 1%, but maybe that's noise or cache effects again.

--
SUSE Labs, Novell Inc.
Index: linux-2.6/include/asm-powerpc/bitops.h
===
--- linux-2.6.orig/include/asm-powerpc/bitops.h 2007-05-04 16:08:20.0 +1000
+++ linux-2.6/include/asm-powerpc/bitops.h  2007-05-04 16:14:39.0 +1000
@@ -87,6 +87,24 @@
: "cc" );
 }
 
+static __inline__ void clear_bit_unlock(int nr, volatile unsigned long *addr)
+{
+   unsigned long old;
+   unsigned long mask = BITOP_MASK(nr);
+   unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr);
+
+   __asm__ __volatile__(
+   LWSYNC_ON_SMP
+"1:"   PPC_LLARX "%0,0,%3  # clear_bit_unlock\n"
+   "andc   %0,%0,%2\n"
+   PPC405_ERR77(0,%3)
+   PPC_STLCX "%0,0,%3\n"
+   "bne-   1b"
+   : "=&r" (old), "+m" (*p)
+   : "r" (mask), "r" (p)
+   : "cc" );
+}
+
 static __inline__ void change_bit(int nr, volatile unsigned long *addr)
 {
unsigned long old;
@@ -126,6 +144,27 @@
return (old & mask) != 0;
 }
 
+static __inline__ int test_and_set_bit_lock(unsigned long nr,
+  volatile unsigned long *addr)
+{
+   unsigned long old, t;
+   unsigned long mask = BITOP_MASK(nr);
+   unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr);
+
+   __asm__ __volatile__(
+"1:"   PPC_LLARX "%0,0,%3  # test_and_set_bit_lock\n"
+   "or %1,%0,%2 \n"
+   PPC405_ERR77(0,%3)
+   PPC_STLCX "%1,0,%3 \n"
+   "bne-   1b"
+   ISYNC_ON_SMP
+   : "=&r" (old), "=&r" (t)
+   : "r" (mask), "r" (p)
+   : "cc", "memory");
+
+   return (old & mask) != 0;
+}
+
 static __inline__ int test_and_clear_bit(unsigned long nr,
 volatile unsigned long *addr)
 {
Index: linux-2.6/include/linux/pagemap.h
===
--- linux-2.6.orig/include/linux/pagemap.h  2007-05-04 16:14:36.0 +1000
+++ linux-2.6/include/linux/pagemap.h   2007-05-04 16:17:34.0 +1000
@@ -136,13 +136,18 @@
 extern void FASTCALL(__wait_on_page_locked(struct page *page));
 extern void FASTCALL(unlock_page(struct page *page));
 
+static inline int trylock_page(struct page *page)
+{
+   return (likely(!TestSetPageLocked_Lock(page)));
+}
+
 /*
  * lock_page may only be called if we have the page's inode pinned.
  */
 static inline void lock_page(struct page *page)
 {
might_sleep();
-   if (unlikely(TestSetPageLocked(page)))
+   if (!trylock_page(page))
__lock_page(page);
 }
 
@@ -153,7 +158,7 @@
 static inline void lock_page_nosync(struct page *page)
 {
might_sleep();
-   if (unlikely(TestSetPageLocked(page)))
+   if (!trylock_page(page))
__lock_page_nosync(page);
 }

Index: linux-2.6/drivers/scsi/sg.c
===
--- linux-2.6.orig/drivers/scsi/sg.c2007-04-12 14:35:08.0 +1000
+++ linux-2.6/drivers/scsi/sg.c 2007-05-04 16:23:27.0 +1000
@@ -1734,7 +1734,7 @@
  */
flush_dcache_page(pages[i]);
/* ?? Is locking needed? I don't think so */
-   /* if (TestSetPageLocked(pages[i]))
+   /* if (!trylock_page(pages[i]))
   goto out_unlock; */
 }
 
Index: linux-2.6/fs/cifs/file.c
===
--- linux-2.6.orig/fs/cifs/file.c   2007-04-12 14:35:09.0 +1000
+++ linux-2.6/fs/cifs/file.c2007-05-04 16:23:36.0 +1000
@@ -1229,7 +1229,7 @@
 
if (first < 0)
lock_page(page);
-   else if (TestSetPageLocked(page))
+   else if (!trylock_page(page))
break;
 
if (unlikely(page->mapping != mapping)) {
Index: linux-2.6/fs/jbd/commit.c


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Nick Piggin

Andrew Morton wrote:

On Thu, 03 May 2007 11:32:23 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:



void fastcall unlock_page(struct page *page)
{
+   VM_BUG_ON(!PageLocked(page));
smp_mb__before_clear_bit();
-   if (!TestClearPageLocked(page))
-   BUG();
-	smp_mb__after_clear_bit(); 
-	wake_up_page(page, PG_locked);

+   ClearPageLocked(page);
+   if (unlikely(test_bit(PG_waiters, &page->flags))) {
+   clear_bit(PG_waiters, &page->flags);
+   wake_up_page(page, PG_locked);
+   }
}



Why is that significantly faster than plain old wake_up_page(), which
tests waitqueue_active()?


Because it needs fewer barriers and doesn't touch a random hash
cacheline in the fastpath.
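
(The hash cacheline in question is the per-zone wait table that
wake_up_page() has to look up even when nobody is waiting; roughly, from
memory of the 2.6-era page_waitqueue():)

static wait_queue_head_t *page_waitqueue_sketch(struct page *page)
{
	const struct zone *zone = page_zone(page);

	/* a hashed wait queue in zone->wait_table: a cacheline unrelated
	 * to the struct page itself, touched on every unlock today */
	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}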

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Andrew Morton
On Thu, 03 May 2007 11:32:23 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

>  void fastcall unlock_page(struct page *page)
>  {
> + VM_BUG_ON(!PageLocked(page));
>   smp_mb__before_clear_bit();
> - if (!TestClearPageLocked(page))
> - BUG();
> - smp_mb__after_clear_bit(); 
> - wake_up_page(page, PG_locked);
> + ClearPageLocked(page);
> + if (unlikely(test_bit(PG_waiters, &page->flags))) {
> + clear_bit(PG_waiters, &page->flags);
> + wake_up_page(page, PG_locked);
> + }
>  }

Why is that significantly faster than plain old wake_up_page(), which
tests waitqueue_active()?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Nick Piggin

Hugh Dickins wrote:

On Thu, 3 May 2007, Nick Piggin wrote:


@@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

+   set_bit(PG_waiters, &page->flags);
+   if (unlikely(!TestSetPageLocked(page))) {


What happens if another cpu is coming through __lock_page at the
same time, did its set_bit, now finds PageLocked, and so proceeds
to the __wait_on_bit_lock?  But this cpu now clears PG_waiters,
so this task's unlock_page won't wake the other?


You're right, we can't clear the bit here. Doubt it mattered much anyway?



Ah yes, that's a good easy answer.  In fact, just remove this whole
test and block (we already tried TestSetPageLocked outside just a
short while ago, so this repeat won't often save anything).


Yeah, I was getting too clever for my own boots :)

I think the patch has merit though. It's unfortunate that it uses another page
flag, however it seemed to give quite a bit of speedup on unlock_page (probably
from both the barriers and the extra random cacheline load from the hash).

I guess it has to get good results from more benchmarks...



BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page
above... that barrier is in the slow path as well though, so it shouldn't
matter either.



I vaguely wondered how such barriers had managed to dissolve away,
but cranking my brain up to think about barriers takes far too long.


That barrier was one too many :)

However I believe the fastpath barrier can go away because the PG_locked
operation is on the same cacheline as PG_waiters.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Hugh Dickins
On Thu, 3 May 2007, Nick Piggin wrote:
> >>@@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
> > > {
> > >  DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
> > >
> > >+  set_bit(PG_waiters, &page->flags);
> > >+  if (unlikely(!TestSetPageLocked(page))) {
> > 
> > What happens if another cpu is coming through __lock_page at the
> > same time, did its set_bit, now finds PageLocked, and so proceeds
> > to the __wait_on_bit_lock?  But this cpu now clears PG_waiters,
> > so this task's unlock_page won't wake the other?
> 
> You're right, we can't clear the bit here. Doubt it mattered much anyway?

Ah yes, that's a good easy answer.  In fact, just remove this whole
test and block (we already tried TestSetPageLocked outside just a
short while ago, so this repeat won't often save anything).
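
(With that block removed, the slow path would reduce to roughly the sketch
below, leaving PG_waiters set so that a racing locker's unlock_page() still
wakes this task:)

void fastcall __lock_page(struct page *page)
{
	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

	set_bit(PG_waiters, &page->flags);
	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
						TASK_UNINTERRUPTIBLE);
}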

> 
> BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page
> above... that barrier is in the slow path as well though, so it shouldn't
> matter either.

I vaguely wondered how such barriers had managed to dissolve away,
but cranking my brain up to think about barriers takes far too long.

> > >+  clear_bit(PG_waiters, &page->flags);
> > >+  return;
> > >+  }
> > >  __wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
> > >TASK_UNINTERRUPTIBLE);
> >> }
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Nick Piggin

Christoph Hellwig wrote:

On Thu, May 03, 2007 at 11:32:23AM +1000, Nick Piggin wrote:


The attached patch gets performance up a bit by avoiding some
barriers and some cachelines:

G5
pagefault   fork  exec
2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
+patch   1.71-1.73   175.2-180.8   780.5-794.2
+patch2  1.61-1.63   169.8-175.0   748.6-757.0

So that brings the fork/exec hits down to much less than 5%, and
would likely speed up other things that lock the page, like write
or page reclaim.



Is that every fork/exec or just under certain circumstances?
A 5% regression on every fork/exec is not acceptable.


Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4
numbers will be improved as well with that patch. Then if we have
specific lock/unlock bitops, I hope it should reduce that further.

The overhead that is there should just be coming from the extra
overhead in the file-backed fault handler. For noop fork/execs,
I think that tends to be more pronounced; it is hard to see any
difference on any non-micro benchmark.

The other thing is that I think there could be some cache effects
happening -- for example the exec numbers on the 2nd line are
disproportionately large.

It definitely isn't a good thing to drop performance anywhere
though, so I'll keep looking for improvements.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Nick Piggin

Hugh Dickins wrote:

On Thu, 3 May 2007, Nick Piggin wrote:


The problem is that lock/unlock_page is expensive on powerpc, and
if we improve that, we improve more than just the fault handler...

The attached patch gets performance up a bit by avoiding some
barriers and some cachelines:



There's a strong whiff of raciness about this...
but I could very easily be wrong.



Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c 2007-05-02 15:00:26.0 +1000
+++ linux-2.6/mm/filemap.c  2007-05-03 08:34:32.0 +1000
@@ -532,11 +532,13 @@
 */
void fastcall unlock_page(struct page *page)
{
+   VM_BUG_ON(!PageLocked(page));
smp_mb__before_clear_bit();
-   if (!TestClearPageLocked(page))
-   BUG();
-	smp_mb__after_clear_bit(); 
-	wake_up_page(page, PG_locked);

+   ClearPageLocked(page);
+   if (unlikely(test_bit(PG_waiters, &page->flags))) {
+   clear_bit(PG_waiters, &page->flags);
+   wake_up_page(page, PG_locked);
+   }
}
EXPORT_SYMBOL(unlock_page);

@@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

+   set_bit(PG_waiters, &page->flags);
+   if (unlikely(!TestSetPageLocked(page))) {



What happens if another cpu is coming through __lock_page at the
same time, did its set_bit, now finds PageLocked, and so proceeds
to the __wait_on_bit_lock?  But this cpu now clears PG_waiters,
so this task's unlock_page won't wake the other?


You're right, we can't clear the bit here. Doubt it mattered much anyway?

BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page
above... that barrier is in the slow path as well though, so it shouldn't
matter either.





+   clear_bit(PG_waiters, &page->flags);
+   return;
+   }
__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
TASK_UNINTERRUPTIBLE);
}






--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Hugh Dickins
On Thu, 3 May 2007, Nick Piggin wrote:
> 
> The problem is that lock/unlock_page is expensive on powerpc, and
> if we improve that, we improve more than just the fault handler...
> 
> The attached patch gets performance up a bit by avoiding some
> barriers and some cachelines:

There's a strong whiff of raciness about this...
but I could very easily be wrong.

> Index: linux-2.6/mm/filemap.c
> ===
> --- linux-2.6.orig/mm/filemap.c   2007-05-02 15:00:26.0 +1000
> +++ linux-2.6/mm/filemap.c2007-05-03 08:34:32.0 +1000
> @@ -532,11 +532,13 @@
>   */
>  void fastcall unlock_page(struct page *page)
>  {
> + VM_BUG_ON(!PageLocked(page));
>   smp_mb__before_clear_bit();
> - if (!TestClearPageLocked(page))
> - BUG();
> - smp_mb__after_clear_bit(); 
> - wake_up_page(page, PG_locked);
> + ClearPageLocked(page);
> + if (unlikely(test_bit(PG_waiters, &page->flags))) {
> + clear_bit(PG_waiters, &page->flags);
> + wake_up_page(page, PG_locked);
> + }
>  }
>  EXPORT_SYMBOL(unlock_page);
>  
> @@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
>  {
>   DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
>
> + set_bit(PG_waiters, &page->flags);
> + if (unlikely(!TestSetPageLocked(page))) {

What happens if another cpu is coming through __lock_page at the
same time, did its set_bit, now finds PageLocked, and so proceeds
to the __wait_on_bit_lock?  But this cpu now clears PG_waiters,
so this task's unlock_page won't wake the other?

> + clear_bit(PG_waiters, &page->flags);
> + return;
> + }
>   __wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
>   TASK_UNINTERRUPTIBLE);
>  }
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Christoph Hellwig
On Thu, May 03, 2007 at 11:32:23AM +1000, Nick Piggin wrote:
> The attached patch gets performance up a bit by avoiding some
> barriers and some cachelines:
> 
> G5
>  pagefault   fork  exec
> 2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
> +patch   1.71-1.73   175.2-180.8   780.5-794.2
> +patch2  1.61-1.63   169.8-175.0   748.6-757.0
> 
> So that brings the fork/exec hits down to much less than 5%, and
> would likely speed up other things that lock the page, like write
> or page reclaim.

 Is that every fork/exec or just under certain circumstances?
A 5% regression on every fork/exec is not acceptable.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Christoph Hellwig
On Thu, May 03, 2007 at 11:32:23AM +1000, Nick Piggin wrote:
 The attached patch gets performance up a bit by avoiding some
 barriers and some cachelines:
 
 G5
  pagefault   fork  exec
 2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
 +patch   1.71-1.73   175.2-180.8   780.5-794.2
 +patch2  1.61-1.63   169.8-175.0   748.6-757.0
 
 So that brings the fork/exec hits down to much less than 5%, and
 would likely speed up other things that lock the page, like write
 or page reclaim.

Is that every fork/exec or just under certain cicumstances?
A 5% regression on every fork/exec is not acceptable.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Hugh Dickins
On Thu, 3 May 2007, Nick Piggin wrote:
 
 The problem is that lock/unlock_page is expensive on powerpc, and
 if we improve that, we improve more than just the fault handler...
 
 The attached patch gets performance up a bit by avoiding some
 barriers and some cachelines:

There's a strong whiff of raciness about this...
but I could very easily be wrong.

 Index: linux-2.6/mm/filemap.c
 ===
 --- linux-2.6.orig/mm/filemap.c   2007-05-02 15:00:26.0 +1000
 +++ linux-2.6/mm/filemap.c2007-05-03 08:34:32.0 +1000
 @@ -532,11 +532,13 @@
   */
  void fastcall unlock_page(struct page *page)
  {
 + VM_BUG_ON(!PageLocked(page));
   smp_mb__before_clear_bit();
 - if (!TestClearPageLocked(page))
 - BUG();
 - smp_mb__after_clear_bit(); 
 - wake_up_page(page, PG_locked);
 + ClearPageLocked(page);
 + if (unlikely(test_bit(PG_waiters, page-flags))) {
 + clear_bit(PG_waiters, page-flags);
 + wake_up_page(page, PG_locked);
 + }
  }
  EXPORT_SYMBOL(unlock_page);
  
 @@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
  {
   DEFINE_WAIT_BIT(wait, page-flags, PG_locked);
  
 + set_bit(PG_waiters, page-flags);
 + if (unlikely(!TestSetPageLocked(page))) {

What happens if another cpu is coming through __lock_page at the
same time, did its set_bit, now finds PageLocked, and so proceeds
to the __wait_on_bit_lock?  But this cpu now clears PG_waiters,
so this task's unlock_page won't wake the other?

 + clear_bit(PG_waiters, page-flags);
 + return;
 + }
   __wait_on_bit_lock(page_waitqueue(page), wait, sync_page,
   TASK_UNINTERRUPTIBLE);
  }
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Nick Piggin

Hugh Dickins wrote:

On Thu, 3 May 2007, Nick Piggin wrote:


The problem is that lock/unlock_page is expensive on powerpc, and
if we improve that, we improve more than just the fault handler...

The attached patch gets performance up a bit by avoiding some
barriers and some cachelines:



There's a strong whiff of raciness about this...
but I could very easily be wrong.



Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c 2007-05-02 15:00:26.0 +1000
+++ linux-2.6/mm/filemap.c  2007-05-03 08:34:32.0 +1000
@@ -532,11 +532,13 @@
 */
void fastcall unlock_page(struct page *page)
{
+   VM_BUG_ON(!PageLocked(page));
smp_mb__before_clear_bit();
-   if (!TestClearPageLocked(page))
-   BUG();
-	smp_mb__after_clear_bit(); 
-	wake_up_page(page, PG_locked);

+   ClearPageLocked(page);
+   if (unlikely(test_bit(PG_waiters, page-flags))) {
+   clear_bit(PG_waiters, page-flags);
+   wake_up_page(page, PG_locked);
+   }
}
EXPORT_SYMBOL(unlock_page);

@@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
{
DEFINE_WAIT_BIT(wait, page-flags, PG_locked);

+   set_bit(PG_waiters, page-flags);
+   if (unlikely(!TestSetPageLocked(page))) {



What happens if another cpu is coming through __lock_page at the
same time, did its set_bit, now finds PageLocked, and so proceeds
to the __wait_on_bit_lock?  But this cpu now clears PG_waiters,
so this task's unlock_page won't wake the other?


You're right, we can't clear the bit here. Doubt it mattered much anyway?

BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page
above... that barrier is in the slow path as well though, so it shouldn't
matter either.





+   clear_bit(PG_waiters, page-flags);
+   return;
+   }
__wait_on_bit_lock(page_waitqueue(page), wait, sync_page,
TASK_UNINTERRUPTIBLE);
}






--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Nick Piggin

Christoph Hellwig wrote:

On Thu, May 03, 2007 at 11:32:23AM +1000, Nick Piggin wrote:


The attached patch gets performance up a bit by avoiding some
barriers and some cachelines:

G5
pagefault   fork  exec
2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
+patch   1.71-1.73   175.2-180.8   780.5-794.2
+patch2  1.61-1.63   169.8-175.0   748.6-757.0

So that brings the fork/exec hits down to much less than 5%, and
would likely speed up other things that lock the page, like write
or page reclaim.



Is that every fork/exec or just under certain cicumstances?
A 5% regression on every fork/exec is not acceptable.


Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4
numbers will be improved as well with that patch. Then if we have
specific lock/unlock bitops, I hope it should reduce that further.

The overhead that is there should just be coming from the extra
overhead in the file backed fault handler. For noop fork/execs,
I think that tends to be more pronounced, it is hard to see any
difference on any non-micro benchmark.

The other thing is that I think there could be some cache effects
happening -- for example the exec numbers on the 2nd line are
disproportionately large.

It definitely isn't a good thing to drop performance anywhere
though, so I'll keep looking for improvements.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Hugh Dickins
On Thu, 3 May 2007, Nick Piggin wrote:
 @@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
   {
      DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
  +  set_bit(PG_waiters, &page->flags);
  +  if (unlikely(!TestSetPageLocked(page))) {
  
  What happens if another cpu is coming through __lock_page at the
  same time, did its set_bit, now finds PageLocked, and so proceeds
  to the __wait_on_bit_lock?  But this cpu now clears PG_waiters,
  so this task's unlock_page won't wake the other?
 
 You're right, we can't clear the bit here. Doubt it mattered much anyway?

Ah yes, that's a good easy answer.  In fact, just remove this whole
test and block (we already tried TestSetPageLocked outside just a
short while ago, so this repeat won't often save anything).
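To illustrate, the resulting slow path would look something like this --
just a sketch reconstructed from the posted diff (with &page->flags where
the archive mangled it), not a tested change:

	void fastcall __lock_page(struct page *page)
	{
		DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

		/* announce the waiter before sleeping; unlock_page tests this bit */
		set_bit(PG_waiters, &page->flags);
		__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
							TASK_UNINTERRUPTIBLE);
	}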

 
 BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page
 above... that barrier is in the slow path as well though, so it shouldn't
 matter either.

I vaguely wondered how such barriers had managed to dissolve away,
but cranking my brain up to think about barriers takes far too long.

  +  clear_bit(PG_waiters, &page->flags);
  +  return;
  +  }
__wait_on_bit_lock(page_waitqueue(page), wait, sync_page,
  TASK_UNINTERRUPTIBLE);
  }
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Nick Piggin

Hugh Dickins wrote:

On Thu, 3 May 2007, Nick Piggin wrote:


@@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

+   set_bit(PG_waiters, &page->flags);
+   if (unlikely(!TestSetPageLocked(page))) {


What happens if another cpu is coming through __lock_page at the
same time, did its set_bit, now finds PageLocked, and so proceeds
to the __wait_on_bit_lock?  But this cpu now clears PG_waiters,
so this task's unlock_page won't wake the other?


You're right, we can't clear the bit here. Doubt it mattered much anyway?



Ah yes, that's a good easy answer.  In fact, just remove this whole
test and block (we already tried TestSetPageLocked outside just a
short while ago, so this repeat won't often save anything).


Yeah, I was getting too clever for my own boots :)

I think the patch has merit though. It's unfortunate that it uses another page
flag, however it seemed to give quite a bit of speedup on unlock_page (probably
from avoiding both the barriers and an extra random cacheline load (from the hash)).

I guess it has to get good results from more benchmarks...



BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page
above... that barrier is in the slow path as well though, so it shouldn't
matter either.



I vaguely wondered how such barriers had managed to dissolve away,
but cranking my brain up to think about barriers takes far too long.


That barrier was one too many :)

However I believe the fastpath barrier can go away because the PG_locked
operation works on the same cacheline as PG_waiters.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Andrew Morton
On Thu, 03 May 2007 11:32:23 +1000 Nick Piggin [EMAIL PROTECTED] wrote:

  void fastcall unlock_page(struct page *page)
  {
 + VM_BUG_ON(!PageLocked(page));
   smp_mb__before_clear_bit();
 - if (!TestClearPageLocked(page))
 - BUG();
 - smp_mb__after_clear_bit(); 
 - wake_up_page(page, PG_locked);
 + ClearPageLocked(page);
 + if (unlikely(test_bit(PG_waiters, &page->flags))) {
 + clear_bit(PG_waiters, &page->flags);
 + wake_up_page(page, PG_locked);
 + }
  }

Why is that significantly faster than plain old wake_up_page(), which
tests waitqueue_active()?
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-03 Thread Nick Piggin

Andrew Morton wrote:

On Thu, 03 May 2007 11:32:23 +1000 Nick Piggin [EMAIL PROTECTED] wrote:



void fastcall unlock_page(struct page *page)
{
+   VM_BUG_ON(!PageLocked(page));
smp_mb__before_clear_bit();
-   if (!TestClearPageLocked(page))
-   BUG();
-	smp_mb__after_clear_bit(); 
-	wake_up_page(page, PG_locked);

+   ClearPageLocked(page);
+   if (unlikely(test_bit(PG_waiters, &page->flags))) {
+   clear_bit(PG_waiters, &page->flags);
+   wake_up_page(page, PG_locked);
+   }
}



Why is that significantly faster than plain old wake_up_page(), which
tests waitqueue_active()?


Because it needs fewer barriers and doesn't touch a random hash
cacheline in the fastpath.
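
For comparison, the stock path always does the hash lookup -- this is a
sketch from memory of the current code, so details may be slightly off:

	/* existing unlock_page: two full barriers plus page_waitqueue() */
	smp_mb__before_clear_bit();
	if (!TestClearPageLocked(page))
		BUG();
	smp_mb__after_clear_bit();
	wake_up_page(page, PG_locked);	/* hashes the page and pulls in the
					 * waitqueue cacheline just to do
					 * the waitqueue_active() test */

whereas with PG_waiters the no-waiter case never leaves page->flags:

	ClearPageLocked(page);
	if (unlikely(test_bit(PG_waiters, &page->flags)))
		wake_up_page(page, PG_locked);	/* slow path only */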

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-02 Thread Nick Piggin

Hugh Dickins wrote:

On Wed, 2 May 2007, Nick Piggin wrote:


[snip]


More on-topic, since you suggest doing more within vmtruncate_range
than the filesystem: no, I'm afraid that's misdesigned, and I want
to move almost all of it into the filesystem ->truncate_range.
Because, if what vmtruncate_range is doing before it gets to the
filesystem isn't to be just a waste of time, the filesystem needs
to know what's going on in advance - just as notify_change warns
the filesystem about a coming truncation.  But easier than inventing
some new notification is to move it all into the filesystem, with
unmap_mapping_range+truncate_inode_pages_range its library helpers.


Well I would prefer it to follow the same pattern as regular
truncate. I don't think it is misdesigned to call the filesystem
_first_, but I think if you do that then the filesystem should
call the vm to prepare / finish truncate, rather than open code
calls to unmap itself.



But I'm pretty sure (to use your words!) regular truncate was not racy
before: I believe Andrea's sequence count was handling that case fine,
without a second unmap_mapping_range.


OK, I think you're right. I _think_ it should also be OK with the
lock_page version as well: we should not be able to have any pages
after the first unmap_mapping_range call, because of the i_size
write. So if we have no pages, there is nothing to 'cow' from.
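
(The ordering I'm relying on is the one regular truncate already does --
roughly, from memory, so treat this as a sketch rather than the exact code:

	i_size_write(inode, offset);			/* new size visible first */
	unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
	truncate_inode_pages(mapping, offset);		/* nothing left beyond i_size */

so once the unmap has run, a new fault sees the updated i_size and there
is no page left to COW from.)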



I'd be delighted if you can remove those later unmap_mapping_ranges.
As I recall, the important thing for the copy pages is to be holding
the page lock (or whatever other serialization) on the copied page
still while the copy page is inserted into pagetable: that looks
to be so in your __do_fault.


Yeah, I think my thought process went wrong on those... I'll
revisit.



But it is a shame, and leaves me wondering what you gained with the
page lock there.

One thing gained is ease of understanding, and if your later patches
build an edifice upon the knowledge of holding that page lock while
faulting, I've no wish to undermine that foundation.


It also fixes a bug, doesn't it? ;)



Well, I'd come to think that perhaps the bugs would be solved by
that second unmap_mapping_range alone, so the pagelock changes
just a misleading diversion.

I'm not sure how I feel about that: calling unmap_mapping_range a
second time feels such a cheat, but if (big if) it does solve the
races, and the pagelock method is as expensive as your numbers
now suggest...


Well aside from being terribly ugly, it means we can still drop
the dirty bit where we'd otherwise rather not, so I don't think
we can do that.

I think there may be some way we can do this without taking the
page lock, and I was going to look at it, but I think it is
quite neat to just lock the page...

I don't think performance is _that_ bad. On the P4 it is a couple
of % on the microbenchmarks. The G5 is worse, but even then I
don't think it is a showstopper. I'll try to improve that and get back to you.

The problem is that lock/unlock_page is expensive on powerpc, and
if we improve that, we improve more than just the fault handler...

The attached patch gets performance up a bit by avoiding some
barriers and some cachelines:

G5
 pagefault   fork  exec
2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
+patch   1.71-1.73   175.2-180.8   780.5-794.2
+patch2  1.61-1.63   169.8-175.0   748.6-757.0

So that brings the fork/exec hits down to much less than 5%, and
would likely speed up other things that lock the page, like write
or page reclaim.

I think we could get further performance improvement by
implementing arch specific bitops for lock/unlock operations,
so we don't need to use things like smp_mb__before_clear_bit()
if they aren't needed or full barriers in the test_and_set_bit().
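
Something like this is what I have in mind -- hypothetical helper names,
just to show the acquire/release idea; the generic fallbacks would keep
the full barriers and architectures could override them:

	/* test-and-set that only needs acquire semantics */
	static inline int test_and_set_bit_lock(int nr, volatile unsigned long *addr)
	{
		return test_and_set_bit(nr, addr);	/* generic fallback */
	}

	/* clear with release semantics, instead of an explicit
	 * smp_mb__before_clear_bit() at every unlock site */
	static inline void clear_bit_unlock(int nr, volatile unsigned long *addr)
	{
		smp_mb__before_clear_bit();		/* generic fallback */
		clear_bit(nr, addr);
	}

then lock_page/unlock_page would use those, and an architecture like
powerpc could provide lighter-weight versions.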

--
SUSE Labs, Novell Inc.

Index: linux-2.6/include/linux/page-flags.h
===
--- linux-2.6.orig/include/linux/page-flags.h   2007-04-24 10:39:56.0 +1000
+++ linux-2.6/include/linux/page-flags.h        2007-05-03 08:38:53.0 +1000
@@ -91,6 +91,8 @@
 #define PG_nosave_free 18  /* Used for system suspend/resume */
 #define PG_buddy   19  /* Page is free, on buddy lists */
 
+#define PG_waiters 20  /* Page has PG_locked waiters */
+
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked PG_owner_priv_1 /* Used by some filesystems */
 
Index: linux-2.6/include/linux/pagemap.h
===
--- linux-2.6.orig/include/linux/pagemap.h  2007-04-24 10:39:56.0 +1000
+++ linux-2.6/include/linux/pagemap.h   2007-05-03 08:35:08.0 +1000
@@ -141,7 +141,7 @@
 static inline void lock_page(struct page *page)
 {
might_sleep();
-   if (TestSetPageLocked(page))
+   if (unlikely(TestSetPageLocked(page)))
__lock_page(page);
 }
 
@@ -152,7 +152,7 @@
 static inline void 

Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-02 Thread Hugh Dickins
On Wed, 2 May 2007, Nick Piggin wrote:
> Hugh Dickins wrote:
> > 
> > But I was quite disappointed when
> > mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch
> > appeared, putting double unmap_mapping_range calls in.  Certainly
> > you were wrong to take the one out, but a pity to end up with two.
> > 
> > Your comment says/said:
> > The nopage vs invalidate race fix patch did not take care of truncating
> > private COW pages. Mind you, I'm pretty sure this was previously racy
> > even for regular truncate, not to mention vmtruncate_range.
> > 
> > vmtruncate_range (holepunch) was deficient I agree, and though we
> > can now take out your second unmap_mapping_range there, that's only
> > because I've slipped one into shmem_truncate_range.  In due course it
> > needs to be properly handled by noting the range in shmem inode info.
> > 
> > (I think you couldn't take that approach, noting invalid range in
> > ->mapping while invalidating, because NFS has/had some cases of
> > invalidate_whatever without i_mutex?)
> 
> Sorry, I didn't parse this?

I meant that i_size is used to protect against truncation races, but
we have no equivalent inval_start,inval_end in the struct inode or
struct address_space, such as could be used for similar protection
against races while invalidating.

And that IIRC there are places where NFS was doing the invalidation
without i_mutex: so there could be concurrent invalidations, so one
inval_start,inval_end in the structure wouldn't be enough anyway.

> But I wonder whether it is better to do
> it in vmtruncate_range than the filesystem? Private COWed pages are
> not really a filesystem "thing"...

It wasn't the thought of private COWed pages which made me put a
second unmap_mapping_range in shmem_truncate_range, it was its own
internal file<->swap consistency which needed that (as a quick fix).
The real fix would be having a trunc_start,trunc_end or whatever in
the shmem_inode_info (assuming it's not wanted in the common inode:
might be if holepunch spreads e.g. it's been mentioned with fallocate).
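
(Sketchily, something like this -- hypothetical fields, nothing of the
sort exists in the tree yet:

	struct shmem_inode_info {
		spinlock_t	lock;
		/* existing fields elided */
		loff_t		trunc_start;	/* holepunch range in flight */
		loff_t		trunc_end;
	};

so that shmem_getpage could refuse to instantiate a page inside that
range while the punch is in progress.)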

Re private COWed pages and holepunch: Miklos and I agree that really
it would be better for holepunch _not_ to remove them - but that's
rather off-topic.

More on-topic, since you suggest doing more within vmtruncate_range
than the filesystem: no, I'm afraid that's misdesigned, and I want
to move almost all of it into the filesystem ->truncate_range.
Because, if what vmtruncate_range is doing before it gets to the
filesystem isn't to be just a waste of time, the filesystem needs
to know what's going on in advance - just as notify_change warns
the filesystem about a coming truncation.  But easier than inventing
some new notification is to move it all into the filesystem, with
unmap_mapping_range+truncate_inode_pages_range its library helpers.
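
Sketchily, for tmpfs that would look something like this -- illustrative
only, not a worked patch, with error handling and alignment glossed over:

	static void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end)
	{
		struct address_space *mapping = inode->i_mapping;

		/* the "library helper" part, now driven by the filesystem:
		 * zap ptes over the hole, then drop the pagecache pages */
		unmap_mapping_range(mapping, start, end - start + 1, 0);
		truncate_inode_pages_range(mapping, start, end);

		/* filesystem-private part: release the backing blocks or
		 * swap entries for the punched range */
	}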

> 
> > But I'm pretty sure (to use your words!) regular truncate was not racy
> > before: I believe Andrea's sequence count was handling that case fine,
> > without a second unmap_mapping_range.
> 
> OK, I think you're right. I _think_ it should also be OK with the
> lock_page version as well: we should not be able to have any pages
> after the first unmap_mapping_range call, because of the i_size
> write. So if we have no pages, there is nothing to 'cow' from.

I'd be delighted if you can remove those later unmap_mapping_ranges.
As I recall, the important thing for the copy pages is to be holding
the page lock (or whatever other serialization) on the copied page
still while the copy page is inserted into pagetable: that looks
to be so in your __do_fault.

> > But it is a shame, and leaves me wondering what you gained with the
> > page lock there.
> > 
> > One thing gained is ease of understanding, and if your later patches
> > build an edifice upon the knowledge of holding that page lock while
> > faulting, I've no wish to undermine that foundation.
> 
> It also fixes a bug, doesn't it? ;)

Well, I'd come to think that perhaps the bugs would be solved by
that second unmap_mapping_range alone, so the pagelock changes
just a misleading diversion.

I'm not sure how I feel about that: calling unmap_mapping_range a
second time feels such a cheat, but if (big if) it does solve the
races, and the pagelock method is as expensive as your numbers
now suggest...

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-02 Thread Nick Piggin

Nick Piggin wrote:

Hugh Dickins wrote:


On Tue, 1 May 2007, Nick Piggin wrote:




There were concerns that we could do this more cheaply, but I think it
is important to start with a base that is simple and more likely to
be correct and build on that. My testing didn't show any obvious
problems with performance.




I don't see _problems_ with performance, but I do consistently see the
same kind of ~5% degradation in lmbench fork, exec, sh, mmap latency
and page fault tests on SMP, several machines, just as I did last year.



OK. I did run some tests at one stage which didn't show a regression
on my P4; however, I don't know that they were statistically significant.
I'll try a couple more runs and post numbers.


I didn't have enough time tonight to get means/stddev, etc, but the runs
are pretty stable.

Patch tested was just the lock page one.

SMP kernel, tasks bound to 1 CPU:

P4 Xeon
 pagefault   fork  exec
2.6.21   1.67-1.69   140.7-142.0   449.5-460.8
+patch   1.75-1.77   144.0-145.5   456.2-463.0

So it's taken nearly a 5% hit on pagefault, but looks like less than 2% on
fork, so not as bad as your numbers (phew).

G5
 pagefault   fork  exec
2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
+patch   1.71-1.73   175.2-180.8   780.5-794.2

Bigger hit there.

Page faults can be improved a tiny bit by not using a test and clear op
in unlock_page (fewer barriers for the G5).

I don't think that's really a blocker problem for a merge, but I wonder
what we can do to improve it. Lockless pagecache shaves quite a bit of
straight line find_get_page performance there.

Going to a non-sleeping lock might be one way to go in the long term, but
it would require quite a lot of restructuring.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-01 Thread Nick Piggin

Hugh Dickins wrote:

On Tue, 1 May 2007, Nick Piggin wrote:



There were concerns that we could do this more cheaply, but I think it
is important to start with a base that is simple and more likely to
be correct and build on that. My testing didn't show any obvious
problems with performance.



I don't see _problems_ with performance, but I do consistently see the
same kind of ~5% degradation in lmbench fork, exec, sh, mmap latency
and page fault tests on SMP, several machines, just as I did last year.


OK. I did run some tests at one stage which didn't show a regression
on my P4; however, I don't know that they were statistically significant.
I'll try a couple more runs and post numbers.



I'm assuming this patch is the one responsible: at 2.6.20-rc4 time
you posted a set of 10 and a set of 7 patches I tried in versus out;
at 2.6.21-rc3-mm2 time you had a group of patches in -mm I tried in
versus out; with similar results.

I did check the graphs on test.kernel.org, I couldn't see any bad
behaviour there that correlated with this work; though each -mm
has such a variety of new work in it, it's very hard to attribute.
And nobody else has reported any regression from your patches.

I'm inclined to write it off as poorer performance in some micro-
benchmarks, against which we offset the improved understandability
of holding the page lock over the file fault.

But I was quite disappointed when 
mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch

appeared, putting double unmap_mapping_range calls in.  Certainly
you were wrong to take the one out, but a pity to end up with two.

Your comment says/said:
The nopage vs invalidate race fix patch did not take care of truncating
private COW pages. Mind you, I'm pretty sure this was previously racy
even for regular truncate, not to mention vmtruncate_range.

vmtruncate_range (holepunch) was deficient I agree, and though we
can now take out your second unmap_mapping_range there, that's only
because I've slipped one into shmem_truncate_range.  In due course it
needs to be properly handled by noting the range in shmem inode info.

(I think you couldn't take that approach, noting invalid range in
->mapping while invalidating, because NFS has/had some cases of
invalidate_whatever without i_mutex?)


Sorry, I didn't parse this? But I wonder whether it is better to do
it in vmtruncate_range than the filesystem? Private COWed pages are
not really a filesystem "thing"...



But I'm pretty sure (to use your words!) regular truncate was not racy
before: I believe Andrea's sequence count was handling that case fine,
without a second unmap_mapping_range.


OK, I think you're right. I _think_ it should also be OK with the
lock_page version as well: we should not be able to have any pages
after the first unmap_mapping_range call, because of the i_size
write. So if we have no pages, there is nothing to 'cow' from.



Well, I guess I've come to accept that, expensive as unmap_mapping_range
may be, truncating files while they're mmap'ed is perverse behaviour:
perhaps even deserving such punishment.

But it is a shame, and leaves me wondering what you gained with the
page lock there.

One thing gained is ease of understanding, and if your later patches
build an edifice upon the knowledge of holding that page lock while
faulting, I've no wish to undermine that foundation.


It also fixes a bug, doesn't it? ;)

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-01 Thread Hugh Dickins
On Tue, 1 May 2007, Nick Piggin wrote:
> Andrew Morton wrote:
> 
> >  mm-simplify-filemap_nopage.patch
> >  mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
> >  mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> >  mm-merge-nopfn-into-fault.patch
> >  convert-hugetlbfs-to-use-vm_ops-fault.patch
> >  mm-remove-legacy-cruft.patch
> >  mm-debug-check-for-the-fault-vs-invalidate-race.patch
> 
> > mm-fix-clear_page_dirty_for_io-vs-fault-race.patch
> 
> > Miscish MM changes.  Will merge, dependent upon what still applies and works
> > if the moveable-zone patches get stalled.
> 
> These fix some bugs in the core vm, at least the former one we have
> seen numerous people hitting in production...
...
> 
> So, do you or anyone else have any problems with these patches going in
> 2.6.22? I haven't had much feedback for a while, but I was under the
> impression that people are more-or-less happy with them?
> 
> mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
> 
> This patch fixes the core filemap_nopage vs invalidate_inode_pages2
> race by having filemap_nopage return a locked page to do_no_page,
> and removes the fairly complex (and inadequate) truncate_count
> synchronisation logic.
> 
> There were concerns that we could do this more cheaply, but I think it
> is important to start with a base that is simple and more likely to
> be correct and build on that. My testing didn't show any obvious
> problems with performance.

I don't see _problems_ with performance, but I do consistently see the
same kind of ~5% degradation in lmbench fork, exec, sh, mmap latency
and page fault tests on SMP, several machines, just as I did last year.

I'm assuming this patch is the one responsible: at 2.6.20-rc4 time
you posted a set of 10 and a set of 7 patches I tried in versus out;
at 2.6.21-rc3-mm2 time you had a group of patches in -mm I tried in
versus out; with similar results.

I did check the graphs on test.kernel.org, I couldn't see any bad
behaviour there that correlated with this work; though each -mm
has such a variety of new work in it, it's very hard to attribute.
And nobody else has reported any regression from your patches.

I'm inclined to write it off as poorer performance in some micro-
benchmarks, against which we offset the improved understandability
of holding the page lock over the file fault.

But I was quite disappointed when 
mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch
appeared, putting double unmap_mapping_range calls in.  Certainly
you were wrong to take the one out, but a pity to end up with two.

Your comment says/said:
The nopage vs invalidate race fix patch did not take care of truncating
private COW pages. Mind you, I'm pretty sure this was previously racy
even for regular truncate, not to mention vmtruncate_range.

vmtruncate_range (holepunch) was deficient I agree, and though we
can now take out your second unmap_mapping_range there, that's only
because I've slipped one into shmem_truncate_range.  In due course it
needs to be properly handled by noting the range in shmem inode info.

(I think you couldn't take that approach, noting invalid range in
->mapping while invalidating, because NFS has/had some cases of
invalidate_whatever without i_mutex?)

But I'm pretty sure (to use your words!) regular truncate was not racy
before: I believe Andrea's sequence count was handling that case fine,
without a second unmap_mapping_range.

Well, I guess I've come to accept that, expensive as unmap_mapping_range
may be, truncating files while they're mmap'ed is perverse behaviour:
perhaps even deserving such punishment.

But it is a shame, and leaves me wondering what you gained with the
page lock there.

One thing gained is ease of understanding, and if your later patches
build an edifice upon the knowledge of holding that page lock while
faulting, I've no wish to undermine that foundation.

> 
> mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> mm-merge-nopfn-into-fault.patch
> etc.
> 
> These move ->nopage, ->populate, ->nopfn (and soon, ->page_mkwrite)
> into a single, unified interface. Although this strictly closes some
> similar holes in nonlinear faults as well, they are very uncommon, so
> I wouldn't be so upset if these aren't merged in 2.6.22 (I don't see
> any reason not to, but at least they don't fix major bugs).

I don't have an opinion on these, but I think BenH and others
were strongly in favour, with various people waiting upon them.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-01 Thread Andrew Morton
On Tue, 01 May 2007 18:44:07 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

> >  mm-simplify-filemap_nopage.patch
> >  mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
> >  mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> >  mm-merge-nopfn-into-fault.patch
> >  convert-hugetlbfs-to-use-vm_ops-fault.patch
> >  mm-remove-legacy-cruft.patch
> >  mm-debug-check-for-the-fault-vs-invalidate-race.patch
> 
> >  mm-fix-clear_page_dirty_for_io-vs-fault-race.patch
> 
> > Miscish MM changes.  Will merge, dependent upon what still applies and works
> > if the moveable-zone patches get stalled.
> 
> These fix some bugs in the core vm, at least the former one we have
> seen numerous people hitting in production...
> 
> I don't suppose you mean these are logically dependent on new features
> sitting below them in your patch stack, just that you don't want to
> spend time fixing a lot of rejects?

It'll probably be OK - I just haven't checked yet.  I'm fairly handy at
fixing rejects nowadays ;)


Nobody seems to be taking up this opportunity to provide us with review
and test results on the antifrag patches.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22 -mm merge plans -- vm bugfixes

2007-05-01 Thread Nick Piggin

Andrew Morton wrote:


 mm-simplify-filemap_nopage.patch
 mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
 mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
 mm-merge-nopfn-into-fault.patch
 convert-hugetlbfs-to-use-vm_ops-fault.patch
 mm-remove-legacy-cruft.patch
 mm-debug-check-for-the-fault-vs-invalidate-race.patch



 mm-fix-clear_page_dirty_for_io-vs-fault-race.patch



Miscish MM changes.  Will merge, dependent upon what still applies and works
if the moveable-zone patches get stalled.


These fix some bugs in the core vm, at least the former one we have
seen numerous people hitting in production...

I don't suppose you mean these are logically dependent on new features
sitting below them in your patch stack, just that you don't want to
spend time fixing a lot of rejects? If so, I can help fix those up, but
I don't think there is anything major, IIRC the biggest annoyance is
just that changing some GFP_types throws some big hunks.

So, do you or anyone else have any problems with these patches going in
2.6.22? I haven't had much feedback for a while, but I was under the
impression that people are more-or-less happy with them?

mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch

This patch fixes the core filemap_nopage vs invalidate_inode_pages2
race by having filemap_nopage return a locked page to do_no_page,
and removes the fairly complex (and inadequate) truncate_count
synchronisation logic.

There were concerns that we could do this more cheaply, but I think it
is important to start with a base that is simple and more likely to
be correct and build on that. My testing didn't show any obvious
problems with performance.
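
In outline, the synchronisation it buys us is the following -- the shape
of it only, not the actual patch text:

	/* fault side, do_no_page: */
	page = vma->vm_ops->nopage(vma, address, &type);  /* comes back locked */
	if (unlikely(page->mapping != mapping)) {
		/* lost a race with truncate/invalidate: the page was
		 * removed from pagecache while we looked it up */
		unlock_page(page);
		page_cache_release(page);
		goto retry;
	}
	/* install the pte while the page is still locked */
	unlock_page(page);

	/* truncate/invalidate side: takes lock_page() before removing a
	 * page from pagecache, so the check and pte insertion above
	 * cannot interleave with the removal */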

mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
mm-merge-nopfn-into-fault.patch
etc.

These move ->nopage, ->populate, ->nopfn (and soon, ->page_mkwrite)
into a single, unified interface. Although this strictly closes some
similar holes in nonlinear faults as well, they are very uncommon, so
I wouldn't be so upset if these aren't merged in 2.6.22 (I don't see
any reason not to, but at least they don't fix major bugs).

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

