Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-14 Thread Andrew Morton
Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
>
> you're right something might be different now
>  that we don't follow a swapout virtual address space order anymore

There's a patch in -mm which attempts to do so, and afair, succeeds.

However the performance seems to be crappy.  Its main benefit at present
is in reducing worst-case scheduling latencies (scan_swap_map).

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc3/2.6.11-rc3-mm2/broken-out/swapspace-layout-improvements-fix.patch
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-14 Thread Andrea Arcangeli
On Mon, Feb 14, 2005 at 06:36:43PM +, Hugh Dickins wrote:
> On Mon, 14 Feb 2005, Andrea Arcangeli wrote:
> > > By the way, while we're talking of remove_exclusive_swap_page:
> > > a more functional issue I sometimes wonder about, why don't we
> > > remove_exclusive_swap_page on write fault?  Keeping the swap slot
> > > is valuable if read fault, but once the page is dirtied, wouldn't
> > > it usually be better to free that slot and allocate another later?
> > 
> > Avoiding swap fragmentation is one reason to leave it allocated. So you
> > can swapin/swapout/swapin/swapout always in the same place on disk as
> > long as there's plenty of swap still available. I'm not sure how much
> > speedup this provides, but certainly it makes sense.
> 
> I rather thought it would tend to increase swap fragmentation: that
> the next time (if) this page has to be written out to swap, the disk
> has to seek back to some ancient position to write this page, when
> the rest of the cluster being written is more likely to come from a
> recently allocated block of contiguous swap pages (though if many of
> them are being rewritten rather than newly allocated, they'll all be
> all over the disk, no contiguity at all).
> 
> Of course, freeing as soon as dirty does leave a hole behind, which
> tends towards swap fragmentation: but I thought the swap allocator
> tried for contiguous clusters before it fell back on isolated pages
> (I haven't checked, and akpm has changes to swap allocation in -mm).
> 
> Hmm, I think you're thinking of the overall fragmentation of swap,
> and are correct about that; whereas I'm saying "fragmentation"
> when what I'm really concerned about is increased seeking.

Swapouts aren't the problem. The swapins with physical readahead are the
ones that benefit from the reduced overall fragmentation. Or at least
that was the case in 2.4; you're right that something might be different
now that we don't follow a swapout virtual address space order anymore
(but there is probably still some localization effect during the
->nopage faults).


Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-14 Thread Hugh Dickins
On Mon, 14 Feb 2005, Andrea Arcangeli wrote:
> > By the way, while we're talking of remove_exclusive_swap_page:
> > a more functional issue I sometimes wonder about, why don't we
> > remove_exclusive_swap_page on write fault?  Keeping the swap slot
> > is valuable if read fault, but once the page is dirtied, wouldn't
> > it usually be better to free that slot and allocate another later?
> 
> Avoiding swap fragmentation is one reason to leave it allocated. So you
> can swapin/swapout/swapin/swapout always in the same place on disk as
> long as there's plenty of swap still available. I'm not sure how much
> speedup this provides, but certainly it makes sense.

I rather thought it would tend to increase swap fragmentation: that
the next time (if) this page has to be written out to swap, the disk
has to seek back to some ancient position to write this page, when
the rest of the cluster being written is more likely to come from a
recently allocated block of contiguous swap pages (though if many of
them are being rewritten rather than newly allocated, they'll all be
all over the disk, no contiguity at all).

Of course, freeing as soon as dirty does leave a hole behind, which
tends towards swap fragmentation: but I thought the swap allocator
tried for contiguous clusters before it fell back on isolated pages
(I haven't checked, and akpm has changes to swap allocation in -mm).

Hmm, I think you're thinking of the overall fragmentation of swap,
and are correct about that; whereas I'm saying "fragmentation"
when what I'm really concerned about is increased seeking.

But only research would tell the true story.  I suspect the change
from 2.4's linear vma scanning to 2.6's rmap against the bottom of
LRU may have changed the rules for swap layout strategy.

Hugh


Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-11 Thread Hugh Dickins
On Fri, 11 Feb 2005, Andrea Arcangeli wrote:
> 
> Ok, I'm quite convinced it's correct now. The only thing that can make
> mapcount go up without the lock on the page, and without userspace
> intervention (userspace intervention would make it undefined behaviour,
> like in my example with fork), was the swapin, and you covered it by
> moving the unlock after page_add_anon_rmap (so mapcount changes
> atomically with page_swapcount there too). Swapoff was already doing
> it under the page lock.

Thanks a lot for thinking it through, yes, that's how it is.

(For a while I felt nervous about moving that unlock_page below
the arch-defined flush_icache_page; but then realized that since it's
already done with page_table spinlock, PG_locked cannot be an issue.)

> Then we should use the mapcount/swapcount in remove_exclusive_swap_page
> too.

Originally I thought so, but later wasn't so sure.  There might be
somewhere which stabilizes PageSwapCache by incrementing page_count,
rechecks it, waits to get lock_page, then assumes still PageSwapCache?
(Though it's hard to see why it would need to make such an assumption,
and in the equivalent file case would have to allow for truncation.)

It just needs a wider audit than the simpler can_share_swap_page case,
and can be done independently later on.

By the way, while we're talking of remove_exclusive_swap_page:
a more functional issue I sometimes wonder about, why don't we
remove_exclusive_swap_page on write fault?  Keeping the swap slot
is valuable if read fault, but once the page is dirtied, wouldn't
it usually be better to free that slot and allocate another later?

But I'm always scared of making such changes to swapping, because
I cannot imagine a good enough range of swap performance tests.

Hugh


Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-11 Thread Andrea Arcangeli
On Fri, Feb 11, 2005 at 07:23:09AM +, Hugh Dickins wrote:
> And it's fine for the behaviour to be somewhat undefined in this
> peculiar case: the important thing is just that the page must not
> be freed and reused while I/O occurs, hence get_user_page raising
> the page_count - which I'm _not_ proposing to change!

Ok, I'm quite convinced it's correct now. The only thing that can make
mapcount go up without the lock on the page, and without userspace
intervention (userspace intervention would make it undefined behaviour,
like in my example with fork), was the swapin, and you covered it by
moving the unlock after page_add_anon_rmap (so mapcount changes
atomically with page_swapcount there too). Swapoff was already doing
it under the page lock.

Then we should use the mapcount/swapcount in remove_exclusive_swap_page
too.


Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-10 Thread Hugh Dickins
On Thu, 10 Feb 2005, Andrea Arcangeli wrote:
> On Thu, Feb 10, 2005 at 08:19:47PM +, Hugh Dickins wrote:
> > 
> > get_user_pages broke COW in advance of the I/O.  The reason for
> > subsequent COW if the page gets unmapped is precisely because
> > can_share_swap_page used page_count to judge exclusivity, and
> > get_user_pages has bumped that up, so the page appears (in danger
> > of being) non-exclusive when actually it is exclusive.  By changing
> > can_share_swap_page to use page_mapcount, that frustration vanishes.
> 
> What if a second thread does a fork() while the rawio is in progress?
> The page will again be non-shareable, and if you unmap it the rawio
> will be corrupted if the thread that executed the fork while the I/O
> is in progress writes to part of the page again after it has been
> unmapped (obviously the part of the page that isn't under read-rawio;
> rawio works with the hardblocksize, not on the whole page).

I think there's no difference here between the two techniques.

With the new can_share_swap_page, yes, "page_swapcount" goes up with
the fork, the page will be judged non-exclusive, a user write to
another part of the page will replace it by a copy page, and it's
undefined how much of the I/O will be captured in the copy.

But even with the old can_share_swap_page, and try_to_unmap_one
refusing to unmap the page, copy_page_range's COW checking would
still remove write permission from the pte of an anonymous
page, so a user write to another part of the page would find
raised page_count and replace it by a copy page: same behaviour.

And it's fine for the behaviour to be somewhat undefined in this
peculiar case: the important thing is just that the page must not
be freed and reused while I/O occurs, hence get_user_page raising
the page_count - which I'm _not_ proposing to change!

Hugh


Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-10 Thread Andrea Arcangeli
On Thu, Feb 10, 2005 at 08:19:47PM +, Hugh Dickins wrote:
> On Thu, 10 Feb 2005, Andrea Arcangeli wrote:
> > 
> > The reason pinned pages cannot be unmapped is that if they're being
> > unmapped while a rawio read is in progress, a cow on some shared
> > swapcache under I/O could happen.
> > 
> > If a try_to_unmap + COW over a shared swapcache happens during a rawio
> > read, then the rawio reads will get lost generating data corruption.
> 
> Yes, but...
> 
> get_user_pages broke COW in advance of the I/O.  The reason for
> subsequent COW if the page gets unmapped is precisely because
> can_share_swap_page used page_count to judge exclusivity, and
> get_user_pages has bumped that up, so the page appears (in danger
> of being) non-exclusive when actually it is exclusive.  By changing
> can_share_swap_page to use page_mapcount, that frustration vanishes.

What if a second thread does a fork() while the rawio is in progress?
The page will again be non-shareable, and if you unmap it the rawio
will be corrupted if the thread that executed the fork while the I/O
is in progress writes to part of the page again after it has been
unmapped (obviously the part of the page that isn't under read-rawio;
rawio works with the hardblocksize, not on the whole page).


Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-10 Thread Hugh Dickins
On Thu, 10 Feb 2005, Andrea Arcangeli wrote:
> 
> The reason pinned pages cannot be unmapped is that if they're being
> unmapped while a rawio read is in progress, a cow on some shared
> swapcache under I/O could happen.
> 
> If a try_to_unmap + COW over a shared swapcache happens during a rawio
> read, then the rawio reads will get lost generating data corruption.

Yes, but...

get_user_pages broke COW in advance of the I/O.  The reason for
subsequent COW if the page gets unmapped is precisely because
can_share_swap_page used page_count to judge exclusivity, and
get_user_pages has bumped that up, so the page appears (in danger
of being) non-exclusive when actually it is exclusive.  By changing
can_share_swap_page to use page_mapcount, that frustration vanishes.

(Of course, if the process forks while async I/O is in progress,
these I/O pages could still get COWed, and either parent or child
lose some of the data being read - but behaviour in that case is
anyway undefined, isn't it?  Just so long as kernel doesn't crash.)

> I haven't had much time to check the patch yet, but it's quite complex:
> could you explain again how you prevent a COW from happening while the
> rawio is in flight if you don't pin the pte? The PG_locked bitflag
> cannot matter at all because the rawio read won't lock the page. What
> you have to prevent is the pte being zeroed out; it must be kept
> writeable during the whole I/O. I don't see how you can allow unmapping
> without running into COWs later on.

The page_mapcount+page_swapcount test for exclusivity is simply a
better test for exclusivity than the page_count test.  Since it says
exclusive when the page is exclusive, no COW occurs (and I've just
remembered that we're actually talking of the odd case where the
process itself writes into another portion of the page under I/O,
to get that write fault: but that still works).

page_count does still play a vital role, in ensuring that the page
stays held in the swapcache, cannot actually be reused for something
else while it's unmapped.

> This is the only thing I care to understand really, since it's the only
> case that the pte pinning was fixing.
> 
> Overall I see nothing wrong in preventing memory removal while rawio is
> in flight. rawio cannot be in flight forever (ptrace and core dump as
> well should complete eventually). Why can't you simply drop pages from
> the freelist one by one until all of them are removed from the freelist?

I did ask Toshihiro about that earlier in the thread.  Best look up his
actual reply, but as I understand it, his test is sufficiently thorough
that it's actually doing direct I/Os in parallel, so there's never a
moment when the page is free.  He believes that an unprivileged process
should not be allowed to keep pages unmigratable indefinitely.

I have little appreciation of the hotplug/memmigration issues here.
But whatever the merits of that, I do think the new can_share_swap_page
is an improvement over the old (written in the days before page_mapcount
existed), and that it's preferable to remove the page_count heuristic
from try_to_unmap_one.

Hugh


Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-10 Thread Andrea Arcangeli
On Thu, Feb 10, 2005 at 11:16:44AM -0800, Dave Hansen wrote:
> We actually do that, in addition to the more active methods of capturing
> the memory that we're about to remove.

This sounds like the best way to go. Btw, is this a different codebase
from the one that Toshihiro is testing?

> You're right, I don't really see a problem with ignoring those pages, at
> least in the active migration code.  We would, of course, like to keep
> the number of things that we depend on good faith to get migrated to a
> minimum, but things under I/O are the least of our problems.

Indeed. It's very similar to locked pages. All pages can be pinned for a
transient amount of time, either in the pte or with
PG_pinned/PG_writeback (now perhaps Hugh found a way to drop the pin on
the pte [I'm still unconvinced about that], but you're surely still left
with transient pinning via PG_locked or PG_writeback).

> The only thing we might want to do is put something in the rawio code to
> look for the PG_capture pages to ensure that it gives the migration code
> a shot at them every once in a while (when I/O is not in flight,
> obviously).

If there are persistent users, PG_capture sounds like a good idea.
Perhaps the whole problem Toshihiro is having is that there really are
persistent users that require PG_capture? He mentioned direct IO, which
doesn't take long; a core dump, on the other hand, could take a long time.


Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-10 Thread Dave Hansen
On Thu, 2005-02-10 at 20:05 +0100, Andrea Arcangeli wrote:
> Overall I see nothing wrong in preventing memory removal while rawio is
> in flight. rawio cannot be in flight forever (ptrace and core dump as
> well should complete eventually). Why can't you simply drop pages from
> the freelist one by one until all of them are removed from the freelist?

We actually do that, in addition to the more active methods of capturing
the memory that we're about to remove.

You're right, I don't really see a problem with ignoring those pages, at
least in the active migration code.  We would, of course, like to keep
the number of things that we depend on good faith to get migrated to a
minimum, but things under I/O are the least of our problems.

The only thing we might want to do is put something in the rawio code to
look for the PG_capture pages to ensure that it gives the migration code
a shot at them every once in a while (when I/O is not in flight,
obviously).

-- Dave



Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-10 Thread Andrea Arcangeli
On Tue, Feb 08, 2005 at 04:26:26PM +, Hugh Dickins wrote:
> Seems it was okay after all, I got confused by an unrelated issue.
> Here's what I had in mind, what do you think?  Does it indeed help
> with your problem?  I'm copying Andrea because it was he who devised
> that fix to the get_user_pages issue, and also (I think, longer ago)
> can_share_swap_page itself.
> 
> This does rely on us moving 1 from mapcount to swapcount or back only
> while page is locked - there are places (e.g. exit) where mapcount
> comes down without page being locked, but that's not an issue: we just
> have to be sure that once we have mapcount, it can't go up while reading
> swapcount (I've probably changed more than is strictly necessary, but
> this seemed clearest to me).
> 
> I've left out how we ensure swapoff makes progress on a page with high
> mapcount, haven't quite made my mind up about that: it's less of an
> issue now there's an activate_page in unuse_pte, plus it's not an
> issue which will much bother anyone but me, can wait until after.
> 
> That PageAnon check in do_wp_page: seems worthwhile to avoid locking
> and unlocking the page if it's a file page, leaves can_share_swap_page
> simpler (a PageReserved is never PageAnon).  But the patch is against
> 2.6.11-rc3-mm1, so I appear to be breaking the intention of
> do_wp_page_mk_pte_writable ("on a shared-writable page"),
> I believe that's already broken but need to study it more.

The reason pinned pages cannot be unmapped is that if they're being
unmapped while a rawio read is in progress, a cow on some shared
swapcache under I/O could happen.

If a try_to_unmap + COW over a shared swapcache happens during a rawio
read, then the rawio reads will get lost generating data corruption.

I haven't had much time to check the patch yet, but it's quite complex:
could you explain again how you prevent a COW from happening while the
rawio is in flight if you don't pin the pte? The PG_locked bitflag
cannot matter at all because the rawio read won't lock the page. What
you have to prevent is the pte being zeroed out; it must be kept
writeable during the whole I/O. I don't see how you can allow unmapping
without running into COWs later on.

This is the only thing I care to understand really, since it's the only
case that the pte pinning was fixing.

Overall I see nothing wrong in preventing memory removal while rawio is
in flight. rawio cannot be in flight forever (ptrace and core dump as
well should complete eventually). Why can't you simply drop pages from
the freelist one by one until all of them are removed from the freelist?


Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-10 Thread IWAMOTO Toshihiro
At Tue, 8 Feb 2005 16:26:26 + (GMT),
Hugh Dickins wrote:
> 
> On Mon, 7 Feb 2005, Hugh Dickins wrote:
> > On Thu, 3 Feb 2005, IWAMOTO Toshihiro wrote:
> > > The current implementation of memory hotremoval relies on the fact
> > > that pages can be unmapped from process spaces.  After successful
> > > unmapping, subsequent accesses to the pages are blocked and don't
> > > interfere with the hotremoval operation.
> > > 
> > > However, this code
> > > 
> > > if (PageSwapCache(page) &&
> > > page_count(page) != page_mapcount(page) + 2) {
> > > ret = SWAP_FAIL;
> > > goto out_unmap;
> > > }
> > 
> > Yes, that is odd code.  It would be nice to have a solution without it.
> > 
> > > in try_to_unmap_one() prevents unmapping pages that are referenced via
> > > get_user_pages(), and such references can be held for a long time if
> > > they are due to things such as direct IO.
> > > I've made a test program that issues multiple direct IO read requests
> > > against a single read buffer, and pages that belong to the buffer
> > > cannot be hotremoved because they aren't unmapped.
> > 
> > 
> > 
> > I was hoping to append my own patch to this response, but I haven't
> > got it working right yet (swap seems too full).  I hope tomorrow,
> > but thought I'd better not delay these comments any longer.
> 
> Seems it was okay after all, I got confused by an unrelated issue.
> Here's what I had in mind, what do you think?  Does it indeed help
> with your problem?  I'm copying Andrea because it was he who devised
> that fix to the get_user_pages issue, and also (I think, longer ago)
> can_share_swap_page itself.

I've tested with linux-2.6.10-rc2-mm3 + my hotremoval patch, and it
passed hotremoval tests including the direct IO related one.

Thanks.

> --- 2.6.11-rc3-mm1/mm/memory.c  2005-02-05 07:09:40.0 +0000
> +++ linux/mm/memory.c   2005-02-07 19:50:47.0 +0000

--
IWAMOTO Toshihiro


Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-10 Thread Andrea Arcangeli
On Tue, Feb 08, 2005 at 04:26:26PM +0000, Hugh Dickins wrote:
> Seems it was okay after all, I got confused by an unrelated issue.
> Here's what I had in mind, what do you think?  Does it indeed help
> with your problem?  I'm copying Andrea because it was he who devised
> that fix to the get_user_pages issue, and also (I think, longer ago)
> can_share_swap_page itself.
> 
> This does rely on us moving 1 from mapcount to swapcount or back only
> while page is locked - there are places (e.g. exit) where mapcount
> comes down without page being locked, but that's not an issue: we just
> have to be sure that once we have mapcount, it can't go up while reading
> swapcount (I've probably changed more than is strictly necessary, but
> this seemed clearest to me).
> 
> I've left out how we ensure swapoff makes progress on a page with high
> mapcount, haven't quite made my mind up about that: it's less of an
> issue now there's an activate_page in unuse_pte, plus it's not an
> issue which will much bother anyone but me, can wait until after.
> 
> That PageAnon check in do_wp_page: seems worthwhile to avoid locking
> and unlocking the page if it's a file page, leaves can_share_swap_page
> simpler (a PageReserved is never PageAnon).  But the patch is against
> 2.6.11-rc3-mm1, so I appear to be breaking the intention of
> do_wp_page_mk_pte_writable ("on a shared-writable page"),
> believe that's already broken but need to study it more.

The reason pinned pages cannot be unmapped is that if they're being
unmapped while a rawio read is in progress, a cow on some shared
swapcache under I/O could happen.

If a try_to_unmap + COW over a shared swapcache happens during a rawio
read, then the rawio reads will get lost generating data corruption.

I haven't had much time to check the patch yet, but it's quite complex;
could you explain again how you prevent a cow from happening while the
rawio is in flight if you don't pin the pte? The PG_locked bitflag
cannot matter at all because the rawio read won't lock the page. What
you have to prevent is the pte being zeroed out; it must be kept
writeable during the whole I/O. I don't see how you can allow unmapping
without running into COWs later on.

This is the only thing I care to understand really, since it's the only
case that the pte pinning was fixing.

Overall I see nothing wrong in preventing memory removal while rawio is
in flight. rawio cannot be in flight forever (ptrace and core dump as
well should complete eventually). Why can't you simply drop pages from
the freelist one by one until all of them are removed from the freelist?


Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-10 Thread Dave Hansen
On Thu, 2005-02-10 at 20:05 +0100, Andrea Arcangeli wrote:
> Overall I see nothing wrong in preventing memory removal while rawio is
> in flight. rawio cannot be in flight forever (ptrace and core dump as
> well should complete eventually). Why can't you simply drop pages from
> the freelist one by one until all of them are removed from the freelist?

We actually do that, in addition to the more active methods of capturing
the memory that we're about to remove.

You're right, I don't really see a problem with ignoring those pages, at
least in the active migration code.  We would, of course, like to keep
the number of things that we depend on good faith to get migrated to a
minimum, but things under I/O are the least of our problems.

The only thing we might want to do is put something in the rawio code to
look for the PG_capture pages to ensure that it gives the migration code
a shot at them every once in a while (when I/O is not in flight,
obviously).

-- Dave



Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-10 Thread Andrea Arcangeli
On Thu, Feb 10, 2005 at 11:16:44AM -0800, Dave Hansen wrote:
> We actually do that, in addition to the more active methods of capturing
> the memory that we're about to remove.

This sounds the best way to go. btw, is this a different codebase from
the one that Toshihiro is testing?

> You're right, I don't really see a problem with ignoring those pages, at
> least in the active migration code.  We would, of course, like to keep
> the number of things that we depend on good faith to get migrated to a
> minimum, but things under I/O are the least of our problems.

Indeed. It's very similar to locked pages. All pages can be pinned for a
transient amount of time, either in the pte or with
PG_pinned/PG_writeback (now perhaps Hugh found a way to drop the pin on
the pte [I'm still unconvinced about that], but sure you're left with
transient pinning with PG_locked or PG_writeback).

> The only thing we might want to do is put something in the rawio code to
> look for the PG_capture pages to ensure that it gives the migration code
> a shot at them every once in a while (when I/O is not in flight,
> obviously).

If there are persistent usages PG_capture sounds a good idea. Perhaps
the whole point that Toshihiro has problem with is that there are really
persistent users that require PG_capture? He mentioned direct IO, that's
not a long time, if something core dump could be a long time.


Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-10 Thread Hugh Dickins
On Thu, 10 Feb 2005, Andrea Arcangeli wrote:
> 
> The reason pinned pages cannot be unmapped is that if they're being
> unmapped while a rawio read is in progress, a cow on some shared
> swapcache under I/O could happen.
> 
> If a try_to_unmap + COW over a shared swapcache happens during a rawio
> read, then the rawio reads will get lost generating data corruption.

Yes, but...

get_user_pages broke COW in advance of the I/O.  The reason for
subsequent COW if the page gets unmapped is precisely because
can_share_swap_page used page_count to judge exclusivity, and
get_user_pages has bumped that up, so the page appears (in danger
of being) non-exclusive when actually it is exclusive.  By changing
can_share_swap_page to use page_mapcount, that frustration vanishes.

(Of course, if the process forks while async I/O is in progress,
these I/O pages could still get COWed, and either parent or child
lose some of the data being read - but behaviour in that case is
anyway undefined, isn't it?  Just so long as kernel doesn't crash.)

> I haven't had much time to check the patch yet, but it's quite complex;
> could you explain again how you prevent a cow from happening while the
> rawio is in flight if you don't pin the pte? The PG_locked bitflag
> cannot matter at all because the rawio read won't lock the page. What
> you have to prevent is the pte being zeroed out; it must be kept
> writeable during the whole I/O. I don't see how you can allow unmapping
> without running into COWs later on.

The page_mapcount+page_swapcount test for exclusivity is simply a
better test for exclusivity than the page_count test.  Since it says
exclusive when the page is exclusive, no COW occurs (and I've just
remembered that we're actually talking of the odd case where the
process itself writes into another portion of the page under I/O,
to get that write fault: but that still works).

page_count does still play a vital role, in ensuring that the page
stays held in the swapcache, cannot actually be reused for something
else while it's unmapped.

> This is the only thing I care to understand really, since it's the only
> case that the pte pinning was fixing.
> 
> Overall I see nothing wrong in preventing memory removal while rawio is
> in flight. rawio cannot be in flight forever (ptrace and core dump as
> well should complete eventually). Why can't you simply drop pages from
> the freelist one by one until all of them are removed from the freelist?

I did ask Toshihiro about that earlier in the thread.  Best look up his
actual reply, but as I understand it, his test is sufficiently thorough
that it's actually doing direct I/Os in parallel, so there's never a
moment when the page is free.  He believes that an unprivileged process
should not be allowed to hold pages indefinitely unmigratable.

I have little appreciation of the hotplug/memmigration issues here.
But whatever the merits of that, I do think the new can_share_swap_page
is an improvement over the old (written in the days before page_mapcount
existed), and that it's preferable to remove the page_count heuristic
from try_to_unmap_one.

Hugh


Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-10 Thread Andrea Arcangeli
On Thu, Feb 10, 2005 at 08:19:47PM +0000, Hugh Dickins wrote:
> On Thu, 10 Feb 2005, Andrea Arcangeli wrote:
> > 
> > The reason pinned pages cannot be unmapped is that if they're being
> > unmapped while a rawio read is in progress, a cow on some shared
> > swapcache under I/O could happen.
> > 
> > If a try_to_unmap + COW over a shared swapcache happens during a rawio
> > read, then the rawio reads will get lost generating data corruption.
> 
> Yes, but...
> 
> get_user_pages broke COW in advance of the I/O.  The reason for
> subsequent COW if the page gets unmapped is precisely because
> can_share_swap_page used page_count to judge exclusivity, and
> get_user_pages has bumped that up, so the page appears (in danger
> of being) non-exclusive when actually it is exclusive.  By changing
> can_share_swap_page to use page_mapcount, that frustration vanishes.

What if a second thread does a fork() while the rawio is in progress?
The page will again not be shareable, and if you unmap it the rawio will
be corrupted if the thread that executed the fork while the I/O is in
progress writes to a part of the page again after it has been unmapped
(obviously the part of the page that isn't under read-rawio; rawio works
with the hardblocksize, not on the whole page).


Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-10 Thread Hugh Dickins
On Thu, 10 Feb 2005, Andrea Arcangeli wrote:
> On Thu, Feb 10, 2005 at 08:19:47PM +0000, Hugh Dickins wrote:
> > 
> > get_user_pages broke COW in advance of the I/O.  The reason for
> > subsequent COW if the page gets unmapped is precisely because
> > can_share_swap_page used page_count to judge exclusivity, and
> > get_user_pages has bumped that up, so the page appears (in danger
> > of being) non-exclusive when actually it is exclusive.  By changing
> > can_share_swap_page to use page_mapcount, that frustration vanishes.
> 
> What if a second thread does a fork() while the rawio is in progress?
> The page will again not be shareable, and if you unmap it the rawio will
> be corrupted if the thread that executed the fork while the I/O is in
> progress writes to a part of the page again after it has been unmapped
> (obviously the part of the page that isn't under read-rawio; rawio works
> with the hardblocksize, not on the whole page).

I think there's no difference here between the two techniques.

With the new can_share_swap_page, yes, page_swapcount goes up with
the fork, the page will be judged non-exclusive, a user write to
another part of the page will replace it by a copy page, and it's
undefined how much of the I/O will be captured in the copy.

But even with the old can_share_swap_page, and try_to_unmap_one
refusing to unmap the page, copy_page_range's COW checking would
still remove write permission from the pte of an anonymous
page, so a user write to another part of the page would find
raised page_count and replace it by a copy page: same behaviour.

And it's fine for the behaviour to be somewhat undefined in this
peculiar case: the important thing is just that the page must not
be freed and reused while I/O occurs, hence get_user_page raising
the page_count - which I'm _not_ proposing to change!

Hugh


Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-09 Thread IWAMOTO Toshihiro
At Mon, 7 Feb 2005 21:24:59 +0000 (GMT),
Hugh Dickins wrote:
> 
> On Thu, 3 Feb 2005, IWAMOTO Toshihiro wrote:
> > The current implementation of memory hotremoval relies on that pages
> > can be unmapped from process spaces.  After successful unmapping,
> > subsequent accesses to the pages are blocked and don't interfere
> > the hotremoval operation.
> > 
> > However, this code
> > 
> > if (PageSwapCache(page) &&
> > page_count(page) != page_mapcount(page) + 2) {
> > ret = SWAP_FAIL;
> > goto out_unmap;
> > }
> 
> Yes, that is odd code.  It would be nice to have a solution without it.
> 
> > in try_to_unmap_one() prevents unmapping pages that are referenced via
> > get_user_pages(), and such references can be held for a long time if
> > they are due to such as direct IO.
> > I've made a test program that issues multiple direct IO read requests
> > against a single read buffer, and pages that belong to the buffer
> > cannot be hotremoved because they aren't unmapped.
> 
> I haven't looked at the rest of your hotremoval, so it's not obvious
> to me how a change here would help you - obviously you wouldn't want
> to be migrating pages while direct IO to them was in progress.
> 
> I presume your patch works for you by letting the page count fall
> to a point where migration moves it automatically as soon as the
> got_user_pages are put, where without your patch the count is held
> too high, and you keep doing scans which tend to miss the window
> in which those pages are put?

Yes.  And my test program has no such time window because IO requests
are issued without waiting for completion of older requests.
I think issuing IO requests in such a manner is nonsense, but
a misbehaving process shouldn't be able to prevent memory hotremoval.

If the hotremoval code can unmap a page from a process space, the
process will be blocked when it causes a page fault against the page.
So, a process cannot continuously issue direct IO requests to keep
page counts high.  (get_user_pages() causes a (simulated) page fault.)


> >   - The can_share_swap_page() call in do_swap_page() always returns
> > false.  It is inefficient but should be harmless.  Incrementing
> > page_mapcount before calling that function should fix the problem,
> > but it may cause bad side effects.
> 
> Odd that your patch moves it if it now doesn't even work!
> But I think some more movement should be able to solve that.

Moving swap_free() is necessary to avoid can_share_swap_page()
returning true when it shouldn't.  And, write access cases are handled
later with do_wp_page() call, anyway.

> >   - Another obvious solution to this issue is to find the "offending"
> > process from a un-unmappable page and suspend it until the page is
> > unmapped.  I'm afraid the implementation would be much more complicated.
> 
> Agreed, let's not get into that.

Nice to hear that.

> > --- mm/memory.c.orig2005-01-17 14:47:11.0 +0900
> > +++ mm/memory.c 2005-01-17 14:55:51.0 +0900
> > @@ -1786,10 +1786,6 @@ static int do_swap_page(struct mm_struct
> > }
> >  
> > /* The page isn't present yet, go ahead with the fault. */
> > -   
> > -   swap_free(entry);
> > -   if (vm_swap_full())
> > -   remove_exclusive_swap_page(page);
> >  
> > mm->rss++;
> > acct_update_integrals();
> > @@ -1800,6 +1796,10 @@ static int do_swap_page(struct mm_struct
> > pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > write_access = 0;
> > }
> > +   
> > +   swap_free(entry);
> > +   if (vm_swap_full())
> > +   remove_exclusive_swap_page(page);
> > unlock_page(page);
> >  
> > flush_icache_page(vma, page);
> > --- mm/rmap.c.orig  2005-01-17 14:40:08.0 +0900
> > +++ mm/rmap.c   2005-01-21 12:34:06.0 +0900
> > @@ -569,8 +569,11 @@ static int try_to_unmap_one(struct page 
> >  */
> > if (PageSwapCache(page) &&
> > page_count(page) != page_mapcount(page) + 2) {
> > -   ret = SWAP_FAIL;
> > -   goto out_unmap;
> > +   if (page_mapcount(page) > 1) {  /* happens when COW is in 
> > progress */
> > +   ret = SWAP_FAIL;
> > +   goto out_unmap;
> > +   }
> > +   printk("unmapping GUPed page\n");
> > }
> >  
> > /* Nuke the page table entry. */
> 
> I'm disappointed to see you making this more complicated, it's
> even harder to understand than before.  I believe that if we're
> going to make good use of page_mapcount in can_share_swap_page,
> it should be possible to delete this block from try_to_unmap_one
> altogether.  Or did you try that, and find it goes wrong?

I just wanted to be conservative to get a working patch.
I think this block can be deleted.

> > --- mm/swapfile.c.orig  2005-01-17 14:47:11.0 +0900
> > +++ mm/swapfile.c   2005-01-31 16:59:11.0 +0900

> This can_share_swap_page looks messier than I'd want.
> 
> I was hoping to append my own patch to this response, but I haven't
> got it working right yet (swap seems too full).  I hope tomorrow,
> but 


Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-08 Thread Hugh Dickins
On Mon, 7 Feb 2005, Hugh Dickins wrote:
> On Thu, 3 Feb 2005, IWAMOTO Toshihiro wrote:
> > The current implementation of memory hotremoval relies on that pages
> > can be unmapped from process spaces.  After successful unmapping,
> > subsequent accesses to the pages are blocked and don't interfere
> > the hotremoval operation.
> > 
> > However, this code
> > 
> > if (PageSwapCache(page) &&
> > page_count(page) != page_mapcount(page) + 2) {
> > ret = SWAP_FAIL;
> > goto out_unmap;
> > }
> 
> Yes, that is odd code.  It would be nice to have a solution without it.
> 
> > in try_to_unmap_one() prevents unmapping pages that are referenced via
> > get_user_pages(), and such references can be held for a long time if
> > they are due to such as direct IO.
> > I've made a test program that issues multiple direct IO read requests
> > against a single read buffer, and pages that belong to the buffer
> > cannot be hotremoved because they aren't unmapped.
> 
> 
> 
> I was hoping to append my own patch to this response, but I haven't
> got it working right yet (swap seems too full).  I hope tomorrow,
> but thought I'd better not delay these comments any longer.

Seems it was okay after all, I got confused by an unrelated issue.
Here's what I had in mind, what do you think?  Does it indeed help
with your problem?  I'm copying Andrea because it was he who devised
that fix to the get_user_pages issue, and also (I think, longer ago)
can_share_swap_page itself.

This does rely on us moving 1 from mapcount to swapcount or back only
while page is locked - there are places (e.g. exit) where mapcount
comes down without page being locked, but that's not an issue: we just
have to be sure that once we have mapcount, it can't go up while reading
swapcount (I've probably changed more than is strictly necessary, but
this seemed clearest to me).

I've left out how we ensure swapoff makes progress on a page with high
mapcount, haven't quite made my mind up about that: it's less of an
issue now there's an activate_page in unuse_pte, plus it's not an
issue which will much bother anyone but me, can wait until after.

That PageAnon check in do_wp_page: seems worthwhile to avoid locking
and unlocking the page if it's a file page, leaves can_share_swap_page
simpler (a PageReserved is never PageAnon).  But the patch is against
2.6.11-rc3-mm1, so I appear to be breaking the intention of
do_wp_page_mk_pte_writable ("on a shared-writable page"),
believe that's already broken but need to study it more.

Hugh

--- 2.6.11-rc3-mm1/mm/memory.c  2005-02-05 07:09:40.0 +0000
+++ linux/mm/memory.c   2005-02-07 19:50:47.0 +0000
@@ -1339,7 +1339,7 @@ static int do_wp_page(struct mm_struct *
}
old_page = pfn_to_page(pfn);
 
-   if (!TestSetPageLocked(old_page)) {
+   if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
int reuse = can_share_swap_page(old_page);
unlock_page(old_page);
if (reuse) {
@@ -1779,10 +1779,6 @@ static int do_swap_page(struct mm_struct
}
 
/* The page isn't present yet, go ahead with the fault. */
-   
-   swap_free(entry);
-   if (vm_swap_full())
-   remove_exclusive_swap_page(page);
 
mm->rss++;
pte = mk_pte(page, vma->vm_page_prot);
@@ -1790,12 +1786,16 @@ static int do_swap_page(struct mm_struct
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
write_access = 0;
}
-   unlock_page(page);
 
flush_icache_page(vma, page);
set_pte(page_table, pte);
page_add_anon_rmap(page, vma, address);
 
+   swap_free(entry);
+   if (vm_swap_full())
+   remove_exclusive_swap_page(page);
+   unlock_page(page);
+
if (write_access) {
if (do_wp_page(mm, vma, address,
page_table, pmd, pte) == VM_FAULT_OOM)
--- 2.6.11-rc3-mm1/mm/rmap.c  2005-02-05 07:09:40.0 +0000
+++ linux/mm/rmap.c 2005-02-07 12:59:21.0 +0000
@@ -551,27 +551,6 @@ static int try_to_unmap_one(struct page 
goto out_unmap;
}
 
-   /*
-* Don't pull an anonymous page out from under get_user_pages.
-* GUP carefully breaks COW and raises page count (while holding
-* page_table_lock, as we have here) to make sure that the page
-* cannot be freed.  If we unmap that page here, a user write
-* access to the virtual address will bring back the page, but
-* its raised count will (ironically) be taken to mean it's not
-* an exclusive swap page, do_wp_page will replace it by a copy
-* page, and the user never get to see the data GUP was holding
-* the original page for.
-*
-* This test is also useful for when swapoff (unuse_process) has
-* to drop page lock: its reference to the page stops existing
-   

Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-08 Thread Hugh Dickins
On Mon, 7 Feb 2005, Hugh Dickins wrote:
 On Thu, 3 Feb 2005, IWAMOTO Toshihiro wrote:
  The current implementation of memory hotremoval relies on that pages
  can be unmapped from process spaces.  After successful unmapping,
  subsequent accesses to the pages are blocked and don't interfere
  the hotremoval operation.
  
  However, this code
  
  if (PageSwapCache(page) 
  page_count(page) != page_mapcount(page) + 2) {
  ret = SWAP_FAIL;
  goto out_unmap;
  }
 
 Yes, that is odd code.  It would be nice to have a solution without it.
 
  in try_to_unmap_one() prevents unmapping pages that are referenced via
  get_user_pages(), and such references can be held for a long time if
  they are due to such as direct IO.
  I've made a test program that issues multiple direct IO read requests
  against a single read buffer, and pages that belong to the buffer
  cannot be hotremoved because they aren't unmapped.
 
 
 
 I was hoping to append my own patch to this response, but I haven't
 got it working right yet (swap seems too full).  I hope tomorrow,
 but thought I'd better not delay these comments any longer.

Seems it was okay after all, I got confused by an unrelated issue.
Here's what I had in mind, what do you think?  Does it indeed help
with your problem?  I'm copying Andrea because it was he who devised
that fix to the get_user_pages issue, and also (I think, longer ago)
can_share_swap_page itself.

This does rely on us moving 1 from mapcount to swapcount or back only
while page is locked - there are places (e.g. exit) where mapcount
comes down without page being locked, but that's not an issue: we just
have to be sure that once we have mapcount, it can't go up while reading
swapcount (I've probably changed more than is strictly necessary, but
this seemed clearest to me).

I've left out how we ensure swapoff makes progress on a page with high
mapcount, haven't quite made my mind out about that: it's less of an
issue now there's an activate_page in unuse_pte, plus it's not an
issue which will much bother anyone but me, can wait until after.

That PageAnon check in do_wp_page: seems worthwhile to avoid locking
and unlocking the page if it's a file page, leaves can_share_swap_page
simpler (a PageReserved is never PageAnon).  But the patch is against
2.6.11-rc3-mm1, so I appear to be breaking the intention of
do_wp_page_mk_pte_writable (on a shared-writable page),
I believe that's already broken but need to study it more.

Hugh

--- 2.6.11-rc3-mm1/mm/memory.c  2005-02-05 07:09:40.0 +
+++ linux/mm/memory.c   2005-02-07 19:50:47.0 +
@@ -1339,7 +1339,7 @@ static int do_wp_page(struct mm_struct *
}
old_page = pfn_to_page(pfn);
 
-   if (!TestSetPageLocked(old_page)) {
+   if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
int reuse = can_share_swap_page(old_page);
unlock_page(old_page);
if (reuse) {
@@ -1779,10 +1779,6 @@ static int do_swap_page(struct mm_struct
}
 
/* The page isn't present yet, go ahead with the fault. */
-   
-   swap_free(entry);
-   if (vm_swap_full())
-   remove_exclusive_swap_page(page);
 
	mm->rss++;
	pte = mk_pte(page, vma->vm_page_prot);
@@ -1790,12 +1786,16 @@ static int do_swap_page(struct mm_struct
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
write_access = 0;
}
-   unlock_page(page);
 
flush_icache_page(vma, page);
set_pte(page_table, pte);
page_add_anon_rmap(page, vma, address);
 
+   swap_free(entry);
+   if (vm_swap_full())
+   remove_exclusive_swap_page(page);
+   unlock_page(page);
+
if (write_access) {
if (do_wp_page(mm, vma, address,
page_table, pmd, pte) == VM_FAULT_OOM)
--- 2.6.11-rc3-mm1/mm/rmap.c2005-02-05 07:09:40.0 +
+++ linux/mm/rmap.c 2005-02-07 12:59:21.0 +
@@ -551,27 +551,6 @@ static int try_to_unmap_one(struct page 
goto out_unmap;
}
 
-   /*
-* Don't pull an anonymous page out from under get_user_pages.
-* GUP carefully breaks COW and raises page count (while holding
-* page_table_lock, as we have here) to make sure that the page
-* cannot be freed.  If we unmap that page here, a user write
-* access to the virtual address will bring back the page, but
-* its raised count will (ironically) be taken to mean it's not
-* an exclusive swap page, do_wp_page will replace it by a copy
-* page, and the user never get to see the data GUP was holding
-* the original page for.
-*
-* This test is also useful for when swapoff (unuse_process) has
-* to drop page lock: its reference to the page stops existing
-* ptes from being unmapped, so swapoff can make 

Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-07 Thread Hugh Dickins
On Thu, 3 Feb 2005, IWAMOTO Toshihiro wrote:
> The current implementation of memory hotremoval relies on the fact that
> pages can be unmapped from process spaces.  After successful unmapping,
> subsequent accesses to the pages are blocked and don't interfere with
> the hotremoval operation.
> 
> However, this code
> 
> if (PageSwapCache(page) &&
> page_count(page) != page_mapcount(page) + 2) {
> ret = SWAP_FAIL;
> goto out_unmap;
> }

Yes, that is odd code.  It would be nice to have a solution without it.

> in try_to_unmap_one() prevents unmapping pages that are referenced via
> get_user_pages(), and such references can be held for a long time if
> they are due to things such as direct IO.
> I've made a test program that issues multiple direct IO read requests
> against a single read buffer, and pages that belong to the buffer
> cannot be hotremoved because they aren't unmapped.

I haven't looked at the rest of your hotremoval, so it's not obvious
to me how a change here would help you - obviously you wouldn't want
to be migrating pages while direct IO to them was in progress.

I presume your patch works for you by letting the page count fall
to a point where migration moves it automatically as soon as the
got_user_pages are put, where without your patch the count is held
too high, and you keep doing scans which tend to miss the window
in which those pages are put?

> The following patch, which is against linux-2.6.11-rc1-mm1 and also
> tested with linux-2.6.11-rc2-mm2, fixes this issue.  The purpose of
> this patch is to be able to unmap pages that have incremented
> page_count.  To do that consistently, the COW detection logic needs to
> be modified not to rely on page_count.  I'm aware that such
> extensive use of page_mapcount is discouraged and there is a plan to
> kill page_mapcount (*), but I cannot think of a better alternative
> solution.
> 
> (*) c.f. http://www.ussg.iu.edu/hypermail/linux/kernel/0406.0/0483.html

I apologize for scaring you off page mapcount.  I have no current
plans to scrap it, and feel a lot more satisfied with it than at the
time of that comment.  Partly because it's now manipulated atomically
rather than under bitspin lock.  Partly because I realize that although
64-bit systems are overdue for an atomic64 page count and page mapcount,
we can actually just use one atomic64 for them both, keeping, say, lower
24 bits for count and upper 40 for mapcount (and not repeating mapcount
in count on these arches), so mapcount won't increase struct page size.

Go right ahead and use page mapcount if it's appropriate.

> Some notes about my code:
> 
>   - I think it's safe to rely on page_mapcount in do_swap_page(),
> because its use is protected by lock_page().

I think so too.

>   - The can_share_swap_page() call in do_swap_page() always returns
> false.  It is inefficient but should be harmless.  Incrementing
> page_mapcount before calling that function should fix the problem,
> but it may cause bad side effects.

Odd that your patch moves it if it now doesn't even work!
But I think some more movement should be able to solve that.

>   - Another obvious solution to this issue is to find the "offending"
> process from an un-unmappable page and suspend it until the page is
> unmapped.  I'm afraid the implementation would be much more complicated.

Agreed, let's not get into that.

>   - I could not test the following situation.  It should be possible
> to write some kernel code to do that, but please let me know if
> you know any such test cases.
> - A page_count is incremented by get_user_pages().
> - The page gets unmapped.
> - The process causes a write fault for the page, before the
>   incremented page_count is dropped.

I confess I don't have such a test case ready myself.

> Also, while I've tried carefully not to make mistakes and done some
> testing, I'm not very sure this is bug free.  Please comment.
> 
> --- mm/memory.c.orig  2005-01-17 14:47:11.0 +0900
> +++ mm/memory.c   2005-01-17 14:55:51.0 +0900
> @@ -1786,10 +1786,6 @@ static int do_swap_page(struct mm_struct
>   }
>  
>   /* The page isn't present yet, go ahead with the fault. */
> - 
> - swap_free(entry);
> - if (vm_swap_full())
> - remove_exclusive_swap_page(page);
>  
>   mm->rss++;
>   acct_update_integrals();
> @@ -1800,6 +1796,10 @@ static int do_swap_page(struct mm_struct
>   pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>   write_access = 0;
>   }
> + 
> + swap_free(entry);
> + if (vm_swap_full())
> + remove_exclusive_swap_page(page);
>   unlock_page(page);
>  
>   flush_icache_page(vma, page);
> --- mm/rmap.c.orig2005-01-17 14:40:08.0 +0900
> +++ mm/rmap.c 2005-01-21 12:34:06.0 +0900
> @@ -569,8 +569,11 @@ static int try_to_unmap_one(struct page 
>*/
>  
