Re: [PATCH] mmu notifiers #v2

2008-01-17 Thread Andrea Arcangeli
On Thu, Jan 17, 2008 at 08:21:16PM +0200, Izik Eidus wrote:
> ooh I like it, this is a clever solution, and I guess the cost of the
> vmexits won't be too high if it is not too aggressive

Yes, and especially during swapping the system isn't usually CPU
bound. The idea is to pay with some vmexit minor faults when the CPU
tends to be idle, to reduce the number of swapouts. I say swapouts and
not swapins because it will mostly help avoid writing swapcache out
to disk for no good reason. Swapins already have a chance not to
generate any read I/O if the removed spte is really hot.

To make this work we still need notification from the VM about memory
pressure, and perhaps the slab shrinker method is enough even if it has
a coarse granularity. Freeing sptes during memory pressure also
converges with the objective of releasing pinned slab memory so that the
spte cache can grow more freely (the 4k-PAGE_SIZE, 0-order-page
defrag philosophy will also appreciate that). There are lots
of details to figure out for a good implementation, but the
basic idea converges on two fairly important fronts.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mmu notifiers #v2

2008-01-17 Thread Izik Eidus

Andrea Arcangeli wrote:
> On Wed, Jan 16, 2008 at 07:48:06PM +0200, Izik Eidus wrote:
>> Rik van Riel wrote:
>>> On Sun, 13 Jan 2008 17:24:18 +0100
>>> Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
>>>
>>>> In my basic initial patch I only track the tlb flushes which should be
>>>> the minimum required to have a nice linux-VM controlled swapping
>>>> behavior of the KVM gphysical memory.
>>>
>>> I have a vaguely related question on KVM swapping.
>>>
>>> Do page accesses inside KVM guests get propagated to the host
>>> OS, so Linux can choose a reasonable page for eviction, or is
>>> the pageout of KVM guest pages essentially random?
>
> Right, selection of the guest OS pages to swap is partly random but
> wait: _only_ for the long-cached and hot spte entries. It's certainly
> not entirely random.
>
> As the shadow-cache is a bit dynamic, every newly instantiated spte will
> refresh the PG_referenced bit in follow_page already (through minor
> faults). Not-present faults on swapped-out sptes can trigger minor
> faults from swapcache too, and they'll refresh young regular ptes.
>
>> right now when kvm removes a pte from the shadow cache, it marks the
>> page that this pte pointed to as accessed.
>
> Yes: the referenced bit in the mmu-notifier invalidate case isn't
> useful because it's set right before freeing the page.
>
>> it was a good solution until the mmu notifiers, because the pages were
>> pinned and couldn't be swapped to disk
>
> It probably still makes sense for sptes removed for other
> reasons (not mmu notifier invalidates).

agree

>> so now it will have to do something more sophisticated, or at least
>> mark as accessed every page pointed to by a pte that gets inserted
>> into the shadow cache
>
> I think that should already be the case, see the mark_page_accessed in
> follow_page; FOLL_TOUCH is set, isn't it?

yes, you are right, FOLL_TOUCH is set.

> The only thing we clearly miss is logic that refreshes the
> PG_referenced bitflag for "hot" sptes that remain instantiated and
> cached for a long time. For regular linux ptes this is done by the cpu
> through the young bitflag. But note that not all architectures have
> young bitflag support in hardware! So I suppose the swapping of
> the KVM task is like the swapping of any other task, but on an alpha
> CPU. It works well enough in practice even if we clearly have room for
> further optimizations in this area (as there would be on archs w/o
> the young bit updated in hardware too).
>
> To refresh the PG_referenced bit for long lived hot sptes, I think the
> easiest solution is to chain the sptes in a lru, and to start dropping
> them when memory pressure starts. We could drop one spte every X pages
> collected by the VM. So the "age" time factor depends on the VM
> velocity, and we totally avoid useless shadow page faults when there's
> no VM pressure. When VM pressure increases, the kvm non-present fault
> will then take care of refreshing the PG_referenced bit. This should
> solve the aging issue for long lived and hot sptes, and should
> improve the responsiveness of the guest OS during "initial" swap
> pressure (after the initial swap pressure, the working set finds
> itself in ram again). So it should avoid some unnecessary
> swapout/swapin jitter during the initial swap. I see this mostly as a
> kvm internal optimization, not strictly related to the mmu notifiers
> though.

ooh I like it, this is a clever solution, and I guess the cost of the
vmexits won't be too high if it is not too aggressive.





Re: [PATCH] mmu notifiers #v2

2008-01-17 Thread Andrea Arcangeli
On Wed, Jan 16, 2008 at 07:48:06PM +0200, Izik Eidus wrote:
> Rik van Riel wrote:
>> On Sun, 13 Jan 2008 17:24:18 +0100
>> Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
>>
>>   
>>> In my basic initial patch I only track the tlb flushes which should be
>>> the minimum required to have a nice linux-VM controlled swapping
>>> behavior of the KVM gphysical memory. 
>>
>> I have a vaguely related question on KVM swapping.
>>
>> Do page accesses inside KVM guests get propagated to the host
>> OS, so Linux can choose a reasonable page for eviction, or is
>> the pageout of KVM guest pages essentially random?

Right, selection of the guest OS pages to swap is partly random but
wait: _only_ for the long-cached and hot spte entries. It's certainly
not entirely random.

As the shadow-cache is a bit dynamic, every newly instantiated spte will
refresh the PG_referenced bit in follow_page already (through minor
faults). Not-present faults on swapped-out sptes can trigger minor
faults from swapcache too, and they'll refresh young regular ptes.

> right now when kvm removes a pte from the shadow cache, it marks the
> page that this pte pointed to as accessed.

Yes: the referenced bit in the mmu-notifier invalidate case isn't
useful because it's set right before freeing the page.

> it was a good solution until the mmu notifiers, because the pages were
> pinned and couldn't be swapped to disk

It probably still makes sense for sptes removed for other
reasons (not mmu notifier invalidates).

> so now it will have to do something more sophisticated, or at least
> mark as accessed every page pointed to by a pte that gets inserted
> into the shadow cache

I think that should already be the case, see the mark_page_accessed in
follow_page; FOLL_TOUCH is set, isn't it?

The only thing we clearly miss is logic that refreshes the
PG_referenced bitflag for "hot" sptes that remain instantiated and
cached for a long time. For regular linux ptes this is done by the cpu
through the young bitflag. But note that not all architectures have
young bitflag support in hardware! So I suppose the swapping of
the KVM task is like the swapping of any other task, but on an alpha
CPU. It works well enough in practice even if we clearly have room for
further optimizations in this area (as there would be on archs w/o
the young bit updated in hardware too).

To refresh the PG_referenced bit for long lived hot sptes, I think the
easiest solution is to chain the sptes in a lru, and to start dropping
them when memory pressure starts. We could drop one spte every X pages
collected by the VM. So the "age" time factor depends on the VM
velocity, and we totally avoid useless shadow page faults when there's
no VM pressure. When VM pressure increases, the kvm non-present fault
will then take care of refreshing the PG_referenced bit. This should
solve the aging issue for long lived and hot sptes, and should
improve the responsiveness of the guest OS during "initial" swap
pressure (after the initial swap pressure, the working set finds
itself in ram again). So it should avoid some unnecessary
swapout/swapin jitter during the initial swap. I see this mostly as a
kvm internal optimization, not strictly related to the mmu notifiers
though.


Re: [PATCH] mmu notifiers #v2

2008-01-16 Thread Izik Eidus

Rik van Riel wrote:
> On Sun, 13 Jan 2008 17:24:18 +0100
> Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
>
>> In my basic initial patch I only track the tlb flushes which should be
>> the minimum required to have a nice linux-VM controlled swapping
>> behavior of the KVM gphysical memory.
>
> I have a vaguely related question on KVM swapping.
>
> Do page accesses inside KVM guests get propagated to the host
> OS, so Linux can choose a reasonable page for eviction, or is
> the pageout of KVM guest pages essentially random?

right now when kvm removes a pte from the shadow cache, it marks the
page that this pte pointed to as accessed.
it was a good solution until the mmu notifiers, because the pages were
pinned and couldn't be swapped to disk,
so now it will have to do something more sophisticated, or at least
mark as accessed every page pointed to by a pte
that gets inserted into the shadow cache.



Re: [PATCH] mmu notifiers #v2

2008-01-16 Thread Rik van Riel
On Sun, 13 Jan 2008 17:24:18 +0100
Andrea Arcangeli <[EMAIL PROTECTED]> wrote:

> In my basic initial patch I only track the tlb flushes which should be
> the minimum required to have a nice linux-VM controlled swapping
> behavior of the KVM gphysical memory. 

I have a vaguely related question on KVM swapping.

Do page accesses inside KVM guests get propagated to the host
OS, so Linux can choose a reasonable page for eviction, or is
the pageout of KVM guest pages essentially random?

-- 
All rights reversed.


Re: [PATCH] mmu notifiers #v2

2008-01-16 Thread Andrea Arcangeli
On Wed, Jan 16, 2008 at 10:01:32AM +0100, Brice Goglin wrote:
> One of the differences with my patch is that you attach the notifier list to
> the mm_struct while my code attached it to vmas. But I now don't think it
> was such a good idea, since it probably didn't reduce the number of notifier
> calls much.

Thanks for raising this topic.

Notably KVM also would be a bit more optimal with the notifier in the
vma and that was the original implementation too. It's not a sure
thing that it has to be in the mm.

The quadrics patch does a mixture, it attaches it to the mm but then
it pretends to pass the vma down to the method, and it's broken doing
so, like during munmap where it passes the first vma being unmapped
but not all the later ones in the munmap range.

If we want to attach it to the vma, I think the vma should be passed
as a parameter instead of the mm. In some places like
apply_to_page_range the vma isn't even available, and I found it a little
dirty to run a find_vma inside a #ifdef CONFIG_MMU_NOTIFIER.

The only interesting thing about the vma would be the protection
bits, for things like update_range in the quadrics patch where they
prefetch their secondary tlb. But again, if we want to do that, we need
to hook inside unmap_vmas and to pass all the different vmas, not
just the first one touched by unmap_vmas. unmap_vmas is _plural_ not
singular ;).

In the end, attaching to the mm avoided solving all the above troubles
and provided a straightforward implementation where I only needed a
single call to mmu_notifier_register, with other minor advantages like
that and not much downside.

But certainly the mm vs vma decision wasn't trivial (I switched back
and forth a few times from vma to mm and back), and if people think
this should be in the vma I can try again, but it won't be as
straightforward a patch as for the mm.

One benefit, for example, is that it could go in the memslot, and
the notifier->memslot conversion would then be just a
container_of instead of a "search" over the memslots. Locking aside.

> Also, one thing that I looked at in vmaspy was notifying fork. I am not 
> sure what happens on Copy-on-write with your code, but for sure C-o-w is 
> problematic for shadow page tables. I thought shadow pages should just be 
> invalidated when a fork happens and the caller would refill them after 
> forcing C-o-w or so. So adding a notifier call there too might be nice.

There can't be any cows right now in KVM VM backing store, that's why
it's enough to get full swapping working fine. For example I think
we'll need to add more notifiers to handle swapping of MAP_PRIVATE non
linear tmpfs shared pages properly (and it won't be an issue with
fork() but with after the fact sharing).

Right now I'm more interested in the interface, for the invalidates,
things like mm vs vma, the places where we hook under pte spinlock,
things like that, then the patch can hopefully be merged and extended
with more methods like ->change_protection_page/range and added to cow
etc...


Re: [PATCH] mmu notifiers #v2

2008-01-16 Thread Brice Goglin

Andrea Arcangeli wrote:

This patch is last version of a basic implementation of the mmu
notifiers.

In short when the linux VM decides to free a page, it will unmap it
from the linux pagetables. However when a page is mapped not just by
the regular linux ptes, but also from the shadow pagetables, it's
currently unfreeable by the linux VM.

This patch allows the shadow pagetables to be dropped and the page to
be freed after that, if the linux VM decides to unmap the page from
the main ptes because it wants to swap out the page.

[...]

Comments welcome... especially from SGI/IBM/Quadrics and all other
potential users of this functionality.
  


For HPC, this should be very interesting. Managing the registration 
cache of high-speed networks from user-space is a huge mess. This 
approach should help a lot. In fact, back in 2004, I implemented 
something similar called vmaspy to update the regcache of Myrinet 
drivers. I never submitted any patch because Infiniband would have been 
the only user in the mainline kernel and they were reluctant to adopt 
these ideas [1]. In the meantime, some of them apparently changed their 
minds, since they implemented some vmops-overriding hack to do something 
similar [2]. This patch should simplify all this.


One of the differences with my patch is that you attach the notifier list 
to the mm_struct while my code attached it to vmas. But I now don't 
think it was such a good idea, since it probably didn't reduce the number 
of notifier calls much.


Also, one thing that I looked at in vmaspy was notifying fork. I am not 
sure what happens on Copy-on-write with your code, but for sure C-o-w is 
problematic for shadow page tables. I thought shadow pages should just 
be invalidated when a fork happens and the caller would refill them 
after forcing C-o-w or so. So adding a notifier call there too might be 
nice.


Brice

[1] http://lkml.org/lkml/2005/4/29/175
[2] http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf



Re: [PATCH] mmu notifiers #v2

2008-01-16 Thread Brice Goglin

Andrea Arcangeli wrote:

This patch is last version of a basic implementation of the mmu
notifiers.

In short when the linux VM decides to free a page, it will unmap it
from the linux pagetables. However when a page is mapped not just by
the regular linux ptes, but also from the shadow pagetables, it's
currently unfreeable by the linux VM.

This patch allows the shadow pagetables to be dropped and the page to
be freed after that, if the linux VM decides to unmap the page from
the main ptes because it wants to swap out the page.

[...]

Comments welcome... especially from SGI/IBM/Quadrics and all other
potential users of this functionality.
  


For HPC, this should be very interesting. Managing the registration 
cache of high-speed networks from user-space is a huge mess. This 
approach should help a lot. In fact, back in 2004, I implemented 
something similar called vmaspy to update the regcache of Myrinet 
drivers. I never submitted any patch because Infiniband would have been 
the only user in the mainline kernel and they were reluctant to these 
ideas [1]. In the meantime, some of them apparently changed their mind 
since they implemented some vmops-overriding hack to do something 
similar [2]. This patch should simplify all this.


One of the difference with my patch is that you attach the notifier list 
to the mm_struct while my code attached it to vmas. But I now don't 
think it was such a good idea since it probably didn't reduce the number 
of notifier calls a lot.


Also, one thing that I looked at in vmaspy was notifying fork. I am not 
sure what happens on Copy-on-write with your code, but for sure C-o-w is 
problematic for shadow page tables. I thought shadow pages should just 
be invalidated when a fork happens and the caller would refill them 
after forcing C-o-w or so. So adding a notifier call there too might be 
nice.


Brice

[1] http://lkml.org/lkml/2005/4/29/175
[2] http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mmu notifiers #v2

2008-01-16 Thread Andrea Arcangeli
On Wed, Jan 16, 2008 at 10:01:32AM +0100, Brice Goglin wrote:
 One of the difference with my patch is that you attach the notifier list to 
 the mm_struct while my code attached it to vmas. But I now don't think it 
 was such a good idea since it probably didn't reduce the number of notifier 
 calls a lot.

Thanks for raising this topic.

Notably KVM also would be a bit more optimal with the notifier in the
vma and that was the original implementation too. It's not a sure
thing that it has to be in the mm.

The quadrics patch does a mixture, it attaches it to the mm but then
it pretends to pass the vma down to the method, and it's broken doing
so, like during munmap where it passes the first vma being unmapped
but not all the later ones in the munmap range.

If we want to attach it to the vma, I think the vma should be passed
as parameter instead of the mm. In some places like
apply_to_page_range the vma isn't even available and I found a little
dirty to run a find_vma inside a #ifdef CONFIG_MMU_NOTIFIER.

The only thing the vma could be interesting about are the protection
bits for things like update_range in the quadrics patch where they
prefetch their secondary tlb. But again if we want to do that, we need
to hook inside unmap_vmas and to pass all the different vmas and not
just the first one touched by unmap_vmas. unmap_vmas is _plural_ not
singular ;).

In the end attaching to mm avoided solving all the above troubles and
provided a strightforward implementation where I would need a single
call to mmu_notifier_register and other minor advantages like that and
not much downside.

But certainly the mm vs vma decision wasn't trivial (I switched back
and forth a few times from vma to mm and back) and if people thinks
this shall be in the vma I can try again but it won't be as a
strightforward patch as for the mm.

One benefit is for example is that it could go in the memslot and
effectively the notifier-memslot conversion would be just a
containerof instead of a search over the memslots. Locking aside.

 Also, one thing that I looked at in vmaspy was notifying fork. I am not 
 sure what happens on Copy-on-write with your code, but for sure C-o-w is 
 problematic for shadow page tables. I thought shadow pages should just be 
 invalidated when a fork happens and the caller would refill them after 
 forcing C-o-w or so. So adding a notifier call there too might be nice.

There can't be any cows right now in KVM VM backing store, that's why
it's enough to get full swapping working fine. For example I think
we'll need to add more notifiers to handle swapping of MAP_PRIVATE non
linear tmpfs shared pages properly (and it won't be an issue with
fork() but with after the fact sharing).

Right now I'm more interested in the interface, for the invalidates,
things like mm vs vma, the places where we hook under pte spinlock,
things like that, then the patch can hopefully be merged and extended
with more methods like -change_protection_page/range and added to cow
etc...
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mmu notifiers #v2

2008-01-16 Thread Rik van Riel
On Sun, 13 Jan 2008 17:24:18 +0100
Andrea Arcangeli [EMAIL PROTECTED] wrote:

 In my basic initial patch I only track the tlb flushes which should be
 the minimum required to have a nice linux-VM controlled swapping
 behavior of the KVM gphysical memory. 

I have a vaguely related question on KVM swapping.

Do page accesses inside KVM guests get propagated to the host
OS, so Linux can choose a reasonable page for eviction, or is
the pageout of KVM guest pages essentially random?

-- 
All rights reversed.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mmu notifiers #v2

2008-01-16 Thread Izik Eidus

Rik van Riel wrote:

On Sun, 13 Jan 2008 17:24:18 +0100
Andrea Arcangeli [EMAIL PROTECTED] wrote:

  

In my basic initial patch I only track the tlb flushes which should be
the minimum required to have a nice linux-VM controlled swapping
behavior of the KVM gphysical memory. 



I have a vaguely related question on KVM swapping.

Do page accesses inside KVM guests get propagated to the host
OS, so Linux can choose a reasonable page for eviction, or is
the pageout of KVM guest pages essentially random?

  
Right now, when KVM removes a pte from the shadow cache, it marks the
page that this pte pointed to as accessed.
That was a good solution until the mmu notifiers, because the pages were
pinned and couldn't be swapped to disk.
So now it will have to do something more sophisticated, or at least mark
as accessed every page pointed to by a pte that gets inserted into the
shadow cache.



Re: [PATCH] mmu notifiers #v2

2008-01-15 Thread Andrea Arcangeli
On Wed, Jan 16, 2008 at 07:18:53AM +1100, Benjamin Herrenschmidt wrote:
> Do you have cases where it's -not- called with the PTE lock held ?

For invalidate_page no because currently it's only called next to the
ptep_get_and_clear that modifies the pte and requires the pte
lock. invalidate_range/release are called w/o pte lock held.


Re: [PATCH] mmu notifiers #v2

2008-01-15 Thread Benjamin Herrenschmidt

On Tue, 2008-01-15 at 13:44 +0100, Andrea Arcangeli wrote:
> On Mon, Jan 14, 2008 at 12:02:42PM -0800, Christoph Lameter wrote:
> > Hmmm... In most of the callsites we hold a writelock on mmap_sem right?
> 
> Not in all, like Marcelo pointed out in kvm-devel, so the lowlevel
> locking can't rely on the VM locks.
> 
> About your request to schedule in the mmu notifier methods this is not
> feasible right now, the notifier is often called with the pte
> spinlocks held. I wonder if you can simply post/queue an event like a
> softirq/pdflush.

Do you have cases where it's -not- called with the PTE lock held ?
 
Ben.




Re: [PATCH] mmu notifiers #v2

2008-01-15 Thread Andrea Arcangeli
On Mon, Jan 14, 2008 at 12:02:42PM -0800, Christoph Lameter wrote:
> Hmmm... In most of the callsites we hold a writelock on mmap_sem right?

Not in all, like Marcelo pointed out in kvm-devel, so the lowlevel
locking can't rely on the VM locks.

About your request to schedule in the mmu notifier methods this is not
feasible right now, the notifier is often called with the pte
spinlocks held. I wonder if you can simply post/queue an event like a
softirq/pdflush.

> Passing mm is fine as long as mmap_sem is held.

mmap_sem is not held, but don't worry, "mm" can't go away under the mmu
notifier, so it's ok. It's just that the KVM methods never use "mm"
at all (container_of translates the struct mmu_notifier to a struct
kvm, and there is the mm in kvm->mm too). Perhaps others don't save
the "mm" in the container where the mmu_notifier is embedded,
so I left mm as a parameter to the methods.

> Hmmm... this is ptep_clear_flush? What about the other uses of 
> flush_tlb_page in asm-generic/pgtable.h and related uses in arch code?

This is not necessarily a 1:1 relationship with the tlb
flushes. Otherwise they'd be the tlb-notifiers not the mmu-notifiers.

The other methods in pgtable.h are not dropping a user page from
the "mm". That's the invalidate case right now. Other methods will not
call into invalidate_page, but you're welcome to add other methods and
call them from other ptep_* functions if you're interested about being
notified about more than just the invalidates of the "mm".

Is invalidate_page/range a clear enough method name to explain when
the ptes and tlb entries have been dropped for such page/range mapped
in userland in that address/range?

> (would help if your patches would mention the function name in the diff 
> headers)

my patches use git diff defaults I guess, and they mention the
function name in all other places; it's just that git isn't smart enough
to catch the function name in that single place, it's ok.

> > +#define mmu_notifier(function, mm, args...)
> > \
> > +   do {\
> > +   struct mmu_notifier *__mn;  \
> > +   struct hlist_node *__n; \
> > +   \
> > +   hlist_for_each_entry(__mn, __n, &(mm)->mmu_notifier, hlist) \
> > +   if (__mn->ops->function)\
> > +   __mn->ops->function(__mn, mm, args);\
> > +   } while (0)
> 
> Does this have to be inline? ptep_clear_flush will become quite big

Inline makes the patch smaller and it avoids a call in the common case
that the mmu_notifier list is empty. Perhaps I could add a:

 if (unlikely(!list_empty(&(mm)->mmu_notifier))) {
...
 }

so gcc could offload the internal block in a cold-icache region of .text.

I think at least an unlikely(!list_empty(&(mm)->mmu_notifier)) check
has to be inline. Currently there isn't such check because I'm unsure
if it really makes sense. The idea is that if you really care to
optimize this you'll use self-modifying code to turn a nop into a call
when a certain method is armed. That's an extreme optimization though,
current code shouldn't be measurable already when disarmed.


Re: [PATCH] mmu notifiers #v2

2008-01-14 Thread Benjamin Herrenschmidt

On Mon, 2008-01-14 at 12:02 -0800, Christoph Lameter wrote:
> On Sun, 13 Jan 2008, Andrea Arcangeli wrote:
> 
> > About the locking perhaps I'm underestimating it, but by following the
> > TLB flushing analogy, by simply clearing the shadow ptes (with kvm
> > mmu_lock spinlock to avoid racing with other vcpu spte accesses of
> > course) and flushing the shadow-pte after clearing the main linux pte,
> > it should be enough to serialize against shadow-pte page faults that
> > would call into get_user_pages. Flushing the host TLB before or after
> > the shadow-ptes shouldn't matter.
> 
> Hmmm... In most of the callsites we hold a writelock on mmap_sem right?

Not in unmap_mapping_range() afaik.

> > Comments welcome... especially from SGI/IBM/Quadrics and all other
> > potential users of this functionality.
> 
> > There are also certain details I'm uncertain about, like passing 'mm'
> > to the lowlevel methods, my KVM usage of the invalidate_page()
> > notifier for example only uses 'mm' for a BUG_ON for example:
> 
> Passing mm is fine as long as mmap_sem is held.

Passing mm is always a good idea, regardless of the mmap_sem, it can be
useful for lots of other things :-)

> > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> > --- a/include/asm-generic/pgtable.h
> > +++ b/include/asm-generic/pgtable.h
> > @@ -86,6 +86,7 @@ do {  
> > \
> > pte_t __pte;\
> > __pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep);  \
> > flush_tlb_page(__vma, __address);   \
> > +   mmu_notifier(invalidate_page, (__vma)->vm_mm, __address);   \
> > __pte;  \
> >  })
> >  #endif
> 
> Hmmm... this is ptep_clear_flush? What about the other uses of 
> flush_tlb_page in asm-generic/pgtable.h and related uses in arch code?
> (would help if your patches would mention the function name in the diff 
> headers)

Note that last I looked, a lot of these were stale. Might be time to
resume my spring/summer cleaning of page table accessors...

> > +#define mmu_notifier(function, mm, args...)
> > \
> > +   do {\
> > +   struct mmu_notifier *__mn;  \
> > +   struct hlist_node *__n; \
> > +   \
> > +   hlist_for_each_entry(__mn, __n, &(mm)->mmu_notifier, hlist) \
> > +   if (__mn->ops->function)\
> > +   __mn->ops->function(__mn, mm, args);\
> > +   } while (0)
> 
> Does this have to be inline? ptep_clear_flush will become quite big
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [EMAIL PROTECTED]  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: [EMAIL PROTECTED]



Re: [PATCH] mmu notifiers #v2

2008-01-14 Thread Christoph Lameter
On Sun, 13 Jan 2008, Andrea Arcangeli wrote:

> About the locking perhaps I'm underestimating it, but by following the
> TLB flushing analogy, by simply clearing the shadow ptes (with kvm
> mmu_lock spinlock to avoid racing with other vcpu spte accesses of
> course) and flushing the shadow-pte after clearing the main linux pte,
> it should be enough to serialize against shadow-pte page faults that
> would call into get_user_pages. Flushing the host TLB before or after
> the shadow-ptes shouldn't matter.

Hmmm... In most of the callsites we hold a writelock on mmap_sem right?

> Comments welcome... especially from SGI/IBM/Quadrics and all other
> potential users of this functionality.

> There are also certain details I'm uncertain about, like passing 'mm'
> to the lowlevel methods, my KVM usage of the invalidate_page()
> notifier for example only uses 'mm' for a BUG_ON for example:

Passing mm is fine as long as mmap_sem is held.

> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -86,6 +86,7 @@ do {
> \
>   pte_t __pte;\
>   __pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep);  \
>   flush_tlb_page(__vma, __address);   \
> + mmu_notifier(invalidate_page, (__vma)->vm_mm, __address);   \
>   __pte;  \
>  })
>  #endif

Hmmm... this is ptep_clear_flush? What about the other uses of 
flush_tlb_page in asm-generic/pgtable.h and related uses in arch code?
(would help if your patches would mention the function name in the diff 
headers)

> +#define mmu_notifier(function, mm, args...)  \
> + do {\
> + struct mmu_notifier *__mn;  \
> + struct hlist_node *__n; \
> + \
> + hlist_for_each_entry(__mn, __n, &(mm)->mmu_notifier, hlist) \
> + if (__mn->ops->function)\
> + __mn->ops->function(__mn, mm, args);\
> + } while (0)

Does this have to be inline? ptep_clear_flush will become quite big


Re: [PATCH] mmu notifiers #v2

2008-01-13 Thread Benjamin Herrenschmidt

On Sun, 2008-01-13 at 17:24 +0100, Andrea Arcangeli wrote:
> Hello,
> 
> This patch is the latest version of a basic implementation of the mmu
> notifiers.
> 
> In short when the linux VM decides to free a page, it will unmap it
> from the linux pagetables. However when a page is mapped not just by
> the regular linux ptes, but also from the shadow pagetables, it's
> currently unfreeable by the linux VM.
> 
> This patch allows the shadow pagetables to be dropped and the page to
> be freed after that, if the linux VM decides to unmap the page from
> the main ptes because it wants to swap out the page.

Another potential user of that I can see is the DRM. Nowadays, graphic
cards essentially have an MMU on chip, and can do paging. It would be
nice to be able to map user objects in them without having to lock them
down using your callback to properly mark them cast out on the card.

Ben.



