Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-14 Thread Jack Steiner
On Tue, May 13, 2008 at 10:43:59PM -0700, Benjamin Herrenschmidt wrote:
> On Tue, 2008-05-13 at 22:14 +1000, Nick Piggin wrote:
> > ea.
> > 
> > I don't see why you're bending over so far backwards to accommodate
> > this GRU thing that we don't even have numbers for and could actually
> > potentially be batched up in other ways (eg. using mmu_gather or
> > mmu_gather-like idea).
> 
> I agree, we're better off generalizing the mmu_gather batching
> instead...

Unfortunately, we are at least several months away from being able to
provide numbers to justify batching - assuming it is really needed.  We need
large systems running real user workloads. I wish we had that available
right now, but we don't.

It also depends on what you mean by "no batching". If you mean that the
notifier gets called for each pte that is removed from the page table, then
the overhead is clearly very high for some operations. Consider the unmap of
a very large object. A TLB flush per page will be too costly.

However, something based on the mmu_gather seems like it should provide
exactly what is needed to do efficient flushing of the TLB. The GRU does not
require that it be called in a sleepable context. As long as the notifier
callout provides the mmu_gather and vaddr range being flushed, the GRU can
do the efficiently do the rest.

> 
> I had some never-finished patches to use the mmu_gather for pretty much
> everything except single page faults, tho various subtle differences
> between archs and lack of time caused me to let them take the dust and
> not finish them...
> 
> I can try to dig some of that out when I'm back from my current travel,
> though it's probably worth re-doing from scratch now.
> 
> Ben.
> 


-- jack

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 001/001] mmu-notifier-core v17

2008-05-12 Thread Jack Steiner
On Fri, May 09, 2008 at 09:32:30PM +0200, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <[EMAIL PROTECTED]>
> 
> With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to
> pages. There are secondary MMUs (with secondary sptes and secondary
> tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
> spte in mmu-notifier context, I mean "secondary pte". In GRU case
> there's no actual secondary pte and there's only a secondary tlb
> because the GRU secondary MMU has no knowledge about sptes and every
> secondary tlb miss event in the MMU always generates a page fault that
> has to be resolved by the CPU (this is not the case of KVM where the a
> secondary tlb miss will walk sptes in hardware and it will refill the
>...

FYI, I applied to patch to a tree that has the GRU driver. All regression
tests passed.

--- jack

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 08 of 11] anon-vma-rwsem

2008-05-07 Thread Jack Steiner
> And I don't see a problem in making the conversion from
> spinlock->rwsem only if CONFIG_XPMEM=y as I doubt XPMEM works on
> anything but ia64.
 
That is currently true but we are also working on XPMEM for x86_64.
The new XPMEM code should be posted within a few weeks.

--- jack

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 11] mmu-notifier-core

2008-05-05 Thread Jack Steiner
On Mon, May 05, 2008 at 08:34:05PM +0200, Andrea Arcangeli wrote:
> On Mon, May 05, 2008 at 12:25:06PM -0500, Jack Steiner wrote:
> > Agree. My apologies... I should have caught it.
> 
> No problem.
> 
> > __mmu_notifier_register/__mmu_notifier_unregister seems like a better way to
> > go, although either is ok.
> 
> If you also like __mmu_notifier_register more I'll go with it. The
> bitflags seems like a bit of overkill as I can't see the need of any
> other bitflag other than this one and they also can't be removed as
> easily in case you'll find a way to call it outside the lock later.
> 
> > Let me finish my testing. At one time, I did not use ->release but
> > with all the locking & teardown changes, I need to do some reverification.

I finished testing & everything looks good. I do use the ->release callout but
mainly as a performance hint that teardown is in progress & that TLB flushing is
no longer needed. (GRU TLB entries are tagged with a task-specific ID that will
not be reused until a full TLB purge is done. This eliminates the requirement
to purge at task-exit.)


Normally, a notifier is registered when a GRU segment is mmaped, and 
unregistered
when the segment is unmapped. Well behaved tasks will not have a GRU or
a notifier when exit starts.

If a task fails to unmap a GRU segment, they still exist at the start of
exit. On the ->release callout, I set a flag in the container of my
mmu_notifier that exit has started. As VMA are cleaned up, TLB flushes
are skipped because of the flag is set. When the GRU VMA is deleted, I free
my structure containing the notifier.

I _think_ works. Do you see any problems?

I should also mention that I have an open-coded function that possibly
belongs in mmu_notifier.c. A user is allowed to have multiple GRU segments.
Each GRU has a couple of data structures linked to the VMA. All, however,
need to share the same notifier. I currently open code a function that
scans the notifier list to determine if a GRU notifier already exists.
If it does, I update a refcnt & use it. Otherwise, I register a new
one. All of this is protected by the mmap_sem.

Just in case I mangled the above description, I'll attach a copy of the GRU 
mmuops
code.

--- jack


z
Description: Unix compressed data
-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 11] mmu-notifier-core

2008-05-05 Thread Jack Steiner
On Mon, May 05, 2008 at 07:14:34PM +0200, Andrea Arcangeli wrote:
> On Mon, May 05, 2008 at 11:21:13AM -0500, Jack Steiner wrote:
> > The GRU does the registration/deregistration of mmu notifiers from 
> > mmap/munmap.
> > At this point, the mmap_sem is already held writeable. I hit a deadlock
> > in mm_lock.
> 
> It'd been better to know about this detail earlier,

Agree. My apologies... I should have caught it.


> but frankly this
> is a minor problem, the important thing is we all agree together on
> the more difficult parts ;).
> 
> > A quick fix would be to do one of the following:
> > 
> > - move the mmap_sem locking to the caller of the [de]registration 
> > routines.
> >   Since the first/last thing done in mm_lock/mm_unlock is to
> >   acquire/release mmap_sem, this change does not cause major changes.
> 
> I don't like this solution very much. Nor GRU nor KVM will call
> mmu_notifier_register inside the mmap_sem protected sections, so I
> think the default mmu_notifier_register should be smp safe by itself
> without requiring additional locks to be artificially taken externally
> (especially because the need for mmap_sem in write mode is a very
> mmu_notifier internal detail).
> 
> > - add a flag to mmu_notifier_[un]register routines to indicate
> >   if mmap_sem is already locked.
> 
> The interface would change like this:
> 
> #define MMU_NOTIFIER_REGISTER_MMAP_SEM (1<<0)
> void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm,
>  unsigned long mmu_notifier_flags);

That works...


> 
> A third solution is to add:
> 
> /*
>  * This must can be called instead of mmu_notifier_register after
>  * taking the mmap_sem in write mode (read mode isn't enough).
>  */
> void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm);
> 
> Do you still prefer the bitflag or you prefer
> __mmu_notifier_register. It's ok either ways, except
> __mmu_notifier_reigster could be removed in a backwards compatible
> way, the bitflag can't.
> 
> > I've temporarily deleted the mm_lock locking of mmap_sem and am continuing 
> > to
> > test. More later

__mmu_notifier_register/__mmu_notifier_unregister seems like a better way to
go, although either is ok.


> 
> Sure! In the meantime go ahead this way.
> 
> Another very minor change I've been thinking about is to make
> ->release not mandatory. It happens that with KVM ->release isn't
> strictly required because after mm_users reaches 0, no guest could
> possibly run anymore. So I'm using ->release only for debugging by
> placing -1UL in the root shadow pagetable, to be sure ;). So because
> at least one user won't strictly require ->release being consistent in
> having all method optional may be nicer. Alternatively we could make
> them all mandatory and if somebody doesn't need one of the methods it
> should implement it as a dummy function. Both ways have pros and cons,
> but they don't make any difference to us in practice. If I've to
> change the patch for the mmap_sem taken during registration I may as
> well cleanup this minor bit.
 
Let me finish my testing. At one time, I did not use ->release but
with all the locking & teardown changes, I need to do some reverification.


--- jack

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 11] mmu-notifier-core

2008-05-05 Thread Jack Steiner
On Fri, May 02, 2008 at 05:05:04PM +0200, Andrea Arcangeli wrote:
> # HG changeset patch
> # User Andrea Arcangeli <[EMAIL PROTECTED]>
> # Date 1209740175 -7200
> # Node ID 1489529e7b53d3f2dab8431372aa4850ec821caa
> # Parent  5026689a3bc323a26d33ad882c34c4c9c9a3ecd8
> mmu-notifier-core
 


I upgraded to the latest mmu notifier patch & hit a deadlock. (Sorry -
I should have seen this  earlier but I haven't tracked the last couple
of patches).

The GRU does the registration/deregistration of mmu notifiers from mmap/munmap.
At this point, the mmap_sem is already held writeable. I hit a deadlock
in mm_lock.

A quick fix would be to do one of the following:

- move the mmap_sem locking to the caller of the [de]registration 
routines.
  Since the first/last thing done in mm_lock/mm_unlock is to
  acquire/release mmap_sem, this change does not cause major changes.

- add a flag to mmu_notifier_[un]register routines to indicate
  if mmap_sem is already locked.


I've temporarily deleted the mm_lock locking of mmap_sem and am continuing to
test. More later


--- jack

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 00 of 11] mmu notifier #v15

2008-05-03 Thread Jack Steiner
On Fri, May 02, 2008 at 05:05:03PM +0200, Andrea Arcangeli wrote:
> Hello everyone,
> 
> 1/11 is the latest version of the mmu-notifier-core patch.
> 
> As usual all later 2-11/11 patches follows but those aren't meant for 2.6.26.
> 

Not sure why -mm is different, but I get compile errors w/o the following...

--- jack


Index: linux/mm/mmu_notifier.c
===
--- linux.orig/mm/mmu_notifier.c2008-05-02 16:54:52.780576831 -0500
+++ linux/mm/mmu_notifier.c 2008-05-02 16:56:38.817719509 -0500
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * This function can't run concurrently against mmu_notifier_register

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-23 Thread Jack Steiner

You may have spotted this already. If so, just ignore this.

It looks like there is a bug in copy_page_range() around line 667.
It's possible to do a mmu_notifier_invalidate_range_start(), then
return -ENOMEM w/o doing a corresponding mmu_notifier_invalidate_range_end().

--- jack



-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13

2008-04-22 Thread Jack Steiner
On Tue, Apr 22, 2008 at 03:51:16PM +0200, Andrea Arcangeli wrote:
> Hello,
> 
> This is the latest and greatest version of the mmu notifier patch #v13.
> 

FWIW, I have updated the GRU driver to use this patch (plus the fixeups).
No problems. AFAICT, everything works.


--- jack

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Jack Steiner
On Tue, Apr 22, 2008 at 06:07:27PM -0500, Robin Holt wrote:
> > The only other change I did has been to move mmu_notifier_unregister
> > at the end of the patchset after getting more questions about its
> > reliability and I documented a bit the rmmod requirements for
> > ->release. we'll think later if it makes sense to add it, nobody's
> > using it anyway.
> 
> XPMEM is using it.  GRU will be as well (probably already does).

Yeppp.

The GRU driver unregisters the notifier when all GRU mappings
are unmapped. I could make it work either way - either with or without
an unregister function. However, unregister is the most logical
action to take when all mappings have been destroyed.


--- jack

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-04 Thread Jack Steiner
On Tue, Mar 04, 2008 at 11:35:32AM +0100, Peter Zijlstra wrote:
> 
> On Mon, 2008-03-03 at 13:15 -0600, Jack Steiner wrote:
> 
> > I haven't thought about locking requirements for the radix tree. Most 
> > accesses
> > would be read-only & updates infrequent. Any chance of an RCU-based radix
> > implementation?  Otherwise, don't we add the potential for hot 
> > locks/cachelines
> > for threaded applications ???
> 
> The current radix tree implementation in the kernel is RCU capable. We
> just don't have many RCU users yet.

Ahhh. You are right. I thought I looked but obviously missed it.

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-03 Thread Jack Steiner
On Mon, Mar 03, 2008 at 07:45:17PM +0100, Nick Piggin wrote:
> On Mon, Mar 03, 2008 at 12:06:05PM -0600, Jack Steiner wrote:
> > On Mon, Mar 03, 2008 at 05:59:10PM +0100, Nick Piggin wrote:
> > > > Maintaining a long-term reference on a page is a problem. The GRU does 
> > > > not
> > > > currently maintain tables to track the pages for which dropins have 
> > > > been done.
> > > > 
> > > > The GRU has a large internal TLB and is designed to reference up to 8PB 
> > > > of
> > > > memory. The size of the tables to track this many referenced pages 
> > > > would be
> > > > a problem (at best).
> > > 
> > > Is it any worse a problem than the pagetables of the processes which have
> > > their virtual memory exported to GRU? AFAIKS, no; it is on the same
> > > magnitude of difficulty. So you could do it without introducing any
> > > fundamental problem (memory usage might be increased by some constant
> > > factor, but I think we can cope with that in order to make the core patch
> > > really nice and simple).
> > 
> > Functionally, the GRU is very close to what I would consider to be the
> > "standard TLB" model. Dropins and flushs map closely to processor dropins
> > and flushes for cpus.  The internal structure of the GRU TLB is identical to
> > the TLB of existing cpus.  Requiring the GRU driver to track dropins with
> > long term page references seems to me a deviation from having the basic
> > mmuops support a "standard TLB" model. AFAIK, no other processor requires
> > this.
> 
> That is because the CPU TLBs have the mmu_gather batching APIs which
> avoid the problem. It would be possible to do something similar for
> GRU which would involve taking a reference for each page-to-be-invalidated
> in invalidate_page, and release them when you invalidate_range. Or else
> do some other scheme which makes mmu notifiers work similarly to the
> mmu gather API. But not just go an invent something completely different
> in the form of this invalidate_begin,clear linux pte,invalidate_end API.

Correct. If the mmu_gather were passed on the mmuops callout and the callout 
were
done at the same point as the tlb_finish_mmu(), the GRU could
efficiently work w/o the range invalidates. A range invalidate might still
be slightly more efficient but not measureable so. The net difference is
not worth the extra complexity of range callouts.


> 
> 
> > Tracking TLB dropins (and long term page references) could be done but it
> > adds significant complexity and scaling issues. The size of the tables to
> > track many TB (to PB) of memory can get large. If the memory is being
> > referenced by highly threaded applications, then the problem becomes even
> > more complex. Either tables must be replicated per-thread (and require even
> > more memory), or the table structure becomes even more complex to deal with
> > node locality, cacheline bouncing, etc.
> 
> I don't think it would be that significant in terms of complexity or
> scaling.
> 
> For a quick solution, you could stick a radix tree in each of your mmu
> notifiers registered (ie. one per mm), which is indexed on virtual address
> >> PAGE_SHIFT, and returns the struct page *. Size is no different than
> page tables, and locking is pretty scalable.
> 
> After that, I would really like to see whether the numbers justify
> larger changes.

I'm still concerned about performance. Each dropin would first have to access
an additional data structure that would most likely be non-node-local and
non-cache-resident. The net effect would be measurable but not a killer.

I haven't thought about locking requirements for the radix tree. Most accesses
would be read-only & updates infrequent. Any chance of an RCU-based radix
implementation?  Otherwise, don't we add the potential for hot locks/cachelines
for threaded applications ???

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-03 Thread Jack Steiner
On Mon, Mar 03, 2008 at 08:09:49PM +0200, Avi Kivity wrote:
> Jack Steiner wrote:
> >The range invalidates have a performance advantage for the GRU. TLB 
> >invalidates
> >on the GRU are relatively slow (usec) and interfere somewhat with the 
> >performance
> >of other active GRU instructions. Invalidating a large chunk of addresses 
> >with
> >a single GRU TLBINVAL operation is must faster than issuing a stream of 
> >single
> >page TLBINVALs.
> >
> >I expect this performance advantage will also apply to other users of 
> >mmuops.
> >  
> 
> In theory this would apply to kvm as well (coalesce tlb flush IPIs, 
> lookup shadow page table once), but is it really a fast path?  What 
> triggers range operations for your use cases?
 

Although not frequent, an unmap of a multiple TB object could be quite painful
if each page was invalidated individually instead of 1 invalidate for the 
entire range.
This is even worse if the application is threaded and the object has been 
reference by
many GRUs (there are 16 GRU ports per node - each potentially has to be 
invalidated).

Forks (again, not frequent) would be another case.




-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-03 Thread Jack Steiner
On Mon, Mar 03, 2008 at 05:59:10PM +0100, Nick Piggin wrote:
> On Mon, Mar 03, 2008 at 09:18:59AM -0600, Jack Steiner wrote:
> > On Mon, Mar 03, 2008 at 02:10:17PM +0100, Nick Piggin wrote:
> > > On Mon, Mar 03, 2008 at 01:51:53PM +0100, Andrea Arcangeli wrote:
> > > > On Mon, Mar 03, 2008 at 04:29:34AM +0100, Nick Piggin wrote:
> > > > > to something I prefer. Others may not, but I'll post them for debate
> > > > > anyway.
> > > > 
> > > > Sure, thanks!
> > > > 
> > > > > > I didn't drop invalidate_page, because invalidate_range_begin/end
> > > > > > would be slower for usages like KVM/GRU (we don't need a begin/end
> > > > > > there because where invalidate_page is called, the VM holds a
> > > > > > reference on the page). do_wp_page should also use invalidate_page
> > > > > > since it can free the page after dropping the PT lock without losing
> > > > > > any performance (that's not true for the places where 
> > > > > > invalidate_range
> > > > > > is called).
> > > > > 
> > > > > I'm still not completely happy with this. I had a very quick look
> > > > > at the GRU driver, but I don't see why it can't be implemented
> > > > > more like the regular TLB model, and have TLB insertions depend on
> > > > > the linux pte, and do invalidates _after_ restricting permissions
> > > > > to the pte.
> > > > > 
> > > > > Ie. I'd still like to get rid of invalidate_range_begin, and get
> > > > > rid of invalidate calls from places where permissions are relaxed.
> > > > 
> > > > _begin exists because by the time _end is called, the VM already
> > > > dropped the reference on the page. This way we can do a single
> > > > invalidate no matter how large the range is. I don't see ways to
> > > > remove _begin while still invoking _end a single time for the whole
> > > > range.

The range invalidates have a performance advantage for the GRU. TLB invalidates
on the GRU are relatively slow (usec) and interfere somewhat with the 
performance
of other active GRU instructions. Invalidating a large chunk of addresses with
a single GRU TLBINVAL operation is must faster than issuing a stream of single
page TLBINVALs.

I expect this performance advantage will also apply to other users of mmuops.

> > > 
> > > Is this just a GRU problem? Can't we just require them to take a ref
> > > on the page (IIRC Jack said GRU could be changed to more like a TLB
> > > model).
> > 
> > Maintaining a long-term reference on a page is a problem. The GRU does not
> > currently maintain tables to track the pages for which dropins have been 
> > done.
> > 
> > The GRU has a large internal TLB and is designed to reference up to 8PB of
> > memory. The size of the tables to track this many referenced pages would be
> > a problem (at best).
> 
> Is it any worse a problem than the pagetables of the processes which have
> their virtual memory exported to GRU? AFAIKS, no; it is on the same
> magnitude of difficulty. So you could do it without introducing any
> fundamental problem (memory usage might be increased by some constant
> factor, but I think we can cope with that in order to make the core patch
> really nice and simple).

Functionally, the GRU is very close to what I would consider to be the
"standard TLB" model. Dropins and flushs map closely to processor dropins
and flushes for cpus.  The internal structure of the GRU TLB is identical to
the TLB of existing cpus.  Requiring the GRU driver to track dropins with
long term page references seems to me a deviation from having the basic
mmuops support a "standard TLB" model. AFAIK, no other processor requires
this.

Tracking TLB dropins (and long term page references) could be done but it
adds significant complexity and scaling issues. The size of the tables to
track many TB (to PB) of memory can get large. If the memory is being
referenced by highly threaded applications, then the problem becomes even
more complex. Either tables must be replicated per-thread (and require even
more memory), or the table structure becomes even more complex to deal with
node locality, cacheline bouncing, etc.

Try to avoid a requirement to track dropins with long term page references.


> It is going to be really easy to add more weird and wonderful notifiers
> later that deviate from our standard TLB model. It would be much harder to
> remove them. So I really want to see everyone conform to this model first.

Agree.

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v8

2008-03-03 Thread Jack Steiner
On Mon, Mar 03, 2008 at 02:10:17PM +0100, Nick Piggin wrote:
> On Mon, Mar 03, 2008 at 01:51:53PM +0100, Andrea Arcangeli wrote:
> > On Mon, Mar 03, 2008 at 04:29:34AM +0100, Nick Piggin wrote:
> > > to something I prefer. Others may not, but I'll post them for debate
> > > anyway.
> > 
> > Sure, thanks!
> > 
> > > > I didn't drop invalidate_page, because invalidate_range_begin/end
> > > > would be slower for usages like KVM/GRU (we don't need a begin/end
> > > > there because where invalidate_page is called, the VM holds a
> > > > reference on the page). do_wp_page should also use invalidate_page
> > > > since it can free the page after dropping the PT lock without losing
> > > > any performance (that's not true for the places where invalidate_range
> > > > is called).
> > > 
> > > I'm still not completely happy with this. I had a very quick look
> > > at the GRU driver, but I don't see why it can't be implemented
> > > more like the regular TLB model, and have TLB insertions depend on
> > > the linux pte, and do invalidates _after_ restricting permissions
> > > to the pte.
> > > 
> > > Ie. I'd still like to get rid of invalidate_range_begin, and get
> > > rid of invalidate calls from places where permissions are relaxed.
> > 
> > _begin exists because by the time _end is called, the VM already
> > dropped the reference on the page. This way we can do a single
> > invalidate no matter how large the range is. I don't see ways to
> > remove _begin while still invoking _end a single time for the whole
> > range.
> 
> Is this just a GRU problem? Can't we just require them to take a ref
> on the page (IIRC Jack said GRU could be changed to more like a TLB
> model).

Maintaining a long-term reference on a page is a problem. The GRU does not
currently maintain tables to track the pages for which dropins have been done.

The GRU has a large internal TLB and is designed to reference up to 8PB of
memory. The size of the tables to track this many referenced pages would be
a problem (at best).

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v7

2008-02-28 Thread Jack Steiner
> > The release should be called much earlier to allow the driver to release 
> > all resources in one go. This way each vma must be processed individually. 
> > For our gobs of memory this method may create a scaling problem on exit().
> 
> Good point, it has to be called earlier for GRU, but it's not a
> performance issue. GRU doesn't pin the pages so it should make the
> global invalidate in ->release _before_ unmap_vmas. Linux can't fault
> in the ptes anymore because mm_users is zero so there's no need of a
> ->release_begin/end, the _begin is enough.
> 

I disagree. The location of the callout IS a performance issue. In simple
comparisons of the 2 patches (Christoph's vs. Andrea's), Andrea's has a 7X
increase in the number of TLB purges being issued to the GRU. TLB flushing
is slow and can impact the performance of of tasks using the GRU.

--- jack

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-02-27 Thread Jack Steiner
>  
> > Also, what we are going to need here are not skeleton drivers
> > that just do all the *easy* bits (of registering their callbacks),
> > but actual fully working examples that do everything that any
> > real driver will need to do. If not for the sanity of the driver
> > writer, then for the sanity of the VM developers (I don't want
> > to have to understand xpmem or infiniband in order to understand
> > how the VM works).
> 
> There are 3 different drivers that can already use it but the code is 
> complex and not easy to review. Skeletons are easy to allow people to get 
> started with it.


I posted the full GRU driver late last week. It is a lot of
code & somewhat difficult to understand w/o access to full chip
specs (sorry). The code is fairly well commented &  the
parts related to TLB management should be understandable.



-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v6

2008-02-21 Thread Jack Steiner
> I really want suggestions on Jack's concern about issuing an
> invalidate per pte entry or per-pte instead of per-range. I'll answer
> that in a separate email. For KVM my patch is already close to optimal
> because each single spte invalidate requires a fixed amount of work,
> but for GRU a large invalidate-range would be more efficient.
>
> To address the GRU _valid_ concern, I can create a second version of
> my patch with range_begin/end instead of invalidate_pages, that still

I don't know how much significance to place on this data, but it is
a real data point.

I ran the GRU regression test suite on kernels with both types of
mmu_notifiers. The kernel/driver using Christoph's patch had
1/7 the number of TLB invalidates as Andrea's patch.

This reduction is due to both differences I mentioned yesterday:
- different location of callout for address space teardown
- range callouts

Unfortunately, the current driver does not allow me to quantify
which of the differences is most significant.

Also, I'll try to post the driver within the next few days. It is
still in development but it compiles and can successfully run most
workloads on a system simulator.

--- jack

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v6

2008-02-20 Thread Jack Steiner
On Wed, Feb 20, 2008 at 11:39:42AM +0100, Andrea Arcangeli wrote:
> Given Nick's comments I ported my version of the mmu notifiers to
> latest mainline. There are no known bugs AFIK and it's obviously safe
> (nothing is allowed to schedule inside rcu_read_lock taken by
> mmu_notifier() with my patch).
> 

I ported the GRU driver to use the latest #v6 patch and ran a series of
tests on it using our system simulator. The simulator is slow so true
stress or swapping is not possible - at least within a finite amount of
time.

Functionally, the #v6 patch seems to work for the GRU. However, I did
notice two significant differences that make the #v6 performance worse for
the GRU than Christoph's patch.  I think one difference is easily fixable
but the other is more difficult:

- the location of the mmu_notifier_release() callout is at a
  different place in the 2 patches. Christoph has the callout
  BEFORE the call to unmap_vmas() whereas you have it AFTER. The
  net result is that the GRU does a LOT of 1-page TLB flushes
  during process teardown.  These flushes are not done with
  Christops's patch.

- the range callouts in Christoph's patch benefit the GRU because
  multiple TLB entries can be flushed with a single GRU
  instruction (the GRU hardware supports a range flush using a
  vaddr & length).  The #v6 patch does a TLB flush for each page in
  the range.  Flushing on the GRU is slow so being able to flush
  multiple pages with a single request is a benefit.

Seems like the latter difference could be significant for other users
of mmu notifiers.


--- jack

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [patch] my mmu notifiers

2008-02-19 Thread Jack Steiner
On Wed, Feb 20, 2008 at 12:11:57AM +0100, Nick Piggin wrote:
> On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote:
> > On Tue, Feb 19, 2008 at 09:43:57AM +0100, Nick Piggin wrote:
> > > anything when changing the pte to be _more_ permissive, and I don't
> > 
> > Note that in my patch the invalidate_pages in mprotect can be
> > trivially switched to a mprotect_pages with proper params. This will
> > prevent page faults completely in the secondary MMU (there will only
> > be tlb misses after the tlb flush just like for the core linux pte),
> > and it'll allow all the secondary MMU pte blocks (512/1024 at time
> > with my PT lock design) to be updated to have proper permissions
> > matching the core linux pte.
> 
> Sorry, I realise I still didn't get this through my head yet (and also
> have not seen your patch recently). So I don't know exactly what you
> are doing...
> 
> But why does _anybody_ (why does Christoph's patches) need to invalidate
> when they are going to be more permissive? This should be done lazily by
> the driver, I would have thought.


Agree. Although for most real applications, the performance difference
is probably negligible.

--- jack

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [patch] my mmu notifiers

2008-02-19 Thread Jack Steiner
> On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote:
> > understand the need for invalidate_begin/invalidate_end pairs at all.
> 
> The need of the pairs is crystal clear to me: range_begin is needed
> for GRU _but_only_if_ range_end is called after releasing the
> reference that the VM holds on the page. _begin will flush the GRU tlb
> and at the same time it will take a mutex that will block further GRU
> tlb-miss-interrupts (no idea how they manange those nightmare locking,
> I didn't even try to add more locking to KVM and I get away with the
> fact KVM takes the pin on the page itself).

As it turns out, no actual mutex is required. _begin_ simply increments a
count of active range invalidates, _end_ decrements the count. New TLB
dropins are deferred while range callouts are active.

This would appear to be racy but the GRU has special hardware that
simplifies locking. When the GRU sees a TLB invalidate, all outstanding
misses & potentially inflight TLB dropins are marked by the GRU with a
"kill" bit. When the dropin finally occurs, the dropin is ignored & the
instruction is simply restarted. The instruction will fault again & the TLB
dropin will be repeated.  This is optimized for the case where invalidates
are rare - true for users of the GRU.


In general, though, I agree. Most users of mmu_notifiers would likely
required a mutex or something equivalent.


--- jack




-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [patch 0/6] MMU Notifiers V6

2008-02-13 Thread Jack Steiner
> GRU
> - Simple additional hardware TLB (possibly covering multiple instances of
>   Linux)
> - Needs TLB shootdown when the VM unmaps pages.
> - Determines page address via follow_page (from interrupt context) but can
>   fall back to get_user_pages().
> - No page reference possible since no page status is kept..

I applied the latest mmuops patch to a 2.6.24 kernel & updated the
GRU driver to use it. As far as I can tell, everything works ok.
Although more testing is needed, all current tests of driver functionality
are working on both a system simulator and a hardware simulator.

The driver itself is still a few weeks from being ready to post but I can
send code fragments of the portions related to mmuops or external TLB
management if anyone is interested.


--- jack

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] mmu notifiers #v5

2008-02-02 Thread Jack Steiner
On Sun, Feb 03, 2008 at 03:17:04AM +0100, Andrea Arcangeli wrote:
> On Fri, Feb 01, 2008 at 11:23:57AM -0800, Christoph Lameter wrote:
> > Yes so your invalidate_range is still some sort of dysfunctional 
> > optimization? Gazillions of invalidate_page's will have to be executed 
> > when tearing down large memory areas.
> 
> I don't know if gru can flush the external TLB references with a
> single instruction like the cpu can do by overwriting %cr3. As far as

The mechanism obviously is a little different, but the GRU can flush an
external TLB in a single read/write/flush of a cacheline (should take a few
100 nsec). Typically, an application uses a single GRU. Threaded
applications, however, could use one GRU per thread, so it is possible that
multiple TLBs must be flushed. In some cases, the external flush can be
avoided if the GRU is not currently in use by the thread.

Also, most (but not all) applications that use the GRU do not usually do
anything that requires frequent flushing (fortunately). The GRU is intended
for HPC-like applications. These don't usually do frequent map/unmap
operations or anything else that requires a lot of flushes.

I expect that KVM is a lot different.

I have most of the GRU code working with the latest mmuops patch. I still
have a list of loose ends that I'll get to next week. The most important is
the exact handling of the range invalidates. The code that I currently have
works (mostly) but has a few endcases that will cause problems. Once I
finish, I'll be glad to send you snippets of the code (or all of it) if you
would like to take a look.


> vmx/svm are concerned, you got to do some fixed amount of work on the
> rmap structures and each spte, for each 4k invalidated regardless of
> page/pages/range/ranges. You can only share the cost of the lock and
> of the memslot lookup, so you lookup and take the lock once and you
> drop 512 sptes instead of just 1. Similar to what we do in the main
> linux VM by taking the PT lock for every 512 sptes.
> 
> So you worry about gazillions of invalidate_page without taking into
> account that even if you call invalidate_range_end(0, -1ULL) KVM will
> have to mangle over 4503599627370496 rmap entries anyway. Yes calling
> 1 invalidate_range_end instead of 4503599627370496 invalidate_page,
> will be faster, but not much faster. For KVM it remains an O(N)
> operation, where N is the number of pages. I'm not so sure we need to
> worry about my invalidate_pages being limited to 512 entries.
> 
> Perhaps GRU is different, I'm just describing the KVM situation here.
> 
> As far as KVM is concerned it's more sensible to be able to scale when
> there are 1024 kvm threads on 1024 cpus and each kvm-vcpu is accessing
> a different guest physical page (especially when new chunks of memory
> are allocated for the first time) that won't collide the whole time on
> naturally aligned 2M chunks of virtual addresses.
> 
> > And that would not be enough to hold of new references? With small tweaks 
> > this should work with a common scheme. We could also redefine the role 
> > of _start and _end slightly to just require that the refs are removed when 
> > _end completes. That would allow the KVM page count ref to work as is now 
> > and would avoid the individual invalidate_page() callouts.
> 
> I can already only use _end and ignore _start, only remaining problem
> is this won't be 100% coherent (my patch is 100% coherent thanks to PT
> lock implicitly serializing follow_page/get_user_pages of the KVM/GRU
> secondary MMU faults). My approach give a better peace of mind with
> full scalability and no lock recursion when it's the KVM page fault
> that invokes get_user_pages that invokes the linux page fault routines
> that will try to execute _start taking the lock to block the page
> fault that is already running...
> 
> > > > The GRU has no page table on its own. It populates TLB entries on 
> > > > demand 
> > > > using the linux page table. There is no way it can figure out when to 
> > > > drop page counts again. The invalidate calls are turned directly into 
> > > > tlb 
> > > > flushes.
> > > 
> > > Yes, this is why it can't serialize follow_page with only the PT lock
> > > with your patch. KVM may do it once you add start,end to range_end
> > > only thanks to the additional pin on the page.
> > 
> > Right but that pin requires taking a refcount which we cannot do.
> 
> GRU can use my patch without the pin. XPMEM obviously can't use my
> patch as my invalidate_page[s] are under the PT lock (a feature to fit
> GRU/KVM in the simplest way), this is why an incremental patch adding
> invalidate_range_start/end would be required to support XPMEM too.

--- jack

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing 

Re: [kvm-devel] [patch 1/3] mmu_notifier: Core code

2008-01-31 Thread Jack Steiner
On Thu, Jan 31, 2008 at 06:39:19PM -0800, Christoph Lameter wrote:
> On Thu, 31 Jan 2008, Robin Holt wrote:
> 
> > Jack has repeatedly pointed out needing an unregister outside the
> > mmap_sem.  I still don't see the benefit to not having the lock in the mm.
> 
> I never understood why this would be needed. ->release removes the 
> mmu_notifier right now.

Christoph -

We discussed this earlier this week. Here is part of the mail:



> > There currently is no __mmu_notifier_unregister(). Oversite???
>
> No need. mmu_notifier_release implies an unregister and I think that is
> the most favored way to release resources since it deals with the RCU
> quiescent period.


I currently unlink the mmu_notifier when the last GRU mapping is closed. For
example, if a user does a:

gru_create_context();
...
gru_destroy_context();

the mmu_notifier is unlinked and all task tables allocated
by the driver are freed. Are you suggesting that I leave tables
allocated until the task terminates??

Why is that better? What problem do I cause by trying
to free tables as soon as they are not needed?


---

> Christoph responded:
> > the mmu_notifier is unlinked and all task tables allocated
> > by the driver are freed. Are you suggesting that I leave tables
> > allocated until the task terminates??
>
> You need to leave the mmu_notifier structure allocated until the next
> quiescent rcu period unless you use the release notifier.

I assumed that I would need to use call_rcu() or synchronize_rcu()
before the table is actually freed. That's still on my TODO list.



--- jack

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [patch 1/3] mmu_notifier: Core code

2008-01-31 Thread Jack Steiner
On Thu, Jan 31, 2008 at 08:24:44PM -0600, Robin Holt wrote:
> On Thu, Jan 31, 2008 at 07:56:12PM -0600, Jack Steiner wrote:
> > > @@ -2033,6 +2034,7 @@ void exit_mmap(struct mm_struct *mm)
> > >   unsigned long end;
> > >  
> > >   /* mm's last user has gone, and its about to be pulled down */
> > > + mmu_notifier(invalidate_all, mm, 0);
> > >   arch_exit_mmap(mm);
> > >  
> > 
> > The name of the "invalidate_all" callout is not very descriptive.
> > Why not use "exit_mmap". That matches the function name, the arch callout
> > and better describes what is happening.
> 
> This got removed in a later patch.  We now only do the release.

Found it, thanks.

Christoph, is it time to post a new series of patches? I've got
as many fixup patches as I have patches in the original posting.



-- jack


-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [patch 1/3] mmu_notifier: Core code

2008-01-31 Thread Jack Steiner
> @@ -2033,6 +2034,7 @@ void exit_mmap(struct mm_struct *mm)
>   unsigned long end;
>  
>   /* mm's last user has gone, and its about to be pulled down */
> + mmu_notifier(invalidate_all, mm, 0);
>   arch_exit_mmap(mm);
>  

The name of the "invalidate_all" callout is not very descriptive.
Why not use "exit_mmap". That matches the function name, the arch callout
and better describes what is happening.


--- jack



-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-30 Thread Jack Steiner
On Wed, Jan 30, 2008 at 11:41:29AM -0800, Christoph Lameter wrote:
> On Wed, 30 Jan 2008, Jack Steiner wrote:
> 
> > I see what you mean. I need to review to mail to see why this changed
> > but in the original discussions with Christoph, the invalidate_range
> > callouts were suppose to be made BEFORE the pages were put on the freelist.
> 
> Seems that we cannot rely on the invalidate_ranges for correctness at all?
> We need to have invalidate_page() always. invalidate_range() is only an 
> optimization.
> 

I don't understand your point "an optimization". How would invalidate_range
as currently defined be correctly used?

It _looks_ like it would work only if xpmem/gru/etc takes a refcnt on
the page & drops it when invalidate_range is called. That may work (not sure)
for xpmem but not for the GRU.

--- jack


-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [patch 1/6] mmu_notifier: Core code

2008-01-30 Thread Jack Steiner
On Wed, Jan 30, 2008 at 04:37:49PM +0100, Andrea Arcangeli wrote:
> On Tue, Jan 29, 2008 at 06:29:10PM -0800, Christoph Lameter wrote:
> > +void mmu_notifier_release(struct mm_struct *mm)
> > +{
> > +   struct mmu_notifier *mn;
> > +   struct hlist_node *n, *t;
> > +
> > +   if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > +   rcu_read_lock();
> > +   hlist_for_each_entry_safe_rcu(mn, n, t,
> > + &mm->mmu_notifier.head, hlist) {
> > +   hlist_del_rcu(&mn->hlist);
> 
> This will race and kernel crash against mmu_notifier_register in
> SMP. You should resurrect the per-mmu_notifier_head lock in my last
> patch (except it can be converted from a rwlock_t to a regular
> spinlock_t) and drop the mmap_sem from
> mmu_notifier_register/unregister.

Agree.

That will also resolve the problem we discussed yesterday. 
I want to unregister my mmu_notifier when a GRU segment is
unmapped. This would not necessarily be at task termination.

However, the mmap_sem is already held for write by the core
VM at the point I would call the unregister function.
Currently, there is no __mmu_notifier_unregister() defined.

Moving to a different lock solves the problem.


-- jack

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-30 Thread Jack Steiner
On Wed, Jan 30, 2008 at 02:37:20PM +0100, Andrea Arcangeli wrote:
> On Tue, Jan 29, 2008 at 06:28:05PM -0600, Jack Steiner wrote:
> > On Tue, Jan 29, 2008 at 04:20:50PM -0800, Christoph Lameter wrote:
> > > On Wed, 30 Jan 2008, Andrea Arcangeli wrote:
> > > 
> > > > > invalidate_range after populate allows access to memory for which 
> > > > > ptes 
> > > > > were zapped and the refcount was released.
> > > > 
> > > > The last refcount is released by the invalidate_range itself.
> > > 
> > > That is true for your implementation and to address Robin's issues. Jack: 
> > > Is that true for the GRU?
> > 
> > I'm not sure I understand the question. The GRU never (currently) takes
> > a reference on a page. It has no mechanism for tracking pages that
> > were exported to the external TLBs.
> 
> If you don't have a pin, then things like invalidate_range in
> remap_file_pages can't be safe as writes through the external TLBs can
> keep going on pages in the freelist. For you to be safe w/o a
> page-pin, you need to return in the direction of invalidate_page
> inside ptep_clear_flush (or anyway before
> page_cache_release/__free_page/put_page...). You're generally not safe
> with any invalidate_range that may run after the page pointed by the
> pte has been freed (or can be freed by the VM anytime because of being
> unpinned cache).

Yuck

I see what you mean. I need to review to mail to see why this changed
but in the original discussions with Christoph, the invalidate_range
callouts were suppose to be made BEFORE the pages were put on the freelist.


--- jack

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

2008-01-29 Thread Jack Steiner
On Tue, Jan 29, 2008 at 04:20:50PM -0800, Christoph Lameter wrote:
> On Wed, 30 Jan 2008, Andrea Arcangeli wrote:
> 
> > > invalidate_range after populate allows access to memory for which ptes 
> > > were zapped and the refcount was released.
> > 
> > The last refcount is released by the invalidate_range itself.
> 
> That is true for your implementation and to address Robin's issues. Jack: 
> Is that true for the GRU?

I'm not sure I understand the question. The GRU never (currently) takes
a reference on a page. It has no mechanism for tracking pages that
were exported to the external TLBs.

--- jack

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel