Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Derek Murray

Hi Gerd,

Gerd Hoffmann wrote:
Want reproduce?  Here we go:


  * grab xenner 0.8 from http://dl.bytesex.org/releases/xenner/
  * grab a xenified dom0 kernel without blktap driver (either not
compiled or module not loaded).
  * start xend
  * start blkbackd from xenner package (you probably want the -d switch
for debug output, twice for more).
  * run xm block-attach 0 tap:aio:/path/to/some/file xvda r
  * watch it blow up ;)


Thanks for the repro details. I'll have a go at this later. One thing we 
haven't tested AFAIK is mapping grants in the same domain: could you 
check to see if the bug is the same if you attach a block device to a 
domain other than Dom0? Also, could you send any Xen console output, if 
it contains errors or warnings?



I can't help wondering if this is a hint that now is the time to find a
better API, which doesn't have the requirement (a) that seems to be
causing such trouble?  Are other PV guests --- *BSD, Solaris --- going
to have the same problems with their VM layers if they try to implement
this API?  Upstream Linux pv_ops certainly will, and it would be good if
we could avoid tying unprivileged guests to ABIs which cannot hope to be
merged into pv_ops.


And I fear the problems I've trapped into up to now is only the tip of
the iceberg.  What happens if an application with active grant table
mappings calls fork() ?


Ultimately, fork calls dup_mm, which calls, dup_mmap, which calls 
copy_{page,pud,pmd,pte}_range, which calls copy_one_pte, which calls 
set_pte_at, which hypercalls HYPERVISOR_update_va_mapping.


The hypercall will not succeed and will return an error code indicating 
the reason for this. Therefore the PTE will not be set. There appears to 
be no way to propagate this error through the Linux VM code, because 
there is no concept of a PTE update failing. I could add return codes to 
all those functions, but I don't fancy their chances upstream


A possibility for solving that might be to carry out the mappings upon a 
page fault: I believe this would be compatible with copy_page_range.


(In fact, it's possible that a forked process would attempt to 
demand-page in the granted page, bypassing the copy_page_range code. 
Since there is no nopage handler for a gntdev VMA, that would lead to an 
anonymous page being mapped into memory instead.)


So, as far as I can tell, there would be no kernel BUG() or 
domain_crash() in the event of a fork(). It looks like implementing 
nopage in gntdev would enable grants to be remapped after a fork() and 
the correct behaviour to happen.


Regards,

Derek.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Derek Murray

Gerd,

Can you try the attached patch against linux-2.6.18-xen.hg?

I think the problem was that the gntdev VMA is not marked as being 
VM_PFNMAP, therefore it tries to get a struct page_struct for each 
granted page when it is unmapped (and maybe sometimes succeeds 
(incorrectly), which could be why I haven't seen the bug). With this 
flag, vm_normal_page will return NULL in zap_pte_range, and so the code 
that decrements that reference count will not be executed.


Regards,

Derek.
# HG changeset patch
# User [EMAIL PROTECTED]
# Date 1196860382 0
# Node ID af26b3dd23822190acbec1872a47259e1fed88b8
# Parent  b2768401db943e66af9d64bd610ffa225f560c0b
Set gntdev VMA to be VM_PFNMAP.

diff -r b2768401db94 -r af26b3dd2382 drivers/xen/gntdev/gntdev.c
--- a/drivers/xen/gntdev/gntdev.c	Mon Dec 03 08:50:12 2007 +
+++ b/drivers/xen/gntdev/gntdev.c	Wed Dec 05 13:13:02 2007 +
@@ -501,6 +501,17 @@ static int gntdev_mmap (struct file *fli
 
 	/* The VM area contains pages from another VM. */
 	vma-vm_flags |= VM_FOREIGN;
+
+	/* The VM area contains pages that are not backed by page_structs in
+	 * this domain's memory map.
+	 *
+	 * TODO/FIXME?: We should probably use the VM_FOREIGN workaround as
+	 *  used by get_user_pages() to provide access to the
+	 *  page_structs for each page, but I'm not sure if that's
+	 *  necessary.
+	 */
+	vma-vm_flags |= VM_PFNMAP;
+
 	vma-vm_private_data = kzalloc(size * sizeof(struct page_struct *), 
    GFP_KERNEL);
 	if (vma-vm_private_data == NULL) {
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Derek Murray

Keir Fraser wrote:

Is this patch to go into linux-2.6.18-xen.hg then?


Yes, even if it doesn't fix the exact bug we're seeing here, I think it 
should go in. I've attached a version with my signed-off-by and a better 
commit comment.


Cheers,

Derek.
# HG changeset patch
# User [EMAIL PROTECTED]
# Date 1196860382 0
# Node ID af26b3dd23822190acbec1872a47259e1fed88b8
# Parent  b2768401db943e66af9d64bd610ffa225f560c0b
Add VM_PFNMAP flag to gntdev-mmaped VM areas. This prevents an attempt in
zap_pte_range to decrement the reverse-mapping count of the non-existant
(but occasionally spuriously present) page_struct associated with the
granted PFN.

Signed-off-by: Derek Murray [EMAIL PROTECTED]

diff -r b2768401db94 -r af26b3dd2382 drivers/xen/gntdev/gntdev.c
--- a/drivers/xen/gntdev/gntdev.c	Mon Dec 03 08:50:12 2007 +
+++ b/drivers/xen/gntdev/gntdev.c	Wed Dec 05 13:13:02 2007 +
@@ -501,6 +501,17 @@ static int gntdev_mmap (struct file *fli
 
 	/* The VM area contains pages from another VM. */
 	vma-vm_flags |= VM_FOREIGN;
+
+	/* The VM area contains pages that are not backed by page_structs in
+	 * this domain's memory map.
+	 *
+	 * TODO/FIXME?: We should probably use the VM_FOREIGN workaround as
+	 *  used by get_user_pages() to provide access to the
+	 *  page_structs for each page, but I'm not sure if that's
+	 *  necessary.
+	 */
+	vma-vm_flags |= VM_PFNMAP;
+
 	vma-vm_private_data = kzalloc(size * sizeof(struct page_struct *), 
    GFP_KERNEL);
 	if (vma-vm_private_data == NULL) {
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

[PATCH 4/9] remove references to cr8 register

2007-12-05 Thread Glauber de Oliveira Costa
As pointed out by Andi, linux never really uses this register
so saving and restoring is not really necessary. This patch
removes all references to it.

Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
---
 arch/x86/kernel/asm-offsets_64.c |1 -
 arch/x86/kernel/suspend_64.c |2 --
 include/asm-x86/suspend_64.h |2 +-
 include/asm-x86/system_64.h  |   12 
 4 files changed, 1 insertions(+), 16 deletions(-)

Index: linux-2.6-x86/arch/x86/kernel/asm-offsets_64.c
===
--- linux-2.6-x86.orig/arch/x86/kernel/asm-offsets_64.c
+++ linux-2.6-x86/arch/x86/kernel/asm-offsets_64.c
@@ -107,7 +107,6 @@ int main(void)
ENTRY(cr2);
ENTRY(cr3);
ENTRY(cr4);
-   ENTRY(cr8);
BLANK();
 #undef ENTRY
DEFINE(TSS_ist, offsetof(struct tss_struct, ist));
Index: linux-2.6-x86/arch/x86/kernel/suspend_64.c
===
--- linux-2.6-x86.orig/arch/x86/kernel/suspend_64.c
+++ linux-2.6-x86/arch/x86/kernel/suspend_64.c
@@ -53,7 +53,6 @@ void __save_processor_state(struct saved
ctxt-cr2 = read_cr2();
ctxt-cr3 = read_cr3();
ctxt-cr4 = read_cr4();
-   ctxt-cr8 = read_cr8();
 }
 
 void save_processor_state(void)
@@ -75,7 +74,6 @@ void __restore_processor_state(struct sa
 * control registers
 */
wrmsrl(MSR_EFER, ctxt-efer);
-   write_cr8(ctxt-cr8);
write_cr4(ctxt-cr4);
write_cr3(ctxt-cr3);
write_cr2(ctxt-cr2);
Index: linux-2.6-x86/include/asm-x86/suspend_64.h
===
--- linux-2.6-x86.orig/include/asm-x86/suspend_64.h
+++ linux-2.6-x86/include/asm-x86/suspend_64.h
@@ -20,7 +20,7 @@ struct saved_context {
struct pt_regs regs;
u16 ds, es, fs, gs, ss;
unsigned long gs_base, gs_kernel_base, fs_base;
-   unsigned long cr0, cr2, cr3, cr4, cr8;
+   unsigned long cr0, cr2, cr3, cr4;
unsigned long efer;
u16 gdt_pad;
u16 gdt_limit;
Index: linux-2.6-x86/include/asm-x86/system_64.h
===
--- linux-2.6-x86.orig/include/asm-x86/system_64.h
+++ linux-2.6-x86/include/asm-x86/system_64.h
@@ -95,18 +95,6 @@ static inline void write_cr4(unsigned lo
asm volatile(movq %0,%%cr4 :: r (val) : memory);
 }
 
-static inline unsigned long read_cr8(void)
-{
-   unsigned long cr8;
-   asm volatile(movq %%cr8,%0 : =r (cr8));
-   return cr8;
-}
-
-static inline void write_cr8(unsigned long val)
-{
-   asm volatile(movq %0,%%cr8 :: r (val) : memory);
-}
-
 #define stts() write_cr0(8 | read_cr0())
 
 #define wbinvd() \
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Derek Murray

Keir Fraser wrote:

Yes, this would work okay I suspect. Good enough as a stop-gap measure? Are
there any other responsibilities that you acquire if you make use of
VM_FOREIGN (in particular, how would this affect get_user_pages)?


VM_FOREIGN is already set for the gntdev VMA (mostly because it's 
directly based on the blktap code). That means that it has the array of 
page_structs in its vm_private_data, which can be used to fulfill a 
get_user_pages call. I've attached a patch based on this fix.


Regards,

Derek.
# HG changeset patch
# User [EMAIL PROTECTED]
# Date 1196878124 0
# Node ID df7d0555ec3847bd5915063d8ee79123d6ebc67a
# Parent  ba918cb2cf7520604dee724dd80dad5ce4bee8a1
Changed vm_normal_page to return NULL when presented with a VMA marked
as being VM_FOREIGN.

Signed-off-by: Derek Murray [EMAIL PROTECTED]

diff -r ba918cb2cf75 -r df7d0555ec38 mm/memory.c
--- a/mm/memory.c	Tue Dec 04 11:54:22 2007 +
+++ b/mm/memory.c	Wed Dec 05 18:08:44 2007 +
@@ -395,6 +395,9 @@ struct page *vm_normal_page(struct vm_ar
 		if (!is_cow_mapping(vma-vm_flags))
 			return NULL;
 	}
+
+	if (unlikely(vma-vm_flags  VM_FOREIGN))
+		return NULL;
 
 	/*
 	 * Add some anal sanity checks for now. Eventually,
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Derek Murray

Jeremy Fitzhardinge wrote:

Could we use one of the software-defined bits in the PTE to indicate
that this is a foreign/granted PTE, and have set_pte_at behave
differently if you pass it a pte with this bit set?


Actually, as Gerd pointed out in his answer to his own question, the use 
of VM_DONTCOPY cuts out this entire code path, so we don't need to worry 
about it.


Mind you, it looks like we're going to go ahead and use one of the PTE 
bits to signify foreign PTEs anyway, per Keir's suggestion. Either way, 
it's going to involve making Xen-specific changes to the mm code... have 
you any ideas how we can either (i) get rid of the zap_pte hook in the 
vm_operations_struct, or (ii) make a really compelling case to the 
kernel maintainers that it really should get in?


Regards,

Derek.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Derek Murray

Keir Fraser wrote:


Actually I'm not so sure now. Presumably you add VM_PFNMAP to make
vm_normal_page() return NULL? But actually I would expect pte_pfn() to
return max_mapnr because the mapped page is not a local page. And that
should cause vm_normal_page() to return NULL always, regardless of whether
you assert VM_PFNMAP. Is gntdev being used to grant-and-map local pages in
the test that causes the crash?


That's right (gntdev is being used to map (but not grant) a local page). 
The test case creates a virtual block device in Dom0, and attempts to 
map its ring buffer in a user-space daemon in Dom0. Therefore pte_pfn 
succeeds.


Regards,

Derek.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Derek Murray

Keir Fraser wrote:

Need to bite the bullet and fix this properly by setting a software flag in
ptes that are not subject to reference counting.


Could we get away with testing the VM_FOREIGN flag in vm_normal_page()? 
Although I get the impression that this wouldn't be easily justified if 
trying to merge with upstream Linux



Unfortunately that also needs a hypervisor interface change, to allow
setting of those pte flags. Easily done though, and we should definitely get
that piece in for 3.2.0.


Alternatively, could we use the _PAGE_GNTTAB PTE flag that is used for 
debugging? Indeed, if we did this, could be obviate the need for the 
PTE-zapping hook, by instead catching the case where this flag is set, 
and unmapping the grant implicitly?


Otherwise, what would the semantics of this new flag be?

Regards,

Derek.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Jeremy Fitzhardinge
Derek Murray wrote:
 Ultimately, fork calls dup_mm, which calls, dup_mmap, which calls
 copy_{page,pud,pmd,pte}_range, which calls copy_one_pte, which calls
 set_pte_at, which hypercalls HYPERVISOR_update_va_mapping.

 The hypercall will not succeed and will return an error code
 indicating the reason for this. Therefore the PTE will not be set.
 There appears to be no way to propagate this error through the Linux
 VM code, because there is no concept of a PTE update failing. I could
 add return codes to all those functions, but I don't fancy their
 chances upstream

Could we use one of the software-defined bits in the PTE to indicate
that this is a foreign/granted PTE, and have set_pte_at behave
differently if you pass it a pte with this bit set?

J
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Jeremy Fitzhardinge
Derek Murray wrote:
 Jeremy Fitzhardinge wrote:
 Could we use one of the software-defined bits in the PTE to indicate
 that this is a foreign/granted PTE, and have set_pte_at behave
 differently if you pass it a pte with this bit set?

 Actually, as Gerd pointed out in his answer to his own question, the
 use of VM_DONTCOPY cuts out this entire code path, so we don't need to
 worry about it.

 Mind you, it looks like we're going to go ahead and use one of the PTE
 bits to signify foreign PTEs anyway, per Keir's suggestion. Either
 way, it's going to involve making Xen-specific changes to the mm code... 

Sneaking in a user for the otherwise completely unused PTE bits should
be fairly straightforward.

 have you any ideas how we can either (i) get rid of the zap_pte hook
 in the vm_operations_struct, or (ii) make a really compelling case to
 the kernel maintainers that it really should get in? 

Hm, I haven't spent much time looking at how grant tables and their
mappings work yet, so I can't say I really understand all this myself. 
Hence, questions:

Can we take a different approach from the zap_pte hook?  Given that
we're 1) planning on claiming a pte bit for grant mappings, and 2) need
to hook ptep_get_and_clear anyway to solve the mprotect performance
problems, couldn't we just special-case grant mapping pte_clears?

In 2.6.18-xen the only two implementations of zap_pte are
blktap_clear_pte and gntdev_clear_pte.  Given a ptep with the
grant-mapping bit set, could we determine which of these need calling
and do the appropriate thing?  Do we even need separate implementations
of the core pte-clearing functionality?  Could we just say something like:

if (pte  _PAGE_XEN_FOREIGN)
HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, ...);
else
xen_set_pte_at(...);


blktap_clear_pte and gntdev_clear_pte do other housekeeping, but do they
have to be done at the same instant as the grant mapping clear?  Could
they be done via some other hook?

(I see Gerd just proposed this, pretty much.)

J
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Keir Fraser
On 5/12/07 17:17, Derek Murray [EMAIL PROTECTED] wrote:

 Actually I'm not so sure now. Presumably you add VM_PFNMAP to make
 vm_normal_page() return NULL? But actually I would expect pte_pfn() to
 return max_mapnr because the mapped page is not a local page. And that
 should cause vm_normal_page() to return NULL always, regardless of whether
 you assert VM_PFNMAP. Is gntdev being used to grant-and-map local pages in
 the test that causes the crash?
 
 That's right (gntdev is being used to map (but not grant) a local page).
 The test case creates a virtual block device in Dom0, and attempts to
 map its ring buffer in a user-space daemon in Dom0. Therefore pte_pfn
 succeeds.

Need to bite the bullet and fix this properly by setting a software flag in
ptes that are not subject to reference counting.

Unfortunately that also needs a hypervisor interface change, to allow
setting of those pte flags. Easily done though, and we should definitely get
that piece in for 3.2.0.

Setting VM_PFNMAP is bogus. We used to do that for privcmd mappings too, but
we stopped because IIRC it had other unwanted side effects.

 -- Keir


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 3/9] unify load_segment macro

2007-12-05 Thread Glauber de Oliveira Costa
This patch unifies the load_segment() macro, making them equal in both
x86_64 and i386 architectures. The common version goes to system.h,
and the old are deleted.

Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
---
 include/asm-x86/system.h|   21 +
 include/asm-x86/system_32.h |   22 --
 include/asm-x86/system_64.h |   20 
 3 files changed, 21 insertions(+), 42 deletions(-)

diff --git a/include/asm-x86/system.h b/include/asm-x86/system.h
index 6e9491d..1ac6088 100644
--- a/include/asm-x86/system.h
+++ b/include/asm-x86/system.h
@@ -39,6 +39,27 @@ __asm__ __volatile__ (movw %%dx,%1\n\t \
 #define set_limit(ldt, limit) _set_limit(((char *)(ldt)) , ((limit)-1))
 
 /*
+ * Load a segment. Fall back on loading the zero
+ * segment if something goes wrong..
+ */
+#define loadsegment(seg, value)\
+   asm volatile(\n   \
+   1:\t  \
+   movl %k0,%% #seg \n \
+   2:\n  \
+   .section .fixup,\ax\\n  \
+   3:\t  \
+   movl %k1, %% #seg \n\t  \
+   jmp 2b\n  \
+   .previous\n   \
+   .section __ex_table,\a\\n\t \
+   _ASM_ALIGN \n\t   \
+   _ASM_PTR  1b,3b\n \
+   .previous \
+   : :r (value), r (0))
+
+
+/*
  * Save a segment register away
  */
 #define savesegment(seg, value) \
diff --git a/include/asm-x86/system_32.h b/include/asm-x86/system_32.h
index 717aeb9..a0641a3 100644
--- a/include/asm-x86/system_32.h
+++ b/include/asm-x86/system_32.h
@@ -34,28 +34,6 @@ extern struct task_struct * FASTCALL(__switch_to(struct 
task_struct *prev, struc
  2 (prev), d (next));  \
 } while (0)
 
-/*
- * Load a segment. Fall back on loading the zero
- * segment if something goes wrong..
- */
-#define loadsegment(seg,value) \
-   asm volatile(\n   \
-   1:\t  \
-   mov %0,%% #seg \n   \
-   2:\n  \
-   .section .fixup,\ax\\n  \
-   3:\t  \
-   pushl $0\n\t  \
-   popl %% #seg \n\t   \
-   jmp 2b\n  \
-   .previous\n   \
-   .section __ex_table,\a\\n\t \
-   .align 4\n\t  \
-   .long 1b,3b\n \
-   .previous \
-   : :rm (value))
-
-
 static inline void native_clts(void)
 {
asm volatile (clts);
diff --git a/include/asm-x86/system_64.h b/include/asm-x86/system_64.h
index f340060..da46059 100644
--- a/include/asm-x86/system_64.h
+++ b/include/asm-x86/system_64.h
@@ -43,26 +43,6 @@
 extern void load_gs_index(unsigned); 
 
 /*
- * Load a segment. Fall back on loading the zero
- * segment if something goes wrong..
- */
-#define loadsegment(seg,value) \
-   asm volatile(\n   \
-   1:\t  \
-   movl %k0,%% #seg \n \
-   2:\n  \
-   .section .fixup,\ax\\n  \
-   3:\t  \
-   movl %1,%% #seg \n\t\
-   jmp 2b\n  \
-   .previous\n   \
-   .section __ex_table,\a\\n\t \
-   .align 8\n\t  \
-   .quad 1b,3b\n \
-   .previous \
-   : :r (value), r (0))
-
-/*
  * Clear and set 'TS' bit respectively
  */
 #define clts() __asm__ __volatile__ (clts)
-- 
1.4.4.2

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 0/9 - v2] Integrate system.h

2007-12-05 Thread Ingo Molnar

* Glauber de Oliveira Costa [EMAIL PROTECTED] wrote:

 At Ingo's request, here it goes a new patchset, that actually applies 
 ontop of the x86 tree (mm branch). Besides this issue, I've also 
 included a patch that remove the cr8 references, as Andi suggested.

thanks - i've picked them up.

Ingo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Keir Fraser
On 5/12/07 20:15, Jeremy Fitzhardinge [EMAIL PROTECTED] wrote:

 In 2.6.18-xen the only two implementations of zap_pte are
 blktap_clear_pte and gntdev_clear_pte.  Given a ptep with the
 grant-mapping bit set, could we determine which of these need calling
 and do the appropriate thing?  Do we even need separate implementations
 of the core pte-clearing functionality?  Could we just say something like:
 
 if (pte  _PAGE_XEN_FOREIGN)
 HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, ...);
 else
 xen_set_pte_at(...);

You'd need to track pte-grant_handle mappings somewhere, but it could
certainly be done this way, yes.

 -- Keir


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 7/9] unify smp parts of system.h

2007-12-05 Thread Glauber de Oliveira Costa
The memory barrier parts of system.h are not very different between
i386 and x86_64, the main difference being the availability of
instructions, which we handle with the use of ifdefs.

They are consolidated in system.h file, and then removed from
the arch-specific headers.

Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
---
 include/asm-x86/system.h|  105 +++
 include/asm-x86/system_32.h |   99 
 include/asm-x86/system_64.h |   25 --
 3 files changed, 105 insertions(+), 124 deletions(-)

Index: linux-2.6-x86/include/asm-x86/system.h
===
--- linux-2.6-x86.orig/include/asm-x86/system.h
+++ linux-2.6-x86/include/asm-x86/system.h
@@ -202,4 +202,109 @@ extern void free_init_pages(char *what, 
 
 void default_idle(void);
 
+/*
+ * Force strict CPU ordering.
+ * And yes, this is required on UP too when we're talking
+ * to devices.
+ */
+#ifdef CONFIG_X86_32
+/*
+ * For now, wmb() doesn't actually do anything, as all
+ * Intel CPU's follow what Intel calls a *Processor Order*,
+ * in which all writes are seen in the program order even
+ * outside the CPU.
+ *
+ * I expect future Intel CPU's to have a weaker ordering,
+ * but I'd also expect them to finally get their act together
+ * and add some real memory barriers if so.
+ *
+ * Some non intel clones support out of order store. wmb() ceases to be a
+ * nop for these.
+ */
+#define mb() alternative(lock; addl $0,0(%%esp), mfence, X86_FEATURE_XMM2)
+#define rmb() alternative(lock; addl $0,0(%%esp), lfence, X86_FEATURE_XMM2)
+#define wmb() alternative(lock; addl $0,0(%%esp), sfence, X86_FEATURE_XMM)
+#else
+#define mb()   asm volatile(mfence:::memory)
+#define rmb()  asm volatile(lfence:::memory)
+#define wmb()  asm volatile(sfence ::: memory)
+#endif
+
+/**
+ * read_barrier_depends - Flush all pending reads that subsequents reads
+ * depend on.
+ *
+ * No data-dependent reads from memory-like regions are ever reordered
+ * over this barrier.  All reads preceding this primitive are guaranteed
+ * to access memory (but not necessarily other CPUs' caches) before any
+ * reads following this primitive that depend on the data return by
+ * any of the preceding reads.  This primitive is much lighter weight than
+ * rmb() on most CPUs, and is never heavier weight than is
+ * rmb().
+ *
+ * These ordering constraints are respected by both the local CPU
+ * and the compiler.
+ *
+ * Ordering is not guaranteed by anything other than these primitives,
+ * not even by data dependencies.  See the documentation for
+ * memory_barrier() for examples and URLs to more information.
+ *
+ * For example, the following code would force ordering (the initial
+ * value of a is zero, b is one, and p is a):
+ *
+ * programlisting
+ * CPU 0   CPU 1
+ *
+ * b = 2;
+ * memory_barrier();
+ * p = b; q = p;
+ * read_barrier_depends();
+ * d = *q;
+ * /programlisting
+ *
+ * because the read of *q depends on the read of p and these
+ * two reads are separated by a read_barrier_depends().  However,
+ * the following code, with the same initial values for a and b:
+ *
+ * programlisting
+ * CPU 0   CPU 1
+ *
+ * a = 2;
+ * memory_barrier();
+ * b = 3;  y = b;
+ * read_barrier_depends();
+ * x = a;
+ * /programlisting
+ *
+ * does not enforce ordering, since there is no data dependency between
+ * the read of a and the read of b.  Therefore, on some CPUs, such
+ * as Alpha, y could be set to 3 and x to 0.  Use rmb()
+ * in cases like this where there are no data dependencies.
+ **/
+
+#define read_barrier_depends() do { } while (0)
+
+#ifdef CONFIG_SMP
+#define smp_mb()   mb()
+#ifdef CONFIG_X86_PPRO_FENCE
+# define smp_rmb() rmb()
+#else
+# define smp_rmb() barrier()
+#endif
+#ifdef CONFIG_X86_OOSTORE
+# define smp_wmb() wmb()
+#else
+# define smp_wmb() barrier()
+#endif
+#define smp_read_barrier_depends() read_barrier_depends()
+#define set_mb(var, value) do { (void) xchg(var, value); } while (0)
+#else
+#define smp_mb()   barrier()
+#define smp_rmb()  barrier()
+#define smp_wmb()  barrier()
+#define smp_read_barrier_depends() do { } while (0)
+#define set_mb(var, value) do { var = value; barrier(); } while (0)
+#endif
+
+
 #endif
Index: linux-2.6-x86/include/asm-x86/system_32.h
===
--- linux-2.6-x86.orig/include/asm-x86/system_32.h
+++ linux-2.6-x86/include/asm-x86/system_32.h
@@ -36,105 +36,6 @@ extern struct task_struct * FASTCALL(__s
 #endif /* __KERNEL__ */
 
 
-/*
- * Force strict CPU ordering.
- * And yes, this is required on UP too when we're talking
- 

Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Gerd Hoffmann
 Alternatively, could we use the _PAGE_GNTTAB PTE flag that is used for
 debugging? Indeed, if we did this, could be obviate the need for the
 PTE-zapping hook, by instead catching the case where this flag is set,
 and unmapping the grant implicitly?
 
 Well, in the general case you don't have enough info to know which grant to
 release (a single page can be granted multiple times).

You'll also get the mm and the addr which should make it sufficiently
unique, so this looks like a doable approach to me.

ptep_get_and_clear_full() in include/asm-x86/pgtable_32.h needs to be
changed take care, but that should be possible to do and the change is
local to x86 paravirt_ops, which looks much better to me than touching
generic mm code.

cheers,
  Gerd


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Keir Fraser
On 5/12/07 14:30, Derek Murray [EMAIL PROTECTED] wrote:

 Keir Fraser wrote:
 Is this patch to go into linux-2.6.18-xen.hg then?
 
 Yes, even if it doesn't fix the exact bug we're seeing here, I think it
 should go in. I've attached a version with my signed-off-by and a better
 commit comment.

Actually I'm not so sure now. Presumably you add VM_PFNMAP to make
vm_normal_page() return NULL? But actually I would expect pte_pfn() to
return max_mapnr because the mapped page is not a local page. And that
should cause vm_normal_page() to return NULL always, regardless of whether
you assert VM_PFNMAP. Is gntdev being used to grant-and-map local pages in
the test that causes the crash?

 -- Keir


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Derek Murray

Stephen C. Tweedie wrote:

So... the interface (a) cannot be used on the Linux VM without at least
one invasive VM modification, due to the requirement of ptes being
explicitly unmapped via hypercall;


Also there is the use of VM_FOREIGN 
(http://xenbits.xensource.com/linux-2.6.18-xen.hg?file/b2768401db94/mm/memory.c 
lines 1040--1059), which has been used quite happily in blktap since 
2005 
(http://lists.xensource.com/archives/html/xen-changelog/2005-07/msg00053.html). 
While it may not be a priority to get gntdev into pv-ops Linux, I should 
imagine that blktap would be fairly critical.



I can't help wondering if this is a hint that now is the time to find a
better API, which doesn't have the requirement (a) that seems to be
causing such trouble?  Are other PV guests --- *BSD, Solaris --- going
to have the same problems with their VM layers if they try to implement
this API?  Upstream Linux pv_ops certainly will, and it would be good if
we could avoid tying unprivileged guests to ABIs which cannot hope to be
merged into pv_ops.


I'm open to suggestions... but I think it always reduces to needing a 
hook that is called on process exit before the PTEs are zapped.



(Just what is the cost of not having this functionality in blktap,
anyway?)


If tapdisk dies whilst holding a granted page, the page can never be 
ungranted, so we leak that page.


Regards,

Derek.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] Re: Next steps with pv_ops for Xen

2007-12-05 Thread Gerd Hoffmann
  Hi,

 gntdev doesn't even try to handle forking.  I wouldn't be surprised if
 that is a great way to kill Domain-0.  The xen hypervisor will most
 likely not be amused to find a pte refering to a granted (but foreign)
 page which wasn't established using the grant table interface.  Pinning
 the pgd of the child process will most likely fail and make the kernel
 BUG().

Ok, isn't that bad thanks to the VM_DONTCOPY.  The child just doesn't
get the grant mapping.

cheers,
  Gerd

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization